Adding GPU Nodes to a k8s Cluster

Contents

  • 1. Install nvidia-container-runtime (before joining k8s)
  • 2. Other plugins
    • 2.1 nvidia-fabricmanager
    • 2.2 datacenter-gpu-manager.tar.gz
  • 3. Join the node to the K8S cluster
  • 4. Install the NVIDIA plugin

1. Install nvidia-container-runtime (before joining k8s)

1.1 Installation

  • These are the packages I used (my system is Ubuntu):
shell
root@liubei:~/nvidia-container-runtime# ll
total 7368
drwxr-xr-x 2 root root    4096 Nov 14 09:28 ./
drwx------ 8 root root    4096 Nov 14 13:11 ../
-rw-r--r-- 1 root root   20068 Jun 29  2023 gcc-12-base_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r--r-- 1 root root 3235218 May 29 16:10 libc6_2.35-0ubuntu3.8_amd64.deb
-rw-r--r-- 1 root root   18334 Jun 14  2023 libcap2_1%3a2.44-1ubuntu0.22.04.1_amd64.deb
-rw-r--r-- 1 root root   82004 Dec 18  2021 libcrypt1_1%3a4.4.27-1_amd64.deb
-rw-r--r-- 1 root root   53888 Jun 29  2023 libgcc-s1_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r--r-- 1 root root  928036 Nov  9 02:43 libnvidia-container1_1.13.5-1_amd64.deb
-rw-r--r-- 1 root root   24908 Nov  9 02:43 libnvidia-container-tools_1.13.5-1_amd64.deb
-rw-r--r-- 1 root root   47424 Mar 17  2022 libseccomp2_2.5.3-2ubuntu2_amd64.deb
-rw-r--r-- 1 root root    4988 Nov  9 02:43 nvidia-container-runtime_3.13.0-1_all.deb
-rw-r--r-- 1 root root  853436 Nov  9 02:43 nvidia-container-toolkit_1.13.5-1_amd64.deb
-rw-r--r-- 1 root root 2244092 Nov  9 02:43 nvidia-container-toolkit-base_1.13.5-1_amd64.deb
  • Install:
shell
dpkg -i *.deb
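
Because `dpkg -i *.deb` installs whatever happens to sit in the directory, it can be worth confirming afterwards that the NVIDIA packages really ended up installed. The following is only a minimal sketch and not part of the original procedure: `missing_packages` is a hypothetical helper name, and the package names in the usage comment are taken from the listing above.

```shell
# Hypothetical helper: reads installed package names from stdin (one per line)
# and prints every required package (passed as arguments) that is not in that list.
missing_packages() {
    installed="$(cat)"
    for pkg in "$@"; do
        # -x: whole-line match, -F: treat the package name as a literal string
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# On the node, feed it the packages that dpkg reports as fully installed ("ii"):
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' | awk '$1 == "ii" {print $2}' \
#     | missing_packages libnvidia-container1 libnvidia-container-tools \
#         nvidia-container-toolkit-base nvidia-container-toolkit nvidia-container-runtime
```

If nothing is printed, every package named on the command line is installed.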

1.2 Usage

Using it with the Docker service

  • Edit /etc/docker/daemon.json so that it contains the following:
json
{
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
    "default-runtime":"nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  • Check the result:
shell
root@liubei:/# docker info|grep nvidia
 Runtimes: runc io.containerd.runc.v2 nvidia
 Default Runtime: nvidia
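
The `docker info` output above assumes the Docker daemon was restarted after daemon.json was edited (for example with `systemctl restart docker`). A syntax error in that file can stop the daemon from starting, so a small check before restarting may help. This is a sketch under assumptions: `daemon_json_ok` is a hypothetical helper, and it relies on `python3` being installed on the node.

```shell
# Hypothetical helper: succeeds only if FILE parses as JSON and sets
# "default-runtime" to "nvidia". Assumes python3 is available.
daemon_json_ok() {
    # json.tool exits non-zero when the file is not valid JSON
    python3 -m json.tool "$1" >/dev/null 2>&1 || return 1
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# On the node:
#   daemon_json_ok /etc/docker/daemon.json && systemctl restart docker
```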

Using it with the containerd service

In /etc/containerd/config.toml, set:

toml
BinaryName = "nvidia-container-runtime"

To make the setting easier to locate, here is the whole section:

toml
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = "nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

Also note SystemdCgroup = true: if you did not change it when you first installed containerd, make sure to change it here.
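
Changes to config.toml take effect once containerd has been restarted. The sketch below checks the two settings discussed above before restarting; `containerd_runtime_ok` is a hypothetical helper, and its patterns assume each setting sits on its own line, as in the excerpt.

```shell
# Hypothetical helper: succeeds only if FILE contains both
#   BinaryName = "nvidia-container-runtime"   and   SystemdCgroup = true
containerd_runtime_ok() {
    grep -Eq '^[[:space:]]*BinaryName[[:space:]]*=[[:space:]]*"nvidia-container-runtime"' "$1" &&
        grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"
}

# On the node:
#   containerd_runtime_ok /etc/containerd/config.toml && systemctl restart containerd
```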

  • Check the result:
shell
root@cto-gpu-pro-n01:/opt/cni/bin# crictl info |grep nvidia
            "BinaryName": "nvidia-container-runtime",

2. Other plugins

2.1 nvidia-fabricmanager

  • The packages I cached locally:
shell
root@liubei:nvidia-fabricmanager-dev-565 # ll
total 9552
-rw-r----- 1 root root   20068 Dec  5 10:13 gcc-12-base_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r----- 1 root root 3235218 Dec  5 10:13 libc6_2.35-0ubuntu3.8_amd64.deb
-rw-r----- 1 root root   82004 Dec  5 10:13 libcrypt1_1%3a4.4.27-1_amd64.deb
-rw-r----- 1 root root   53888 Dec  5 10:13 libgcc-s1_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r----- 1 root root 5793340 Dec  5 10:13 nvidia-fabricmanager-565_565.57.01-1_amd64.deb
-rw-r----- 1 root root  584610 Dec  5 10:13 nvidia-fabricmanager-dev-565_565.57.01-1_amd64.deb
  • Install:
shell
dpkg -i *.deb
  • Once installation finished, the service had started automatically:
shell
root@boe:~# systemctl status  nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2024-11-25 00:48:32 UTC; 1 month 0 days ago
   Main PID: 1469962 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 5.2M
        CPU: 32min 31.547s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─1469962 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Nov 25 00:48:30 boe systemd[1]: Starting NVIDIA fabric manager service...
Nov 25 00:48:31 boe nv-fabricmanager[1469962]: Connected to 1 node.
Nov 25 00:48:32 boe nv-fabricmanager[1469962]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are success>
Nov 25 00:48:32 boe systemd[1]: Started NVIDIA fabric manager service.
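
Fabric Manager is tied to the GPU driver: NVIDIA documents that its version has to match the installed driver version, which is why the package name carries the driver branch (565 here). A minimal sketch of that comparison follows. `fm_matches_driver` is a hypothetical helper, and the package name in the usage comment is the one from the listing above.

```shell
# Hypothetical helper: succeeds when the package version (e.g. 565.57.01-1)
# equals the driver version (e.g. 565.57.01) once the Debian revision is dropped.
fm_matches_driver() {
    driver_version="$1"
    package_version="${2%-*}"   # strip the trailing "-<revision>"
    [ "$driver_version" = "$package_version" ]
}

# On the node:
#   fm_matches_driver \
#     "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" \
#     "$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-565)" \
#     || echo "fabric manager and driver versions differ"
```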

2.2 datacenter-gpu-manager.tar.gz

  • The packages I cached locally:
shell
root@liubei:datacenter-gpu-manager # ll
total 893172
-rw-r--r-- 1 root root 911205922 Nov 14 09:01 datacenter-gpu-manager_1%3a3.3.9_amd64.deb
-rw-r--r-- 1 root root     20068 Jun 29  2023 gcc-12-base_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r--r-- 1 root root   3235218 May 30  2024 libc6_2.35-0ubuntu3.8_amd64.deb
-rw-r--r-- 1 root root     82004 Dec 18  2021 libcrypt1_1%3a4.4.27-1_amd64.deb
-rw-r--r-- 1 root root     53888 Jun 29  2023 libgcc-s1_12.3.0-1ubuntu1~22.04_amd64.deb
  • Install:
shell
dpkg -i *.deb
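
Installing the package does not by itself show that DCGM can see the GPUs. As a sketch, assuming the package provides the `nvidia-dcgm` systemd unit and the `dcgmi` command-line tool (as NVIDIA's DCGM 3.x packages normally do), the host engine can be enabled and the GPUs listed as follows. `enable_dcgm` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: enable and start the DCGM host engine, then list
# the GPUs (and NVSwitches) that DCGM discovers on this node.
enable_dcgm() {
    systemctl --now enable nvidia-dcgm || return 1
    dcgmi discovery -l
}

# On the node (as root):
#   enable_dcgm
```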

3. Join the node to the K8S cluster

The procedure is the same as for an ordinary CPU node.

4. Install the NVIDIA plugin

  • Create a device-plugin.yml file with the following content:
yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: harbocto.xxx.com.cn/bdteam/k8s-device-plugin:1.11
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
      dnsPolicy: ClusterFirst
      nodeSelector:
        gpu: gpu # the plugin only starts on nodes that carry this label
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
  • Start the DaemonSet:
shell
kubectl create -f device-plugin.yml
  • Label the GPU node:
shell
kubectl label nodes <node-name> gpu=gpu
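
Once the node carries the label, the DaemonSet should schedule a plugin pod on it, and the node should begin advertising `nvidia.com/gpu`, the resource name the NVIDIA device plugin normally registers. A sketch for checking this follows. `gpu_allocatable` is a hypothetical helper and `<node-name>` is a placeholder.

```shell
# Hypothetical helper: prints how many nvidia.com/gpu resources the given node
# reports as allocatable (prints nothing if the plugin has not registered yet).
gpu_allocatable() {
    # the dot inside the resource name must be escaped in kubectl's jsonpath
    kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# From a machine with cluster access:
#   kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset
#   gpu_allocatable <node-name>
```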
