Table of Contents
- [1. Install nvidia-container-runtime (before joining k8s)](#1-install-nvidia-container-runtime-before-joining-k8s)
  - [1.1 Installation](#11-installation)
  - [1.2 Usage](#12-usage)
- [2. Other plugins](#2-other-plugins)
  - [2.1 nvidia-fabricmanager](#21-nvidia-fabricmanager)
  - [2.2 datacenter-gpu-manager.tar.gz](#22-datacenter-gpu-managertargz)
- [3. Join the node to the K8S cluster](#3-join-the-node-to-the-k8s-cluster)
- [4. Install the NVIDIA device plugin](#4-install-the-nvidia-device-plugin)
## 1. Install nvidia-container-runtime (before joining k8s)

### 1.1 Installation

- These are the packages I cached locally on my Ubuntu system:

```shell
root@liubei:~/nvidia-container-runtime# ll
total 7368
drwxr-xr-x 2 root root 4096 Nov 14 09:28 ./
drwx------ 8 root root 4096 Nov 14 13:11 ../
-rw-r--r-- 1 root root 20068 Jun 29 2023 gcc-12-base_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r--r-- 1 root root 3235218 May 29 16:10 libc6_2.35-0ubuntu3.8_amd64.deb
-rw-r--r-- 1 root root 18334 Jun 14 2023 libcap2_1%3a2.44-1ubuntu0.22.04.1_amd64.deb
-rw-r--r-- 1 root root 82004 Dec 18 2021 libcrypt1_1%3a4.4.27-1_amd64.deb
-rw-r--r-- 1 root root 53888 Jun 29 2023 libgcc-s1_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r--r-- 1 root root 928036 Nov 9 02:43 libnvidia-container1_1.13.5-1_amd64.deb
-rw-r--r-- 1 root root 24908 Nov 9 02:43 libnvidia-container-tools_1.13.5-1_amd64.deb
-rw-r--r-- 1 root root 47424 Mar 17 2022 libseccomp2_2.5.3-2ubuntu2_amd64.deb
-rw-r--r-- 1 root root 4988 Nov 9 02:43 nvidia-container-runtime_3.13.0-1_all.deb
-rw-r--r-- 1 root root 853436 Nov 9 02:43 nvidia-container-toolkit_1.13.5-1_amd64.deb
-rw-r--r-- 1 root root 2244092 Nov 9 02:43 nvidia-container-toolkit-base_1.13.5-1_amd64.deb
```

- Installation:

```shell
dpkg -i *.deb
```
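A quick sanity check after installation (a minimal sketch; exact output varies with your package versions):

```shell
# The runtime shim should now be on PATH
which nvidia-container-runtime
nvidia-container-runtime --version

# libnvidia-container-tools ships a low-level CLI that can query the driver
nvidia-container-cli info
```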
### 1.2 Usage

Using it with the docker service

- Edit /etc/docker/daemon.json so its content looks like this:

```json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
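docker must be restarted before the new default runtime takes effect:

```shell
systemctl restart docker
```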
- Check the result:

```shell
root@liubei:/# docker info|grep nvidia
Runtimes: runc io.containerd.runc.v2 nvidia
Default Runtime: nvidia
```
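With nvidia as the default runtime, a throwaway container should see the GPUs (a sketch; the CUDA image tag is only an example, pick one matching your driver version):

```shell
# nvidia-smi inside the container should list the host GPUs
docker run --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```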
Using it with the containerd service

Edit /etc/containerd/config.toml, changing:

```toml
BinaryName = "nvidia-container-runtime"
```
To make it easier to locate, here is the whole section:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  BinaryName = "nvidia-container-runtime"
  CriuImagePath = ""
  CriuPath = ""
  CriuWorkPath = ""
  IoGid = 0
  IoUid = 0
  NoNewKeyring = false
  NoPivotRoot = false
  Root = ""
  ShimCgroup = ""
  SystemdCgroup = true
```
Also note SystemdCgroup = true: if you did not change this when you first installed containerd, be sure to correct it here.
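Just as with docker, restart containerd after editing the config:

```shell
systemctl restart containerd
```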
- Check the result:

```shell
root@cto-gpu-pro-n01:/opt/cni/bin# crictl info |grep nvidia
"BinaryName": "nvidia-container-runtime",
## 2. Other plugins

### 2.1 nvidia-fabricmanager

- The packages I cached locally:

```shell
root@liubei:nvidia-fabricmanager-dev-565 # ll
total 9552
-rw-r----- 1 root root   20068 Dec  5 10:13 gcc-12-base_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r----- 1 root root 3235218 Dec  5 10:13 libc6_2.35-0ubuntu3.8_amd64.deb
-rw-r----- 1 root root   82004 Dec  5 10:13 libcrypt1_1%3a4.4.27-1_amd64.deb
-rw-r----- 1 root root   53888 Dec  5 10:13 libgcc-s1_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r----- 1 root root 5793340 Dec  5 10:13 nvidia-fabricmanager-565_565.57.01-1_amd64.deb
-rw-r----- 1 root root  584610 Dec  5 10:13 nvidia-fabricmanager-dev-565_565.57.01-1_amd64.deb
```

- Installation:

```shell
dpkg -i *.deb
```
- The service starts automatically once installation finishes:

```shell
root@boe:~# systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2024-11-25 00:48:32 UTC; 1 month 0 days ago
Main PID: 1469962 (nv-fabricmanage)
Tasks: 18 (limit: 629145)
Memory: 5.2M
CPU: 32min 31.547s
CGroup: /system.slice/nvidia-fabricmanager.service
└─1469962 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Nov 25 00:48:30 boe systemd[1]: Starting NVIDIA fabric manager service...
Nov 25 00:48:31 boe nv-fabricmanager[1469962]: Connected to 1 node.
Nov 25 00:48:32 boe nv-fabricmanager[1469962]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are success>
Nov 25 00:48:32 boe systemd[1]: Started NVIDIA fabric manager service.
```
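Once fabric manager is up, the NVLink/NVSwitch state can be inspected from the driver side (a sketch; output depends on your GPU topology):

```shell
# Matrix of link types between all GPU pairs (NV# entries indicate NVLink)
nvidia-smi topo -m

# Per-GPU NVLink status
nvidia-smi nvlink -s
```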
### 2.2 datacenter-gpu-manager.tar.gz

- The packages I cached locally:

```shell
root@liubei:datacenter-gpu-manager # ll
total 893172
-rw-r--r-- 1 root root 911205922 Nov 14 09:01 datacenter-gpu-manager_1%3a3.3.9_amd64.deb
-rw-r--r-- 1 root root     20068 Jun 29  2023 gcc-12-base_12.3.0-1ubuntu1~22.04_amd64.deb
-rw-r--r-- 1 root root   3235218 May 30  2024 libc6_2.35-0ubuntu3.8_amd64.deb
-rw-r--r-- 1 root root     82004 Dec 18  2021 libcrypt1_1%3a4.4.27-1_amd64.deb
-rw-r--r-- 1 root root     53888 Jun 29  2023 libgcc-s1_12.3.0-1ubuntu1~22.04_amd64.deb
```

- Installation:

```shell
dpkg -i *.deb
```
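The package installs a systemd unit, named nvidia-dcgm in DCGM 3.x (an assumption worth verifying on your system). Start it and list the GPUs it discovers:

```shell
systemctl enable --now nvidia-dcgm

# dcgmi is the DCGM CLI; this lists the GPUs the host engine can see
dcgmi discovery -l
```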
## 3. Join the node to the K8S cluster

The procedure is the same as for an ordinary CPU node; a minimal sketch follows.
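For reference, assuming a kubeadm-managed cluster (follow whatever join procedure your cluster actually uses):

```shell
# On a control-plane node: print a fresh join command (tokens expire after 24h by default)
kubeadm token create --print-join-command

# On the new GPU node: run the command it prints, e.g.
# kubeadm join <apiserver>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
```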
## 4. Install the NVIDIA device plugin

- Create device-plugin.yml with the following content:

```yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: harbocto.xxx.com.cn/bdteam/k8s-device-plugin:1.11
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
      dnsPolicy: ClusterFirst
      nodeSelector:
        gpu: gpu # the plugin only runs on nodes carrying this label
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
```
- Launch the DaemonSet:

```shell
kubectl create -f device-plugin.yml
```
- Label the GPU nodes so the plugin pods get scheduled there:

```shell
kubectl label nodes <node-name> gpu=gpu
```
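Once the label is applied, the DaemonSet pod should start on the node and the node should begin advertising nvidia.com/gpu resources (a sketch; substitute your node name):

```shell
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide

# The node's allocatable resources should now include nvidia.com/gpu
kubectl describe node <node-name> | grep -A 5 "Allocatable"
```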
