How to configure and use NVIDIA GPUs in k8s

0. Install the driver dependencies

0.1 Install CUDA

# Reference: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0

0.2 Install the driver

# Reference: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network
sudo apt-get install -y cuda-drivers
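
Before moving on, it can help to confirm that the driver actually loaded (a reboot is usually needed after installing it). The snippet below is a minimal sketch of such a check. `check_nvidia_driver` is a hypothetical helper name introduced here for illustration; it relies only on `nvidia-smi`, which ships with the driver.

```shell
# Hypothetical helper: report whether the NVIDIA driver is usable on this host.
check_nvidia_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the driver is not installed or not on PATH"
    return 1
  fi
  # Prints one CSV line per GPU: product name and driver version.
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}

# Usage: check_nvidia_driver || echo "fix the driver before continuing"
```

If the function returns non-zero, the later steps (container toolkit, device plugin) will not find a GPU either, so it is worth fixing the driver first.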

1. Install the NVIDIA Container Toolkit

# Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
   curl \
   gnupg2
   
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    
# Optional: enable the experimental packages in the repository list
sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
# Pin all four packages to the same version
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.0-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
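
Because the four packages are pinned to one version, a quick check that they all landed on that version can catch a partial install. This is a sketch only: `installed_version` and `check_pinned` are hypothetical helper names, and the default version below simply repeats the value used in the install step.

```shell
# Default to the version pinned in the install step if the variable is unset.
: "${NVIDIA_CONTAINER_TOOLKIT_VERSION:=1.18.0-1}"

# Hypothetical helper: print the installed version of a Debian package
# (prints nothing if the package is not installed).
installed_version() {
  dpkg-query -W -f='${Version}' "$1" 2>/dev/null
}

# Hypothetical helper: $1 = package name, $2 = installed version (may be empty).
check_pinned() {
  if [ "$2" = "$NVIDIA_CONTAINER_TOOLKIT_VERSION" ]; then
    echo "$1 OK ($2)"
  else
    echo "$1 MISMATCH (installed: ${2:-none}, wanted: $NVIDIA_CONTAINER_TOOLKIT_VERSION)"
    return 1
  fi
}

# Usage:
#   for p in nvidia-container-toolkit nvidia-container-toolkit-base \
#            libnvidia-container-tools libnvidia-container1; do
#     check_pinned "$p" "$(installed_version "$p")"
#   done
```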

Configure containerd and restart it

sudo nvidia-ctk runtime configure --runtime=containerd
# By default, nvidia-ctk creates the drop-in config file /etc/containerd/conf.d/99-nvidia.toml and modifies (or creates) /etc/containerd/config.toml so that its imports option is updated accordingly. The drop-in file is what lets containerd use the NVIDIA container runtime.
sudo systemctl restart containerd
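
To confirm that the configuration step took effect, one option is to look for an `nvidia` runtime entry in the containerd config files. The sketch below assumes the runtime is declared in a TOML table whose name contains `runtimes.nvidia`; the exact table name depends on the containerd config version, and `has_nvidia_runtime` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the given containerd config file
# declares a runtime table named "nvidia" (assumed to contain "runtimes.nvidia").
has_nvidia_runtime() {
  grep -q 'runtimes\.nvidia' "$1"
}

# Check the drop-in file first, then the main config file.
for f in /etc/containerd/conf.d/99-nvidia.toml /etc/containerd/config.toml; do
  if [ -f "$f" ] && has_nvidia_runtime "$f"; then
    echo "nvidia runtime configured in $f"
  fi
done
```

If nothing is printed, re-run the `nvidia-ctk runtime configure` command and restart containerd again.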

2. Configure the NVIDIA k8s plugin

Reference: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

2.1 Create a RuntimeClass

This RuntimeClass is referenced from nvidia-device-plugin.yml through the runtimeClassName field.

cat <<EOF | kubectl create -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

Alternatively, make nvidia the default runtime:

sudo nvidia-ctk runtime configure --runtime=containerd --nvidia-set-as-default # use the nvidia runtime by default
sudo systemctl restart containerd

2.2 Create the nvidia-device-plugin

Option 1:

# Note: this requires nvidia to be the default runtime: nvidia-ctk runtime configure --runtime=containerd --nvidia-set-as-default
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

Option 2:

# Fetch the YAML file
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

# Add the field runtimeClassName: nvidia to the YAML file, for example:
apiVersion: apps/v1
kind: DaemonSet
...
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
...
  template:
...
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
...
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      runtimeClassName: nvidia     # add this line (pod spec level, next to priorityClassName)
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
        name: nvidia-device-plugin-ctr

Apply it:

kubectl create -f nvidia-device-plugin.yml
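
After the manifest is created, it can be useful to wait for the DaemonSet to become ready and to look up the generated pod names instead of guessing them. The sketch below takes the namespace, DaemonSet name, and pod label from the static manifest used above; the function names are hypothetical helpers, and the 120s default timeout is an arbitrary assumption.

```shell
# Values taken from the static nvidia-device-plugin.yml manifest.
NS=kube-system
DS=nvidia-device-plugin-daemonset
LABEL=name=nvidia-device-plugin-ds

# Hypothetical helper: block until every plugin pod is ready, or time out.
# $1 = optional timeout (assumed default: 120s).
wait_for_device_plugin() {
  kubectl -n "$NS" rollout status "daemonset/$DS" --timeout="${1:-120s}"
}

# Hypothetical helper: list the plugin pods, one "pod/<name>" entry per line.
device_plugin_pods() {
  kubectl -n "$NS" get pods -l "$LABEL" -o name
}

# Usage:
#   wait_for_device_plugin 60s && device_plugin_pods
```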

3. Verification

# 1. Inspect the nvidia-device-plugin pod (the generated pod name suffix will differ in your cluster)
kubectl describe pod nvidia-device-plugin-daemonset-sm24n -n kube-system
# Output:
Name:                 nvidia-device-plugin-daemonset-sm24n
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  27s   default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-sm24n to master
  Normal  Pulled     26s   kubelet            Container image "nvcr.io/nvidia/k8s-device-plugin:v0.17.1" already present on machine
  Normal  Created    26s   kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    26s   kubelet            Started container nvidia-device-plugin-ctr


# 2. Check whether the node now exposes the NVIDIA resource
kubectl describe node master
# Output:
Name:               master
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
....
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                2100m (6%)   1900m (5%)
  memory             3088Mi (9%)  8696Mi (27%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     0            0             # NVIDIA GPU resource
  
# 3. If the GPU is available, run the official test sample on it
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

# Check the result with kubectl logs
kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
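
The manual checks above can also be scripted. The following sketch reads the node's allocatable `nvidia.com/gpu` count and the test pod's phase through `kubectl` JSONPath output; the function names are hypothetical helpers, and `master` and `gpu-pod` are the node and pod names used in this walkthrough.

```shell
# Hypothetical helper: print the allocatable nvidia.com/gpu count of node $1
# (prints nothing if the node does not advertise the resource).
node_gpu_count() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Hypothetical helper: succeed only if node $1 advertises at least one GPU.
node_has_gpu() {
  count="$(node_gpu_count "$1")"
  [ -n "$count" ] && [ "$count" -gt 0 ]
}

# Hypothetical helper: print the phase of pod $1 (Pending, Running, Succeeded, Failed, ...).
pod_phase() {
  kubectl get pod "$1" -o jsonpath='{.status.phase}'
}

# Usage:
#   node_has_gpu master && echo "GPU advertised on master"
#   pod_phase gpu-pod
```

Because the test pod uses restartPolicy: Never, a passing run should leave it in the Succeeded phase.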

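4. Common problems

nvidia-device-plugin finds no usable GPU

kubectl describe on the nvidia-device-plugin pod reports that no usable GPU was found.

When the driver and the runtime are both installed correctly, this is usually a container runtime configuration problem.

Fix it by creating the RuntimeClass or by passing --nvidia-set-as-default to nvidia-ctk; see step 2.

gpu-pod fails with an error

kubectl logs gpu-pod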
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

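Cause: a version mismatch. The installed driver supports an older CUDA version than the CUDA 12.5 runtime in the cuda-sample:vectoradd-cuda12.5.0 image. Upgrade the driver, or use a sample image built for an older CUDA version.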
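
One way to diagnose this is to compare the highest CUDA version the driver supports (shown as "CUDA Version" in the `nvidia-smi` banner) with the CUDA version in the sample image tag. The sketch below assumes GNU `sort -V` is available; `driver_cuda_version` and `version_ge` are hypothetical helper names.

```shell
# Hypothetical helper: read `nvidia-smi` banner text on stdin and print
# the CUDA version it reports, e.g. 12.2.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n1
}

# Hypothetical helper: succeed when version $1 >= version $2.
# Relies on `sort -V` (version sort) to order the two values.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Usage on the GPU node (12.5 is the CUDA version in the sample image tag):
#   have="$(nvidia-smi | driver_cuda_version)"
#   if version_ge "$have" 12.5; then echo "driver is new enough"; else echo "driver too old"; fi
```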
