记K8S集群工作节点,AnolisOS 8.6部署显卡驱动集成Containerd运行时

1、安装gcc

#安装编译环境

复制代码
yum -y install make gcc gcc-c++

2、下载显卡驱动

点击 直达连接

nvidia高级搜索下载历史版本驱动程序(下载历史版本驱动)

复制代码
https://www.nvidia.cn/Download/Find.aspx?lang=cn

3、安装驱动

安装显卡驱动

复制代码
 ./NVIDIA-Linux-x86_64-535.98.run  -m=kernel-open

4、修改系统参数,更新内核,重启服务器

复制代码
rm -f /etc/modprobe.d/blacklist-nvidia-nouveau.conf /etc/modprobe.d/nvidia-unsupported-gpu.conf
echo blacklist nouveau | tee /etc/modprobe.d/blacklist-nvidia-nouveau.conf && \
	echo options nouveau modeset=0 | tee -a /etc/modprobe.d/blacklist-nvidia-nouveau.conf && \
	echo options nvidia NVreg_OpenRmEnableUnsupportedGpus=1 | tee /etc/modprobe.d/nvidia-unsupported-gpu.conf && \
	dracut --force && \
	 /sbin/reboot

5、检查驱动

执行nvidia-smi

复制代码
Wed Aug 16 13:46:06 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:13:00.0 Off |                  N/A |
| 32%   21C    P8               8W / 350W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

6、安装nvidia-container-runtime

#安装源

复制代码
curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

#安装容器运行时

复制代码
yum install -y nvidia-container-runtime

7、修改containerd配置文件

7.1、增加如下配置

复制代码
  [plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false

7.2、修改container配置

复制代码
修改前:runtime_type = "io.containerd.runc.v2" 
修改后:runtime_type = "io.containerd.runtime.v1.linux"

7.3、完整配置文件

复制代码
[root@ai-4 containerd]# pwd
/etc/containerd
[root@ai-4 containerd]# cat config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
oom_score = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  address = "/run/containerd/containerd-debug.sock"
  uid = 0
  gid = 0
  level = "warn"

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "sealos.hub:5000/pause:3.2"
    max_container_log_line_size = -1
    max_concurrent_downloads = 20
    disable_apparmor = true
    [plugins."io.containerd.grpc.v1.cri".containerd]
      snapshotter = "overlayfs"
      default_runtime_name = "runc"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runtime.v1.linux"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d"
      [plugins."io.containerd.grpc.v1.cri".registry.configs]
          [plugins."io.containerd.grpc.v1.cri".registry.configs."sealos.hub:5000".auth]
            username = "admin"
            password = "***********"
  [plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false

8、测试containerd下显卡是否正常加载显卡

复制代码
[root@ai-4 containerd]# ctr run --rm --gpus 0 docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi nvidia-smi
Wed Aug 16 05:57:19 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:13:00.0 Off |                  N/A |
| 32%   21C    P8               8W / 350W |      4MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

9、K8S部署插件支持显卡(如果没有部署可通过如下命令部署,K8S Master上执行)

复制代码
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.1/nvidia-device-plugin.yml

10、K8S检查对应节点是否有GPU资源

复制代码
[root@k8s-master-17227100216 ~]# kubectl describe node node9 |grep gpu
                    gpu/type=nvidia
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
  nvidia.com/gpu     0           0

11、部署GPU测试容器

复制代码
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      #image: "k8s.gcr.io/cuda-vector-add:v0.1"
      image: "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04"
      command:
      - nvidia-smi
      resources:
        limits:
          nvidia.com/gpu: 1
相关推荐
夏日听雨眠7 小时前
LInux(逻辑地址与物理地址的区别,文件描述符,lseek函数)
linux·运维·网络
哲霖软件8 小时前
ERP 赋能非标自动化行业:破解物料与库存管理难题
运维·自动化
qq_542515419 小时前
Ubuntu 22.04.4 LTS安装ToDesk最新版打不开,无响应?旧版本4.7.2_277版本分享
linux·ubuntu·todesk
火车叼位9 小时前
替代 Tiny Win10 的 Linux 方案:Debian XFCE 精简桌面搭建
linux·运维
小麦嵌入式9 小时前
FPGA入门(四):时序逻辑计数器原理与 LED 闪烁实现
linux·驱动开发·stm32·嵌入式硬件·fpga开发·硬件工程·dsp开发
皮卡蛋炒饭.10 小时前
传输层协议UDP
linux·网络协议·udp
syagain_zsx11 小时前
Linux指令初识(实用篇)
linux·运维·服务器
OYangxf11 小时前
Git Commit Message
运维·git
Alter123011 小时前
从“力大砖飞”到“拟态共生”,新华三定义AI基础设施的系统级进化
大数据·运维·人工智能
王木风11 小时前
终端里的编程副驾:DeepSeek-TUI-项目深度拆解,实测与原理分析
linux·运维·人工智能·rust·node.js