Ubuntu 22.04 + k8s 1.32.5 + GPU-Operator 24.9.2: Building a GPU Kubernetes Platform

Preface

A colleague asked me to put together a deployment record for k8s + gpu-operator, covering how to schedule pods that use accelerator cards.


1. Version Information

  • Ubuntu : 22.04.5 LTS
  • k8s : v1.32.5
  • GPU-Operator : v24.9.2

2. Cluster Information

This experiment uses two cloud VMs created on an OpenStack platform: one is the control-plane node, the other is the GPU node, which carries the accelerator cards used for the experiment.

Node            Cores   Memory   Accelerator cards
Control node    8       32 GB    -
GPU node        42      110 GB   2

3. Environment Preparation

3.1 Handling the nouveau driver on the GPU node

Check which driver the accelerator cards are currently using:

shell
lspci -d 10de: -k
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
	Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB]
	Kernel driver in use: nouveau
	Kernel modules: nvidiafb, nouveau
00:07.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
	Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB]
	Kernel driver in use: nouveau
	Kernel modules: nvidiafb, nouveau

Blacklist the nouveau driver:

shell
sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF

sudo update-initramfs -u

reboot

# Check after the reboot; there must be no output
lsmod | grep nouveau
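
Optionally, re-run the lspci query as an extra sanity check; once the blacklist takes effect, the cards should no longer report nouveau as the driver in use:

shell
lspci -d 10de: -k | grep 'Kernel driver'
# No "nouveau" in the output (or no line at all) means the blacklist worked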

3.2 Installing containerd on all nodes

Install the prerequisite packages. For containerd, be sure not to use the official Docker apt repository.

shell
sudo apt install -y lrzsz curl socat conntrack ebtables ipset

# Use a domestic (China) mirror; do not use the Docker repository

cat <<EOF | sudo tee /etc/apt/sources.list
# deb-src entries are commented out by default to speed up apt update; uncomment if needed
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse

deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse

deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse

deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
EOF


sudo apt-get -y update 

sudo apt install -y containerd

containerd --version

sudo ctr version

Generate the default configuration file and adjust it:

shell
sudo mkdir -p /etc/containerd

sudo containerd config default | sudo tee /etc/containerd/config.toml


sudo sed -i "s/SystemdCgroup = false/SystemdCgroup = true/" /etc/containerd/config.toml

sudo sed -i 's|sandbox_image = "registry.k8s.io/pause:3\.8"|sandbox_image = "registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.10"|g' /etc/containerd/config.toml
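
A quick sanity check that both sed edits landed:

shell
grep -nE 'SystemdCgroup|sandbox_image' /etc/containerd/config.toml
# Expect SystemdCgroup = true and the aliyuncs pause:3.10 image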

Configure the local private registry in /etc/containerd/config.toml (harbor-IP and harbor-Port are placeholders for your Harbor address):

toml
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]
        [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor-IP:harbor-Port".tls]
          insecure_skip_verify = true
        [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor-IP:harbor-Port".auth]
          username = "admin"
          password = "Harbor12345"

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor-IP:harbor-Port"]
        endpoint = ["http://harbor-IP:harbor-Port"]

On the GPU node, configure the nvidia runtime in the same file:

toml
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_type = "io.containerd.runc.v2"
          pod_annotations = ["nvidia.com/gpu.*"]
          container_annotations = [
            "nvidia.com/gpu.*",
            "nvidia.com/mig.*"
          ]

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

Restart the service:

shell
sudo systemctl daemon-reload
sudo systemctl enable --now containerd.service
sudo systemctl restart containerd.service
sudo systemctl status containerd.service
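
To confirm containerd actually loaded the edited file, you can dump the merged configuration it is running with and look for the sections added above (a sanity check, not part of the original steps):

shell
sudo containerd config dump | grep -nE 'nvidia|harbor|SystemdCgroup'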

4. Deploying Kubernetes

I use the kk (KubeKey) tool for the deployment here.

shell
chmod a+x kk
export KKZONE=cn
./kk create config  -f k8s-v1325.yaml --with-kubernetes v1.32.5

./kk create cluster -f ./k8s-v1325.yaml
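
For reference, a rough sketch of what the edited k8s-v1325.yaml might contain, based on KubeKey's v1alpha2 Cluster format; all names, addresses, and credentials below are hypothetical placeholders, not values from this cluster:

shell
cat > k8s-v1325.yaml << EOF
apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: gpu-cluster
spec:
  hosts:
  # hypothetical names, addresses, and credentials; replace with your own
  - {name: master, address: 192.168.0.10, internalAddress: 192.168.0.10, user: ubuntu, password: "changeme"}
  - {name: gpu-node, address: 192.168.0.11, internalAddress: 192.168.0.11, user: ubuntu, password: "changeme"}
  roleGroups:
    etcd:
    - master
    control-plane:
    - master
    worker:
    - gpu-node
  kubernetes:
    version: v1.32.5
    containerManager: containerd
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
EOF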

On VMs, adjust the network plugin's MTU (the overlay networking used by OpenStack reduces the usable MTU, so the CNI MTU must be lowered to match):

shell
kubectl edit configmap -n kube-system calico-config
# Change veth_mtu to 1450
veth_mtu: "1450"
# After the change, delete the pods so they reload the config
kubectl delete pod -n kube-system -l k8s-app=calico-node
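
To verify the new MTU took effect, inspect the calico interfaces on a node after the pods restart (interface names are generated per pod, so yours will differ):

shell
ip link | grep -E 'cali|tunl0' | grep -o 'mtu [0-9]*'
# the cali* veths should show mtu 1450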

5. Installing the GPU Operator

5.1 Downloading the specific Helm chart version

shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm pull nvidia/gpu-operator --version=v24.9.2
# helm pull saves the chart as gpu-operator-v24.9.2.tgz; unpack it so that
# values.yaml can be edited (the install step below references ../gpu-operator)
tar -xf gpu-operator-v24.9.2.tgz

5.2 Configuration changes

Edit the following options in the chart's values.yaml:

yaml
operator:
  defaultRuntime: containerd
  runtimeClass: nvidia

dcgm:
  # disabled by default to use embedded nv-hostengine by exporter
  enabled: true
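
Alternatively, the same values can be passed on the command line with helm's --set flags instead of editing values.yaml (the keys are the ones shown above):

shell
helm install gpu-operator ../gpu-operator -n gpu-operator-resources \
  --set operator.defaultRuntime=containerd \
  --set operator.runtimeClass=nvidia \
  --set dcgm.enabled=true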

5.3 Deploying

shell
# Create the namespace
kubectl create namespace gpu-operator-resources
# Install the chart
helm install gpu-operator ../gpu-operator -n gpu-operator-resources -f values.yaml

# Uninstall command
#helm uninstall gpu-operator -n gpu-operator-resources

# View the operator logs
kubectl logs -n gpu-operator-resources deploy/gpu-operator

Some of the images have to be pulled from registries outside China, so prepare them in advance.

5.4 Deployment result

shell
kubectl get pod -n gpu-operator-resources
# Output
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rfdnd                                   1/1     Running     0          15m
gpu-operator-dd585d846-g85ld                                  1/1     Running     0          16m
gpu-operator-node-feature-discovery-gc-7b698779b5-sz8p2       1/1     Running     0          16m
gpu-operator-node-feature-discovery-master-56fb77bf65-tsqsf   1/1     Running     0          16m
gpu-operator-node-feature-discovery-worker-8rjb7              1/1     Running     0          16m
gpu-operator-node-feature-discovery-worker-p2rsq              1/1     Running     0          16m
nvidia-container-toolkit-daemonset-8pwck                      1/1     Running     0          15m
nvidia-cuda-validator-r5rr8                                   0/1     Completed   0          12m
nvidia-dcgm-exporter-69prx                                    1/1     Running     0          15m
nvidia-dcgm-qbx6t                                             1/1     Running     0          15m
nvidia-device-plugin-daemonset-ncxmz                          1/1     Running     0          15m
nvidia-driver-daemonset-s2jwg                                 1/1     Running     0          16m
nvidia-operator-validator-72fnp                               1/1     Running     0          15m
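
The driver daemonset compiles and installs the kernel module, so it can take a few minutes for everything to reach Running; watching the namespace shows the pods come up one by one:

shell
kubectl get pod -n gpu-operator-resources -w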

6. Checking GPU Information

shell
kubectl describe node GPU-test | grep -A 10 Allocatable
# Output
Allocatable:
  cpu:                41600m
  ephemeral-storage:  515923476Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             111715854453
  nvidia.com/gpu:     2
  pods:               110
System Info:
  Machine ID:                 3cc7ba2f5dbe4a9cbb18ff55e1c7be6b
  System UUID:                174e2614-9f69-4ebe-b067-8e64cd77a6c6
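
A more compact way to check per-node GPU capacity (a hedged one-liner; note that dots in the resource name must be escaped in jsonpath):

shell
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'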

7. Deploying a Jupyter Training Test Instance

shell
cat > notebook-example.yml << EOF
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30101
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: docker.1ms.run/tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888

EOF

# This image is large; pull it on the GPU node ahead of time
sudo ctr -n=k8s.io i pull  docker.1ms.run/tensorflow/tensorflow:latest-gpu-jupyter

# Create the instance
kubectl apply -f notebook-example.yml

# Get the token from the pod logs
kubectl logs tf-notebook

# Check accelerator utilization
kubectl exec tf-notebook -- nvidia-smi

Open the notebook in a browser: http://<node-IP>:30101/?token=<token> (use any node's IP and the token from the logs above).
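
To double-check that TensorFlow inside the pod actually sees the card (a quick hedged check; the image ships with TensorFlow preinstalled):

shell
kubectl exec tf-notebook -- python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Expect a non-empty list of PhysicalDevice entries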


Summary

All the components work well; the only snag is that the images need to be prepared in advance. Below are the ctr commands I used to pull an image, retag it, and push it to the private registry:

shell
sudo ctr -n=k8s.io image pull nvcr.io/nvidia/cloud-native/gpu-operator:v24.9.2   --all-platforms
sudo ctr -n=k8s.io i tag nvcr.io/nvidia/cloud-native/gpu-operator:v24.9.2  harbor-IP:harbor-Port/nvidia/cloud-native/gpu-operator:v24.9.2 
sudo ctr -n=k8s.io i push --plain-http=true  -u admin:Harbor12345  harbor-IP:harbor-Port/nvidia/cloud-native/gpu-operator:v24.9.2 

With the --all-platforms flag added to the pull, the push after retagging will not complain about missing layers.
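
To mirror several images in one pass, the same three commands can be wrapped in a small loop; the image list is a placeholder you would fill with whatever the operator pods fail to pull:

shell
HARBOR=harbor-IP:harbor-Port   # your registry address
IMAGES="nvcr.io/nvidia/cloud-native/gpu-operator:v24.9.2"   # add more images as needed
for img in $IMAGES; do
  dst="$HARBOR/${img#*/}"      # drop the source registry host, keep the path
  sudo ctr -n=k8s.io i pull --all-platforms "$img"
  sudo ctr -n=k8s.io i tag "$img" "$dst"
  sudo ctr -n=k8s.io i push --plain-http=true -u admin:Harbor12345 "$dst"
done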
