Contents
- Preface
- 1. Versions
- 2. Cluster information
- 3. Environment preparation
  - 3.1 Handling the nouveau driver on the GPU node
  - 3.2 Installing containerd on all nodes
- 4. Deploying Kubernetes
- 5. Installing the GPU Operator
  - 5.1 Pulling a pinned chart version with Helm
  - 5.2 Configuration changes
  - 5.3 Deployment
  - 5.4 Deployment results
- 6. Checking GPU information
- 7. Deploying a Jupyter training test instance
- Summary
Preface
A colleague asked me to put together a deployment record for k8s + GPU-Operator, so that pods can be scheduled onto accelerator cards.
1. Versions
- Ubuntu : 22.04.5 LTS
- k8s : v1.32.5
- GPU-Operator : v24.9.2
2. Cluster information
This experiment uses two VMs created on an OpenStack cloud: one control-plane node, and one GPU node carrying the accelerator cards used for testing.
| Node | vCPUs | Memory | Accelerator cards |
|---|---|---|---|
| Control node | 8 | 32 GB | none |
| GPU node | 42 | 110 GB | 2 |
3. Environment preparation
3.1 Handling the nouveau driver on the GPU node
Check which driver the accelerator cards are currently using:
```shell
lspci -d 10de: -k
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB]
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau
00:07.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB]
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau
```
Blacklist the nouveau driver:
```shell
sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
reboot
# After the reboot, verify: this must produce no output
lsmod | grep nouveau
```
3.2 Installing containerd on all nodes
Install the packages. For containerd, make sure you do not use the official Docker apt source:
```shell
sudo apt install -y lrzsz && sudo apt install curl socat conntrack ebtables ipset -y
# Use a domestic (CN) mirror; do not use the Docker source
cat <<EOF | sudo tee /etc/apt/sources.list
# deb-src entries are commented out by default to speed up apt update; uncomment if needed
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
EOF
sudo apt-get -y update
sudo apt install -y containerd
containerd --version
sudo ctr version
```
Generate the default config file and adjust it:
```shell
sudo mkdir -p /etc/containerd
sudo containerd config default | sudo tee /etc/containerd/config.toml
# switch to the systemd cgroup driver and a reachable pause image
sudo sed -i "s/SystemdCgroup = false/SystemdCgroup = true/" /etc/containerd/config.toml
sudo sed -i 's|sandbox_image = "registry.k8s.io/pause:3\.8"|sandbox_image = "registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.10"|g' /etc/containerd/config.toml
```
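The two sed expressions can be sanity-checked against a small sample before touching the real file (the two sample lines below are a hypothetical excerpt of the default config, not the full file):

```shell
# Hypothetical two-line excerpt of containerd's default config.toml
cat > /tmp/config-sample.toml <<'EOF'
            SystemdCgroup = false
    sandbox_image = "registry.k8s.io/pause:3.8"
EOF
# Apply the same substitutions as above, then inspect the result
sed -i "s/SystemdCgroup = false/SystemdCgroup = true/" /tmp/config-sample.toml
sed -i 's|sandbox_image = "registry.k8s.io/pause:3\.8"|sandbox_image = "registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.10"|g' /tmp/config-sample.toml
cat /tmp/config-sample.toml
```

Both lines should come back rewritten; if either substitution prints unchanged, the default config on your containerd version likely uses a different pause tag and the pattern needs adjusting.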
Configure the local private registry in /etc/containerd/config.toml (harbor-IP and harbor-Port are placeholders for your Harbor address):
```toml
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = ""
  [plugins."io.containerd.grpc.v1.cri".registry.auths]
  [plugins."io.containerd.grpc.v1.cri".registry.configs]
    [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor-IP:harbor-Port".tls]
      insecure_skip_verify = true
    [plugins."io.containerd.grpc.v1.cri".registry.configs."harbor-IP:harbor-Port".auth]
      username = "admin"
      password = "Harbor12345"
  [plugins."io.containerd.grpc.v1.cri".registry.headers]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor-IP:harbor-Port"]
      endpoint = ["http://harbor-IP:harbor-Port"]
```
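Note that on newer containerd releases the mirrors/configs tables under the CRI plugin are deprecated in favor of per-registry host files. An equivalent sketch using that mechanism (assuming you set config_path = "/etc/containerd/certs.d" above instead of leaving it empty) would be:

```toml
# /etc/containerd/certs.d/harbor-IP:harbor-Port/hosts.toml
server = "http://harbor-IP:harbor-Port"

[host."http://harbor-IP:harbor-Port"]
  capabilities = ["pull", "resolve", "push"]
  skip_verify = true
```

Either style works on the containerd version Ubuntu 22.04 ships; pick one, not both.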
On the GPU node, additionally declare the nvidia runtime in the same file:
```toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_type = "io.containerd.runc.v2"
  pod_annotations = ["nvidia.com/gpu.*"]
  container_annotations = [
    "nvidia.com/gpu.*",
    "nvidia.com/mig.*"
  ]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```
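On the Kubernetes side, this containerd runtime is selected through a RuntimeClass named nvidia. The GPU Operator normally creates that object itself (see runtimeClass: nvidia in section 5.2), so you should not need to apply it by hand; for reference, the object it corresponds to looks like:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# handler must match the containerd runtime name configured above
handler: nvidia
```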
Restart the service:
```shell
sudo systemctl daemon-reload
sudo systemctl enable --now containerd.service
sudo systemctl restart containerd.service
sudo systemctl status containerd.service
```
4. Deploying Kubernetes
I deployed the cluster with the kk (KubeKey) tool:
```shell
chmod a+x kk
export KKZONE=cn
./kk create config -f k8s-v1325.yaml --with-kubernetes v1.32.5
./kk create cluster -f ./k8s-v1325.yaml
```
On cloud VMs, lower the CNI plugin's MTU: the OpenStack overlay network already consumes part of each 1500-byte frame, so Calico's veth MTU is set to 1450 to leave room for the encapsulation headers.
```shell
kubectl edit configmap -n kube-system calico-config
# set veth_mtu to 1450
veth_mtu: "1450"
# after editing, delete the pods so they pick up the new config
kubectl delete pod -n kube-system -l k8s-app=calico-node
```
5. Installing the GPU Operator
5.1 Pull a pinned chart version with Helm
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm pull nvidia/gpu-operator --version=24.9.2
# unpack the chart; section 5.3 installs from the extracted ../gpu-operator folder
tar -xf gpu-operator-*.tgz
```
5.2 Configuration changes
Edit values.yaml in the extracted chart:
```yaml
operator:
  defaultRuntime: containerd
  runtimeClass: nvidia
dcgm:
  # disabled by default to use embedded nv-hostengine by exporter
  enabled: true
```
5.3 Deployment
```shell
# create the namespace
kubectl create namespace gpu-operator-resources
# deploy
helm install gpu-operator ../gpu-operator -n gpu-operator-resources -f values.yaml
# uninstall command
#helm uninstall gpu-operator -n gpu-operator-resources
# check the logs
kubectl logs -n gpu-operator-resources deploy/gpu-operator
```
Several images have to be pulled from external registries here, so make sure the nodes can reach them, or mirror them into the private registry beforehand.
5.4 Deployment results
```shell
kubectl get pod -n gpu-operator-resources
# output
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rfdnd                                   1/1     Running     0          15m
gpu-operator-dd585d846-g85ld                                  1/1     Running     0          16m
gpu-operator-node-feature-discovery-gc-7b698779b5-sz8p2       1/1     Running     0          16m
gpu-operator-node-feature-discovery-master-56fb77bf65-tsqsf   1/1     Running     0          16m
gpu-operator-node-feature-discovery-worker-8rjb7              1/1     Running     0          16m
gpu-operator-node-feature-discovery-worker-p2rsq              1/1     Running     0          16m
nvidia-container-toolkit-daemonset-8pwck                      1/1     Running     0          15m
nvidia-cuda-validator-r5rr8                                   0/1     Completed   0          12m
nvidia-dcgm-exporter-69prx                                    1/1     Running     0          15m
nvidia-dcgm-qbx6t                                             1/1     Running     0          15m
nvidia-device-plugin-daemonset-ncxmz                          1/1     Running     0          15m
nvidia-driver-daemonset-s2jwg                                 1/1     Running     0          16m
nvidia-operator-validator-72fnp                               1/1     Running     0          15m
```
6. Checking GPU information
```shell
kubectl describe node GPU-test | grep Allocatable -A 10
# output
Allocatable:
  cpu:                41600m
  ephemeral-storage:  515923476Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             111715854453
  nvidia.com/gpu:     2
  pods:               110
System Info:
  Machine ID:   3cc7ba2f5dbe4a9cbb18ff55e1c7be6b
  System UUID:  174e2614-9f69-4ebe-b067-8e64cd77a6c6
```
The key line is nvidia.com/gpu: 2, which shows both cards are now schedulable resources.
7. Deploying a Jupyter training test instance
```shell
cat > notebook-example.yml << EOF
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  type: NodePort
  ports:
  - port: 80
    name: http
    targetPort: 8888
    nodePort: 30101
  selector:
    app: tf-notebook
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  securityContext:
    fsGroup: 0
  containers:
  - name: tf-notebook
    image: docker.1ms.run/tensorflow/tensorflow:latest-gpu-jupyter
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
    - containerPort: 8888
EOF
# the image is large; pull it on the GPU node in advance
sudo ctr -n=k8s.io i pull docker.1ms.run/tensorflow/tensorflow:latest-gpu-jupyter
# create the instance
kubectl apply -f notebook-example.yml
# read the Jupyter token from the logs
kubectl logs tf-notebook
# check accelerator utilization
kubectl exec tf-notebook -- nvidia-smi
```
Open the page at http://<GPU-node-IP>:30101/?token=<token>, using the token printed in the pod logs.
Summary
All the components work well; the only thing that needs preparation in advance is the images. Below are the ctr commands I used to pull an image, retag it, and push it to the private registry:
```shell
sudo ctr -n=k8s.io image pull nvcr.io/nvidia/cloud-native/gpu-operator:v24.9.2 --all-platforms
sudo ctr -n=k8s.io i tag nvcr.io/nvidia/cloud-native/gpu-operator:v24.9.2 harbor-IP:harbor-Port/nvidia/cloud-native/gpu-operator:v24.9.2
sudo ctr -n=k8s.io i push --plain-http=true -u admin:Harbor12345 harbor-IP:harbor-Port/nvidia/cloud-native/gpu-operator:v24.9.2
```
With the --all-platforms flag on the pull, the retagged image pushes without "missing layer" errors.
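The pull/tag/push pattern generalizes to any image list. A minimal sketch that only prints the commands for each image (REGISTRY is a placeholder for your Harbor endpoint; extend IMAGES with whatever your environment needs, then drop the echo to execute):

```shell
REGISTRY="harbor-IP:harbor-Port"   # placeholder: your Harbor endpoint
IMAGES="nvcr.io/nvidia/cloud-native/gpu-operator:v24.9.2"
for img in $IMAGES; do
  # drop the source registry host, keep the repository path and tag
  target="$REGISTRY/${img#*/}"
  echo "sudo ctr -n=k8s.io image pull $img --all-platforms"
  echo "sudo ctr -n=k8s.io i tag $img $target"
  echo "sudo ctr -n=k8s.io i push --plain-http=true -u admin:Harbor12345 $target"
done
```

The ${img#*/} expansion strips everything up to the first slash, so the repository path under Harbor mirrors the upstream layout, which is what the GPU Operator's repository overrides expect.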