Making Kubernetes GPU-aware: using k8s-device-plugin (pitfalls already worked through)

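1. Definition

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:

  • Expose the number of GPUs on each node of the cluster
  • Keep track of the health of the GPUs
  • Run GPU-enabled containers in the Kubernetes cluster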
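
To make the first point concrete: once the plugin is running, a GPU node advertises an `nvidia.com/gpu` entry under `status.allocatable`. The sketch below is only an illustration, not part of the plugin. It counts those entries from the JSON printed by `kubectl get nodes -o json`. The function name `gpus_per_node` and the sample node names are made up for the example.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"  # resource name used in the Pod spec later in this post


def gpus_per_node(nodes_json: str) -> dict:
    """Map node name -> allocatable GPU count from `kubectl get nodes -o json` output."""
    nodes = json.loads(nodes_json)
    result = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        allocatable = item.get("status", {}).get("allocatable", {})
        # Quantities are strings in the API; a node without the plugin has no entry.
        result[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


# Hypothetical sample: one GPU node and one CPU-only node.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "2"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "16"}}},
    ]
})

print(gpus_per_node(SAMPLE))
```

On a real cluster the input would be the output of `kubectl get nodes -o json` instead of `SAMPLE`. A count of 0 on a node that physically has a GPU is the symptom described in section 5.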

2. Prerequisites

  • NVIDIA drivers ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
  • nvidia-container-runtime configured as the default low-level runtime
  • Kubernetes version >= 1.10
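
Version strings are easy to compare wrongly by eye or as plain text ("1.9" sorts after "1.10" as a string). The following is a rough sketch of a numeric comparison for the `>=` requirements in the list above. The helper names and the "found" versions are hypothetical, and the driver line is left out because the exact meaning of its `~=` is not spelled out here.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn 'v1.28.2' or '1.14.3-1' into a tuple of ints such as (1, 28, 2)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def at_least(found: str, minimum: str) -> bool:
    """True if `found` is the same as or newer than `minimum`."""
    a, b = parse_version(found), parse_version(minimum)
    # Pad the shorter tuple with zeros so (1, 10) compares equal to (1, 10, 0).
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimums taken from the prerequisite list.
MINIMUMS = {"kubernetes": "1.10", "nvidia-container-toolkit": "1.7.0"}

# Hypothetical versions, as one might collect them by hand from a node.
found = {"kubernetes": "v1.28.2", "nvidia-container-toolkit": "1.14.3-1"}

for component, minimum in MINIMUMS.items():
    status = "ok" if at_least(found[component], minimum) else "too old"
    print(f"{component}: {found[component]} (needs >= {minimum}) -> {status}")
```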

3. Installation

kubectl apply -f nvidia-device-plugin.yml

The contents of the YAML file are as follows:

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.5.5.25:8080/nvidia/k8s-device-plugin:v0.17.0-ubi9
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
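
After applying the manifest it is worth checking that the DaemonSet actually rolled out. A minimal sketch, assuming the JSON comes from `kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset -o json`: it compares the `desiredNumberScheduled` and `numberReady` fields of the DaemonSet status. The function name and the sample values are made up for the example.

```python
import json


def daemonset_ready(ds_json: str) -> bool:
    """True if at least one pod is scheduled and every scheduled pod reports ready."""
    status = json.loads(ds_json).get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    return desired > 0 and ready == desired


# Hypothetical status of a three-node cluster where all plugin pods are ready.
SAMPLE_DS = json.dumps({
    "metadata": {"name": "nvidia-device-plugin-daemonset", "namespace": "kube-system"},
    "status": {"desiredNumberScheduled": 3, "numberReady": 3},
})

print(daemonset_ready(SAMPLE_DS))
```

A ready DaemonSet does not prove that GPUs are advertised. As I understand it, with `FAIL_ON_INIT_ERROR` set to "false" the plugin pod keeps running even when it cannot initialize on a node, so the allocatable `nvidia.com/gpu` count on each node still needs to be checked.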

4. Testing

4.1 Write a YAML manifest and run a test Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

4.2 Check the logs of gpu-pod
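
The plain command for this step is `kubectl logs gpu-pod`. The sketch below wraps it and checks for the success marker. The log text in `SAMPLE_LOG` is an assumption based on what NVIDIA's vectorAdd sample is expected to print, not output captured from this cluster, and both function names are made up for the example.

```python
import subprocess


def fetch_pod_logs(pod: str = "gpu-pod", namespace: str = "default") -> str:
    """Return the pod's logs by shelling out to `kubectl logs` (needs cluster access)."""
    completed = subprocess.run(
        ["kubectl", "logs", pod, "-n", namespace],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


def vectoradd_passed(log_text: str) -> bool:
    """True if a line consisting of the success marker appears in the logs."""
    return any(line.strip() == "Test PASSED" for line in log_text.splitlines())


# Assumed shape of a successful run; not taken from a real cluster.
SAMPLE_LOG = """[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
"""

print(vectoradd_passed(SAMPLE_LOG))
```

With cluster access, `vectoradd_passed(fetch_pod_logs())` performs the same check on the real Pod. If the Pod stays in Pending instead, the node is most likely not advertising `nvidia.com/gpu`, which is the situation covered in section 5.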

5. Problems encountered

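After the installation finished, no GPU information showed up on the node. Looking at /etc/docker/daemon.json confirmed that the container toolkit was already installed, but `docker info` showed that the runtime was still runc. I guessed this was the cause, so I set default-runtime as follows (192.168.237.50:8080 is my private registry):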

{
    "data-root":"/data/docker_data",
    "insecure-registries":[
        "192.168.237.50:8080",
        "127.0.0.0/8"
    ],
    "registry-mirrors":[
        "192.168.237.50:8080",
        "https://docker.m.daocloud.io",
        "https://docker.unsee.tech",
        "https://docker.1panel.live",
        "http://mirrors.ustc.edu.cn",
        "https://docker.chenby.cn",
        "http://mirror.azure.cn",
        "https://dockerpull.org",
        "https://dockerhub.icu",
        "https://hub.rat.dev",
        "https://proxy.1panel.live",
        "https://docker.1panel.top",
        "https://docker.m.daocloud.io",
        "https://docker.1ms.run",
        "https://docker.ketches.cn",
        "https://mirror.aliyuncs.com"
    ],
    "runtimes":{
        "nvidia":{
            "args":[],
            "path":"nvidia-container-runtime"
        }
    },
    "default-runtime":"nvidia"
}
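
The two keys that matter here are `runtimes.nvidia` and `default-runtime`. A small sketch, under the assumption that the file is strict JSON: `uses_nvidia_by_default` checks for both keys, and `with_nvidia_default` adds them while leaving every other setting alone. Both function names and the `BEFORE` sample are made up for the example. `json.loads` is strict, so any `//` annotation left in the file makes it fail, and as far as I know Docker rejects such a file as well.

```python
import json


def uses_nvidia_by_default(daemon_json: str) -> bool:
    """True if the config registers an 'nvidia' runtime and makes it the default."""
    config = json.loads(daemon_json)  # raises ValueError on comments or other non-JSON
    has_runtime = "nvidia" in config.get("runtimes", {})
    return has_runtime and config.get("default-runtime") == "nvidia"


def with_nvidia_default(daemon_json: str) -> str:
    """Return a copy of the config with the nvidia runtime set as the default."""
    config = json.loads(daemon_json)
    # Keep an existing nvidia entry; otherwise add the one shown in this post.
    config.setdefault("runtimes", {}).setdefault(
        "nvidia", {"args": [], "path": "nvidia-container-runtime"}
    )
    config["default-runtime"] = "nvidia"
    return json.dumps(config, indent=4)


# Hypothetical "before" state: the runtime is registered, but runc is still the default.
BEFORE = json.dumps({
    "data-root": "/data/docker_data",
    "runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}},
})

print(uses_nvidia_by_default(BEFORE))
print(uses_nvidia_by_default(with_nvidia_default(BEFORE)))
```

The sketch works on strings on purpose, so it can be tried without touching /etc/docker/daemon.json. Writing the result back to that file and restarting Docker is a separate, manual step.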

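With this change, Kubernetes was finally able to use the GPU. (Docker has to be restarted for a change to daemon.json to take effect.)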
