k8s配置GPU感知:k8s-device-plugin的使用(已踩完坑)

1,定义

Kubernetes 的 NVIDIA 设备插件是一个 Daemonset,它允许自动:

  • 暴露集群中每个节点上的 GPU 数量
  • 跟踪 GPU 的运行状况
  • 在 Kubernetes 集群中运行支持 GPU 的容器

2,需要满足的前置条件

  • NVIDIA drivers ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
  • nvidia-container-runtime configured as the default low-level runtime
  • Kubernetes version >= 1.10

3,安装

bash 复制代码
kubect apply -f nvidia-device-plugin.yml

yaml内容如下:

复制代码
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.5.5.25:8080/nvidia/k8s-device-plugin:v0.17.0-ubi9
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

4,测试

4.1 配置yaml文件,跑一个job

复制代码
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

4.2 查看gpu-pod的log

5 遇到的问题

安装结束后,并没有发现GPU信息,通过查看/etc/docker/daemon,发现container toolkit也已经装好,但是运行docker info发现runtime还是runc,猜想可能就是这个原因,因此设置了default-runtime,如下:

复制代码
{
    "data-root":"/data/docker_data",
    "insecure-registries":[
        "192.168.237.50:8080",//私有仓库
        "127.0.0.0/8"
    ],
    "registry-mirrors":[
        "192.168.237.50:8080",//私有仓库
        "https://docker.m.daocloud.io",
        "https://docker.unsee.tech",
        "https://docker.1panel.live",
        "http://mirrors.ustc.edu.cn",
        "https://docker.chenby.cn",
        "http://mirror.azure.cn",
        "https://dockerpull.org",
        "https://dockerhub.icu",
        "https://hub.rat.dev",
        "https://proxy.1panel.live",
        "https://docker.1panel.top",
        "https://docker.m.daocloud.io",
        "https://docker.1ms.run",
        "https://docker.ketches.cn",
        "https://mirror,aliyuncs.com"
    ],
    "runtimes":{
        "nvidia":{
            "args":[],
            "path":"nvidia-container-runtime"
        }
    },
    "default-runtime":"nvidia"
}

最终实现了k8s调用GPU

相关推荐
脑瓜嗡3 小时前
Docker部署SpringBoot项目
spring boot·docker·容器
容器魔方4 小时前
KubeCon China 2025 | 与KubeEdge畅聊毕业经验与创新未来
云原生·容器·云计算
代码小学僧6 小时前
通俗易懂:给前端开发者的 Docker 入门指南
前端·docker·容器
运维潇哥6 小时前
k8s业务程序联调工具-KtConnect
云原生·容器·kubernetes
欧先生^_^6 小时前
让 Kubernetes (K8s) 集群 使用 GPU
云原生·容器·kubernetes
饺子大魔王的男人7 小时前
Docker环境下FileRise私有云盘在飞牛NAS的部署与穿透实践
运维·docker·容器
炎码工坊9 小时前
API网关Envoy的鉴权与限流:构建安全可靠的微服务网关
网络安全·微服务·云原生·系统安全·安全架构
炎码工坊9 小时前
从零掌握微服务通信安全:mTLS全解析
安全·网络安全·云原生·系统安全·安全架构
Lw老王要学习9 小时前
Linux容器篇、第一章_02Rocky9.5 系统下 Docker 的持久化操作与 Dockerfile 指令详解
linux·运维·docker·容器·云计算
炎码工坊10 小时前
API网关Kong的鉴权与限流:高并发场景下的核心实践
安全·网络安全·微服务·云原生·系统安全