k8s配置GPU感知:k8s-device-plugin的使用(已踩完坑)

1,定义

Kubernetes 的 NVIDIA 设备插件是一个 Daemonset,它允许自动:

  • 暴露集群中每个节点上的 GPU 数量
  • 跟踪 GPU 的运行状况
  • 在 Kubernetes 集群中运行支持 GPU 的容器

2,需要满足的前置条件

  • NVIDIA drivers ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
  • nvidia-container-runtime configured as the default low-level runtime
  • Kubernetes version >= 1.10

3,安装

bash 复制代码
kubect apply -f nvidia-device-plugin.yml

yaml内容如下:

复制代码
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.5.5.25:8080/nvidia/k8s-device-plugin:v0.17.0-ubi9
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

4,测试

4.1 配置yaml文件,跑一个job

复制代码
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

4.2 查看gpu-pod的log

5 遇到的问题

安装结束后,并没有发现GPU信息,通过查看/etc/docker/daemon,发现container toolkit也已经装好,但是运行docker info发现runtime还是runc,猜想可能就是这个原因,因此设置了default-runtime,如下:

复制代码
{
    "data-root":"/data/docker_data",
    "insecure-registries":[
        "192.168.237.50:8080",//私有仓库
        "127.0.0.0/8"
    ],
    "registry-mirrors":[
        "192.168.237.50:8080",//私有仓库
        "https://docker.m.daocloud.io",
        "https://docker.unsee.tech",
        "https://docker.1panel.live",
        "http://mirrors.ustc.edu.cn",
        "https://docker.chenby.cn",
        "http://mirror.azure.cn",
        "https://dockerpull.org",
        "https://dockerhub.icu",
        "https://hub.rat.dev",
        "https://proxy.1panel.live",
        "https://docker.1panel.top",
        "https://docker.m.daocloud.io",
        "https://docker.1ms.run",
        "https://docker.ketches.cn",
        "https://mirror,aliyuncs.com"
    ],
    "runtimes":{
        "nvidia":{
            "args":[],
            "path":"nvidia-container-runtime"
        }
    },
    "default-runtime":"nvidia"
}

最终实现了k8s调用GPU

相关推荐
陈桴浮海1 小时前
Kustomize实战:从0到1实现K8s多环境配置管理与资源部署
云原生·容器·kubernetes
张小凡vip2 小时前
Kubernetes--k8s中部署redis数据库服务
redis·kubernetes
Hello.Reader3 小时前
Flink Kubernetes HA(高可用)实战原理、前置条件、配置项与数据保留机制
贪心算法·flink·kubernetes
ShiLiu_mtx4 小时前
k8s - 7
云原生·容器·kubernetes
MonkeyKing_sunyuhua7 小时前
docker compose up -d --build 完全使用新代码打包的方法
docker·容器·eureka
醇氧7 小时前
【docker】mysql 8 的健康检查(Health Check)
mysql·docker·容器
匀泪11 小时前
云原生(LVS NAT模式集群实验)
服务器·云原生·lvs
70asunflower11 小时前
用Docker创建不同的容器类型
运维·docker·容器
CodeGolang12 小时前
Docker容器化部署Zabbix监控系统完整指南
docker·容器·zabbix
DolitD12 小时前
云流技术深度剖析:国内云渲染主流技术与开源和海外厂商技术实测对比
功能测试·云原生·开源·云计算·实时云渲染