Making k8s GPU-aware: using k8s-device-plugin (pitfalls already worked through)

1. Definition

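The NVIDIA device plugin for Kubernetes is a DaemonSet that lets you automatically:

  • Expose the number of GPUs on each node of the cluster
  • Keep track of the health of the GPUs
  • Run GPU-enabled containers in the Kubernetes cluster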

2. Prerequisites

  • NVIDIA drivers ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
  • nvidia-container-runtime configured as the default low-level runtime
  • Kubernetes version >= 1.10
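
The prerequisites above can be checked from a shell on each GPU node before installing the plugin. What follows is only a minimal sketch, not an official check: the helper names `version_ge` and `check_driver` are made up for illustration, the comparison relies on GNU `sort -V`, and it treats the driver requirement as a simple lower bound of 384.81.

```shell
# version_ge A B: succeeds when version A >= version B.
# Hypothetical helper; relies on GNU sort -V for version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check_driver: compare the installed NVIDIA driver version with the
# documented minimum (assumed here to be a lower bound of 384.81).
check_driver() {
  local driver
  driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
  if version_ge "$driver" "384.81"; then
    echo "driver $driver: OK"
  else
    echo "driver $driver: older than 384.81"
  fi
}
```

Running `check_driver` on a GPU node prints one line with the detected driver version. The same `version_ge` helper can be reused for the container toolkit and Kubernetes versions once you have extracted their version strings.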

3. Installation

kubectl apply -f nvidia-device-plugin.yml

The contents of nvidia-device-plugin.yml are as follows:

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.5.5.25:8080/nvidia/k8s-device-plugin:v0.17.0-ubi9
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
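
Once the manifest has been applied, it is worth confirming that the plugin pods are running and that the nodes advertise the `nvidia.com/gpu` resource. The sketch below assumes `kubectl` is already pointed at the cluster. The function names are made up for illustration; the namespace and label selector are taken from the manifest above.

```shell
# plugin_pods: list the device-plugin pods created by the DaemonSet.
plugin_pods() {
  kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
}

# gpu_capacity: print each node with the number of GPUs it reports as
# allocatable. Nodes that have not registered any GPU show <none>.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# plugin_logs: show recent plugin log lines, useful when a node reports no GPU.
plugin_logs() {
  kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=50
}
```

If `gpu_capacity` shows `<none>` for a node that physically has a GPU, `plugin_logs` is usually the first place to look.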

4. Testing

4.1 Write a YAML file and run a test Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

4.2 Check the gpu-pod log
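
A sketch of the commands for this step, assuming the Pod manifest from 4.1 was saved as `gpu-pod.yaml` (a file name chosen here for illustration, as are the function names). The vectorAdd sample prints `Test PASSED` when the container can reach a GPU, so the helper `gpu_test_passed` simply searches the Pod log for that string.

```shell
# run_gpu_pod: create the test Pod and show its status.
# gpu-pod.yaml is an assumed file name for the manifest in 4.1.
run_gpu_pod() {
  kubectl apply -f gpu-pod.yaml
  kubectl get pod gpu-pod -o wide
}

# gpu_test_passed: succeed only if the Pod log contains the sample's
# success line.
gpu_test_passed() {
  kubectl logs gpu-pod | grep -q 'Test PASSED'
}
```

Run `run_gpu_pod`, wait until the Pod shows `Completed`, then inspect the output with `kubectl logs gpu-pod` or run `gpu_test_passed && echo ok`. If the Pod stays `Pending`, `kubectl describe pod gpu-pod` normally shows whether the scheduler found a node with a free `nvidia.com/gpu`.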

5. Problems encountered

After the installation finished, no GPU information showed up on the nodes. Checking /etc/docker/daemon.json showed that the container toolkit had already been installed, but docker info reported that the default runtime was still runc. Suspecting that this was the cause, I set default-runtime to nvidia, as shown below. In this file, 192.168.237.50:8080 is my private registry. Note that daemon.json must be plain JSON, so it cannot contain // comments.

{
    "data-root":"/data/docker_data",
    "insecure-registries":[
        "192.168.237.50:8080",
        "127.0.0.0/8"
    ],
    "registry-mirrors":[
        "192.168.237.50:8080",
        "https://docker.m.daocloud.io",
        "https://docker.unsee.tech",
        "https://docker.1panel.live",
        "http://mirrors.ustc.edu.cn",
        "https://docker.chenby.cn",
        "http://mirror.azure.cn",
        "https://dockerpull.org",
        "https://dockerhub.icu",
        "https://hub.rat.dev",
        "https://proxy.1panel.live",
        "https://docker.1panel.top",
        "https://docker.m.daocloud.io",
        "https://docker.1ms.run",
        "https://docker.ketches.cn",
        "https://mirror.aliyuncs.com"
    ],
    "runtimes":{
        "nvidia":{
            "args":[],
            "path":"nvidia-container-runtime"
        }
    },
    "default-runtime":"nvidia"
}
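
After editing /etc/docker/daemon.json, Docker has to be restarted before the new default runtime takes effect. The sketch below assumes Docker is managed by systemd. The function names are illustrative, and `has_nvidia_default` is only a plain-text check that does not validate the JSON syntax.

```shell
# has_nvidia_default FILE: succeed if FILE sets "default-runtime" to "nvidia".
# A plain-text match only; it does not check that FILE is valid JSON.
has_nvidia_default() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# apply_docker_runtime: restart Docker and print the runtime it now
# uses by default. Assumes a systemd-managed Docker daemon.
apply_docker_runtime() {
  has_nvidia_default /etc/docker/daemon.json || {
    echo "default-runtime is not set to nvidia" >&2
    return 1
  }
  sudo systemctl restart docker
  docker info --format '{{.DefaultRuntime}}'
}
```

Calling `apply_docker_runtime` on the GPU node should print the name of the default runtime. If the node still reports no GPUs afterwards, restarting the plugin pods (for example with `kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset`) may be needed so that they start under the new runtime.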

With this change, Kubernetes was finally able to schedule and use the GPU.
