k8s配置GPU感知:k8s-device-plugin的使用(已踩完坑)

1,定义

Kubernetes 的 NVIDIA 设备插件是一个 Daemonset,它允许自动:

  • 暴露集群中每个节点上的 GPU 数量
  • 跟踪 GPU 的运行状况
  • 在 Kubernetes 集群中运行支持 GPU 的容器

2,需要满足的前置条件

  • NVIDIA drivers ~= 384.81
  • nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems)
  • nvidia-container-runtime configured as the default low-level runtime
  • Kubernetes version >= 1.10

3,安装

bash 复制代码
kubect apply -f nvidia-device-plugin.yml

yaml内容如下:

复制代码
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: 10.5.5.25:8080/nvidia/k8s-device-plugin:v0.17.0-ubi9
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

4,测试

4.1 配置yaml文件,跑一个job

复制代码
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

4.2 查看gpu-pod的log

5 遇到的问题

安装结束后,并没有发现GPU信息,通过查看/etc/docker/daemon,发现container toolkit也已经装好,但是运行docker info发现runtime还是runc,猜想可能就是这个原因,因此设置了default-runtime,如下:

复制代码
{
    "data-root":"/data/docker_data",
    "insecure-registries":[
        "192.168.237.50:8080",//私有仓库
        "127.0.0.0/8"
    ],
    "registry-mirrors":[
        "192.168.237.50:8080",//私有仓库
        "https://docker.m.daocloud.io",
        "https://docker.unsee.tech",
        "https://docker.1panel.live",
        "http://mirrors.ustc.edu.cn",
        "https://docker.chenby.cn",
        "http://mirror.azure.cn",
        "https://dockerpull.org",
        "https://dockerhub.icu",
        "https://hub.rat.dev",
        "https://proxy.1panel.live",
        "https://docker.1panel.top",
        "https://docker.m.daocloud.io",
        "https://docker.1ms.run",
        "https://docker.ketches.cn",
        "https://mirror,aliyuncs.com"
    ],
    "runtimes":{
        "nvidia":{
            "args":[],
            "path":"nvidia-container-runtime"
        }
    },
    "default-runtime":"nvidia"
}

最终实现了k8s调用GPU

相关推荐
资源开发与学习20 小时前
Kubernetes集群核心概念 Service
kubernetes
阿里云云原生21 小时前
【云栖大会】AI原生、AI可观测、AI Serverless、AI中间件,4场论坛20+议题公布!
云原生
容器魔方1 天前
Bloomberg 正式加入 Karmada 用户组!
云原生·容器·云计算
muyun28001 天前
Docker 下部署 Elasticsearch 8 并集成 Kibana 和 IK 分词器
elasticsearch·docker·容器
Nazi61 天前
k8s的dashboard
云原生·容器·kubernetes
傻傻虎虎1 天前
【Docker】常用帮忙、镜像、容器、其他命令合集(2)
运维·docker·容器
是小崔啊1 天前
叩丁狼K8s - 概念篇
云原生·容器·kubernetes
AKAMAI2 天前
Sport Network 凭借 Akamai 实现卓越成就
人工智能·云原生·云计算
ajax_beijing2 天前
zookeeper是啥
分布式·zookeeper·云原生
summer_west_fish2 天前
2023年系统分析师上半年论文试题分析
kubernetes