Compute Freedom: Build Your Own AI Infrastructure with K8s and Ollama

Author: 吴业亮
Blog: wuyeliang.blog.csdn.net

I. Prepare the Environment

1. Remove any existing NVIDIA driver

apt-get remove --purge nvidia*

2. Disable the stock open-source graphics driver

vi /etc/modprobe.d/blacklist-nouveau.conf

Add the following lines (append if the file already exists) to blacklist the nouveau driver:

blacklist nouveau
options nouveau modeset=0

After saving, update the initramfs so the change takes effect:

update-initramfs -u

Run sudo reboot to restart the machine. With nouveau disabled, the following command should produce no output:

lsmod | grep nouveau

3. Install gcc, g++, and make

apt install gcc g++ make -y 

II. Install the NVIDIA Driver

1. Identify the GPU model

Identify your NVIDIA GPU model so that you download the matching driver version:

lspci | grep -i vga
root@ubuntu:~# lspci | grep -i vga
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
af:00.0 VGA compatible controller: NVIDIA Corporation TU104GL [Quadro RTX 4000] (rev a1)

Download the matching driver from the NVIDIA website: https://www.nvidia.cn/Download/index.aspx?lang=cn#

For example:

# wget https://us.download.nvidia.com/XFree86/Linux-x86_64/550.120/NVIDIA-Linux-x86_64-550.120.run

Change into the directory where you downloaded the driver and run the installer, substituting your own file name:

sudo ./NVIDIA-Linux-x86_64-550.120.run
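
If the .run file is not yet executable, mark it executable first. On headless compute nodes, some setups also pass --no-opengl-files to skip the OpenGL libraries; treat that flag as optional and specific to your environment:

chmod +x NVIDIA-Linux-x86_64-550.120.run
sudo ./NVIDIA-Linux-x86_64-550.120.run --no-opengl-files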

Verify the installation:

root@ubuntu:~# nvidia-smi
Mon Oct 14 08:55:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 4000                Off |   00000000:AF:00.0 Off |                  N/A |
| 30%   27C    P8              9W /  125W |      97MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
 
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     56201      C   python3                                        94MiB |
+-----------------------------------------------------------------------------------------+

To uninstall the driver:

# sh ./NVIDIA-Linux-x86_64-550.120.run --uninstall

III. Install KubeSphere and K8s

Install the required dependencies and set the KKZONE download region:

sudo apt-get update
sudo apt install socat conntrack ebtables ipset jq -y
export KKZONE=cn

Download the latest version of KubeKey:

curl -sfL https://get-kk.kubesphere.io | sh -

Or pin a specific version:

curl -sfL https://get-kk.kubesphere.io | VERSION=v3.1.10 sh -

Note: available KubeKey releases are listed at:

https://github.com/kubesphere/kubekey/releases/
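
To confirm which KubeKey binary you actually downloaded, kk provides a version subcommand (a quick sanity check; output format may vary across releases):

./kk version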

Start installing KubeSphere and K8s:

./kk create cluster --with-kubernetes v1.28.12 --with-kubesphere v3.4.1
./kk create cluster --with-kubernetes v1.30.10 --with-kubesphere v4.1.3
or:
./kk create cluster --with-kubernetes v1.30.10 

......

Installation is complete.

Please check the result using the command:

        kubectl get pod -A


Note: KubeSphere v4.1.3 has been tested against Kubernetes v1.21 through v1.30; compatibility with untested versions is not guaranteed.

Kubernetes releases are listed at:

https://github.com/kubernetes/kubernetes/releases

KubeSphere releases are listed at:

https://github.com/kubesphere/kubesphere/releases

IV. Install the GPU Operator

1. Prerequisites

Check whether Node Feature Discovery (NFD) is already running:

# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

If the command returns true, NFD is already running in the cluster, and you must disable the bundled NFD deployment when installing the Operator.

Note: K8s clusters deployed with KubeSphere do not install or configure NFD by default.
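
If NFD does turn out to be present, you can double-check by listing its pods, and then pass the chart's nfd.enabled=false value at install time to skip the bundled copy (value name per the upstream gpu-operator chart; verify against your chart version):

kubectl get pods -A | grep -i node-feature-discovery
helm install --wait --generate-name -n gpu-operator ./gpu-operator-v25.3.1.tgz --set driver.enabled=false --set nfd.enabled=false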

2. Configure a proxy

The all_proxy variable below applies to requests over every protocol (HTTP, HTTPS, FTP, and so on):

export all_proxy=http://192.168.1.119:7890
export NO_PROXY=localhost,127.0.0.1,.svc,.cluster.local,192.168.1.247,lb.kubesphere.local

Test:

root@ubuntu:~# curl google.com
root@ubuntu:~# curl http://google.com
root@ubuntu:~# echo $?
0

Configure a proxy for Docker:

root@ubuntu:~# systemctl  status  docker | grep system | grep enabled
     Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)

Create a systemd drop-in with the proxy settings:

mkdir /etc/systemd/system/docker.service.d
rm -rf /etc/systemd/system/docker.service.d/https-proxy.conf

cat <<EOF>> /etc/systemd/system/docker.service.d/https-proxy.conf
[Service]
Environment="HTTPS_PROXY=http://192.168.1.119:7890/"
Environment="NO_PROXY=localhost,127.0.0.0/8,docker-registry.somecorporation.com,172.16.0.0/16,192.168.1.0/24"
EOF

systemctl daemon-reload
service docker restart

Test:

docker search mysql
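
As an additional check, docker info reports the proxy settings the daemon actually loaded:

docker info | grep -i proxy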

3. Add the NVIDIA Helm repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
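
To list the chart versions available in the repository before choosing one (the -l flag lists all versions):

helm search repo nvidia/gpu-operator -l | head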

4. Install the GPU Operator

Download the gpu-operator chart for offline installation:

helm fetch nvidia/gpu-operator --version v25.3.1

Create the namespace:

root@ubuntu:~# kubectl create namespace gpu-operator
namespace/gpu-operator created

Install the GPU Operator with the default configuration, but with automatic driver installation disabled, since the driver is already installed on the host:

root@ubuntu:~# helm install --wait --generate-name -n gpu-operator  ./gpu-operator-v25.3.1.tgz --set driver.enabled=false
NAME: gpu-operator-v25-1750221680
LAST DEPLOYED: Wed Jun 18 04:41:20 2025
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Verify:

# kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-czdf5                                   1/1     Running     0          15m
gpu-feature-discovery-q9qlm                                   1/1     Running     0          15m
gpu-operator-67c68ddccf-x29pm                                 1/1     Running     0          15m
gpu-operator-node-feature-discovery-gc-57457b6d8f-zjqhr       1/1     Running     0          15m
gpu-operator-node-feature-discovery-master-5fb74ff754-fzbzm   1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-68459              1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-74ps5              1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-dpmg9              1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-jvk4t              1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-k5kwq              1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-ll4bk              1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-p4q5q              1/1     Running     0          15m
gpu-operator-node-feature-discovery-worker-rmk99              1/1     Running     0          15m
nvidia-container-toolkit-daemonset-9zcnj                      1/1     Running     0          15m
nvidia-container-toolkit-daemonset-kcz9g                      1/1     Running     0          15m
nvidia-cuda-validator-l8vjb                                   0/1     Completed   0          14m
nvidia-cuda-validator-svn2p                                   0/1     Completed   0          13m
nvidia-dcgm-exporter-9lq4c                                    1/1     Running     0          15m
nvidia-dcgm-exporter-qhmkg                                    1/1     Running     0          15m
nvidia-device-plugin-daemonset-7rvfm                          1/1     Running     0          15m
nvidia-device-plugin-daemonset-86gx2                          1/1     Running     0          15m
nvidia-operator-validator-csr2z                               1/1     Running     0          15m
nvidia-operator-validator-svlc4                               1/1     Running     0          15m
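
Once the device plugin pods are Running, each GPU node should advertise nvidia.com/gpu as an allocatable resource. A quick check using jq (installed earlier):

kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"]}'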

V. GPU Functional Verification Tests

1. Test example 1: CUDA verification

After the GPU Operator is installed correctly, use a CUDA base image to verify that K8s can create Pods that consume GPU resources.

Create the manifest file cuda-ubuntu.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-ubuntu2204
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-ubuntu2204
    image: "nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04"
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["nvidia-smi"]

Create the resource:

kubectl apply -f cuda-ubuntu.yaml

Check the created resource

The output shows the pod was scheduled on node ksp-gpu-worker-2 (GPU model: Tesla P100-PCIE-16GB):

# kubectl get pods -o wide
NAME                      READY   STATUS      RESTARTS   AGE   IP             NODE               NOMINATED NODE   READINESS GATES
cuda-ubuntu2204           0/1     Completed   0          73s   10.233.99.15   ksp-gpu-worker-2   <none>           <none>
ollama-79688d46b8-vxmhg   1/1     Running     0          47m   10.233.72.17   ksp-gpu-worker-1   <none>           <none>

View the Pod logs:

kubectl logs pod/cuda-ubuntu2204

A successful run produces output like the following:

# kubectl logs pod/cuda-ubuntu2204
Mon Jul  8 11:10:59 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P100-PCIE-16GB           Off |   00000000:00:10.0 Off |                    0 |
| N/A   40C    P0             26W /  250W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Clean up the test resources:

kubectl delete -f cuda-ubuntu.yaml

2. Test example 2: official GPU application sample

Run a simple CUDA sample that adds two vectors.

Create the manifest file cuda-vectoradd.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1

Create the Pod:

# kubectl apply -f cuda-vectoradd.yaml

Check the Pod result

Once the Pod is created, it runs the vectorAdd command and then exits:

# kubectl logs pod/cuda-vectoradd

A successful run produces:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Clean up the test resources:

kubectl delete -f cuda-vectoradd.yaml

VI. Deploy Ollama with KubeSphere

The verification above proves that Pods using GPU resources can be created on the K8s cluster. Next, to match a realistic workload, we use KubeSphere to deploy Ollama, a large-model management tool, on the K8s cluster.

1. Create the deployment manifest

This example is a simple test, so it uses hostPath storage; for real deployments, replace it with a StorageClass or another form of persistent storage.

Create the manifest deploy-ollama.yaml:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: ollama
  namespace: default
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      volumes:
        - name: ollama-models
          hostPath:
            path: /data/openebs/local/ollama
            type: ''
        - name: host-time
          hostPath:
            path: /etc/localtime
            type: ''
      containers:
        - name: ollama
          image: 'ollama/ollama:latest'
          ports:
            - name: http-11434
              containerPort: 11434
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: ollama-models
              mountPath: /root/.ollama
            - name: host-time
              readOnly: true
              mountPath: /etc/localtime
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
---
kind: Service
apiVersion: v1
metadata:
  name: ollama
  namespace: default
  labels:
    app: ollama
spec:
  ports:
    - name: http-11434
      protocol: TCP
      port: 11434
      targetPort: 11434
      nodePort: 31434
  selector:
    app: ollama
  type: NodePort

Special note: the KubeSphere management console also supports configuring GPU resources for Deployments and other workloads graphically; interested readers can explore this on their own.

2. Deploy the Ollama service

Create Ollama:

kubectl apply -f deploy-ollama.yaml

Check the Pod creation result

The output shows the pod was scheduled on node ksp-gpu-worker-1 (GPU model: Tesla M40 24GB):

# kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE               NOMINATED NODE   READINESS GATES
ollama-79688d46b8-vxmhg   1/1     Running   0          12s   10.233.72.17   ksp-gpu-worker-1   <none>           <none>
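
The NodePort Service can be checked the same way; port 31434 should appear in the PORT(S) column:

kubectl get svc ollama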

View the container log:

kubectl logs ollama-79688d46b8-vxmhg
2024/07/08 18:24:27 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-08T18:24:27.829+08:00 level=INFO source=images.go:730 msg="total blobs: 5"
time=2024-07-08T18:24:27.829+08:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0"
time=2024-07-08T18:24:27.830+08:00 level=INFO source=routes.go:1111 msg="Listening on [::]:11434 (version 0.1.48)"
time=2024-07-08T18:24:27.830+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2414166698/runners
time=2024-07-08T18:24:32.454+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60101]"
time=2024-07-08T18:24:32.567+08:00 level=INFO source=types.go:98 msg="inference compute" id=GPU-9e48dc13-f8f1-c6bb-860f-c82c96df22a4 library=cuda compute=5.2 driver=12.4 name="Tesla M40 24GB" total="22.4 GiB" available="22.3 GiB"

3. Pull the model for Ollama

To save time, this example uses Alibaba's open-source qwen2 1.5b small model as the test model:

kubectl exec -it ollama-79688d46b8-vxmhg -- ollama pull qwen2:1.5b

A successful pull produces:

# kubectl exec -it ollama-79688d46b8-vxmhg -- ollama pull qwen2:1.5b
pulling manifest
pulling 405b56374e02... 100% ▕█████████████████████████████████████████████████████▏ 934 MB
pulling 62fbfd9ed093... 100% ▕█████████████████████████████████████████████████████▏  182 B
pulling c156170b718e... 100% ▕█████████████████████████████████████████████████████▏  11 KB
pulling f02dd72bb242... 100% ▕█████████████████████████████████████████████████████▏   59 B
pulling c9f5e9ffbc5f... 100% ▕█████████████████████████████████████████████████████▏  485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
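
To confirm the model is registered, list the local models inside the pod:

kubectl exec -it ollama-79688d46b8-vxmhg -- ollama list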

Inspect the model files

On node ksp-gpu-worker-1, run:

$ ls -R /data/openebs/local/ollama/
/data/openebs/local/ollama/:
id_ed25519  id_ed25519.pub  models

/data/openebs/local/ollama/models:
blobs  manifests

/data/openebs/local/ollama/models/blobs:
sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
sha256-62fbfd9ed093d6e5ac83190c86eec5369317919f4b149598d2dbb38900e9faef
sha256-c156170b718ec29139d3653d40ed1986fd92fb7e0959b5c71f3c48f62e6636f4
sha256-c9f5e9ffbc5f14febb85d242942bd3d674a8e4c762aaab034ec88d6ba839b596
sha256-f02dd72bb2423204352eabc5637b44d79d17f109fdb510a7c51455892aa2d216

/data/openebs/local/ollama/models/manifests:
registry.ollama.ai

/data/openebs/local/ollama/models/manifests/registry.ollama.ai:
library

/data/openebs/local/ollama/models/manifests/registry.ollama.ai/library:
qwen2

/data/openebs/local/ollama/models/manifests/registry.ollama.ai/library/qwen2:
1.5b

4. Test the model's capabilities

Call the chat API:

curl http://192.168.9.91:31434/api/chat -d '{
  "model": "qwen2:1.5b",
  "messages": [
    { "role": "user", "content": "用20个字,介绍你自己" }
  ]
}'

Test result:

$ curl http://192.168.9.91:31434/api/chat -d '{
  "model": "qwen2:1.5b",
  "messages": [
    { "role": "user", "content": "用20个字,介绍你自己" }
  ]
}'
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.011798927Z","message":{"role":"assistant","content":"我"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.035291669Z","message":{"role":"assistant","content":"是一个"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.06360233Z","message":{"role":"assistant","content":"人工智能"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.092411266Z","message":{"role":"assistant","content":"助手"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.12016935Z","message":{"role":"assistant","content":","},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.144921623Z","message":{"role":"assistant","content":"专注于"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.169803961Z","message":{"role":"assistant","content":"提供"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.194796364Z","message":{"role":"assistant","content":"信息"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.21978104Z","message":{"role":"assistant","content":"和"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.244976103Z","message":{"role":"assistant","content":"帮助"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.270233992Z","message":{"role":"assistant","content":"。"},"done":false}
{"model":"qwen2:1.5b","created_at":"2024-07-08T09:54:48.29548561Z","message":{"role":"assistant","content":""},"done_reason":"stop","done":true,"total_duration":454377627,"load_duration":1535754,"prompt_eval_duration":36172000,"eval_count":12,"eval_duration":287565000}

5. View GPU allocation information

View the GPU resources already allocated on the worker node:

$ kubectl describe node ksp-gpu-worker-1 | grep "Allocated resources" -A 9
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests        Limits
  --------           --------        ------
  cpu                487m (13%)      2 (55%)
  memory             315115520 (2%)  800Mi (5%)
  ephemeral-storage  0 (0%)          0 (0%)
  hugepages-1Gi      0 (0%)          0 (0%)
  hugepages-2Mi      0 (0%)          0 (0%)
  nvidia.com/gpu     1               1
Physical GPU usage while Ollama is running:
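
A minimal way to check this, assuming the NVIDIA container toolkit has injected nvidia-smi into the pod (as the GPU Operator normally does):

kubectl exec -it ollama-79688d46b8-vxmhg -- nvidia-smi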