Deploy a GPU-Enabled Kind Cluster with One Command and Experience Blazing-Fast vLLM Inference

As Kubernetes sees ever wider use for large-model training and inference, more and more developers need a GPU-capable Kubernetes cluster in their local environment for testing and development. Kind (Kubernetes IN Docker) is the usual choice for running Kubernetes locally: it is lightweight, fast, and easy to use. However, building a Kind cluster that can actually use GPUs requires extra work, such as installing the GPU driver, configuring the NVIDIA Container Toolkit, and setting up the NVIDIA Container Runtime. Moreover, if you want to spread GPUs evenly across multiple Kind nodes to simulate a multi-node GPU cluster, a plain Kind configuration may not be enough.

In this post we will show how to spin up a GPU-enabled Kind cluster with a one-click script. With nvkind, we can easily distribute GPU resources evenly across Kind nodes and thereby simulate a multi-node Kind GPU cluster.

Finally, we will deploy a large model with vLLM in the Kind cluster to verify that the GPU cluster works as expected.

Prerequisites

The complete scripts can be found here: https://github.com/cr7258/hands-on-lab/tree/main/ai/gpu/setup . They are written for Ubuntu.

  • A machine with an NVIDIA GPU
  • Administrator privileges (sudo access)
  • A stable internet connection
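
Before running the script, it is worth a quick sanity check that the machine actually meets these requirements. The commands below are a minimal sketch (they are not part of the install script) and assume a Debian/Ubuntu host:

bash
# Check that an NVIDIA GPU is visible on the PCI bus
lspci | grep -i nvidia

# Confirm that you have sudo privileges
sudo -v && echo "sudo access OK"

# Confirm outbound connectivity (the script downloads several components)
curl -fsSI https://github.com > /dev/null && echo "network OK"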

If you don't have a GPU server locally, you can also rent one from a cloud provider such as Alibaba Cloud, AWS, or Azure. In my experience, GPU instances on Alibaba Cloud are relatively affordable, and prices vary by region; US regions are usually cheaper. Choosing spot (preemptible) instances brings the price down further, which makes them a good fit for development and testing where stability is less critical.

In addition, Alibaba Cloud ECS instances can be put into "economical stop" mode when not in use: disk data is preserved for the next session while you stop paying for CPU, memory, and GPU compute. Since large models take up a lot of storage, it is worth provisioning a somewhat larger disk when creating the instance, say 100GB, to avoid the hassle of frequent resizing.

Run the Script

Just run the following command; the script automatically completes all the preparation work and creates a GPU-enabled Kind cluster.

bash
bash install.sh

Script Walkthrough

The following sections walk through each part of the script.

Install Command-Line Tools

First, install the necessary command-line tools: Docker, kubectl, Helm, Kind, and nvkind.

bash
sudo apt update
sudo apt install -y docker.io
sudo snap install kubectl --classic
# Add kubectl completion to bashrc
echo 'source <(kubectl completion bash)' >> ~/.bashrc
source ~/.bashrc
sudo snap install helm --classic

# Install kind
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Install nvkind
curl -L -o ~/nvkind-linux-amd64.tar.gz https://github.com/Jeffwan/kind-with-gpus-examples/releases/download/v0.1.0/nvkind-linux-amd64.tar.gz
tar -xzvf ~/nvkind-linux-amd64.tar.gz
sudo mv nvkind-linux-amd64 /usr/local/bin/nvkind
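
After these tools are installed, it is a good idea to confirm that each binary is on the PATH and runs. A quick, optional check (the exact version output will differ on your machine; nvkind may not expose a dedicated version flag, so we simply print its usage text):

bash
docker --version
kubectl version --client
helm version --short
kind version
# Printing the usage text is enough to confirm the nvkind binary works
nvkind --help | head -n 5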

Install the NVIDIA GPU Driver

To use the GPU, we need to install the NVIDIA driver; here we use driver version 565.57.01.

bash
wget https://cn.download.nvidia.com/tesla/565.57.01/NVIDIA-Linux-x86_64-565.57.01.run
sh NVIDIA-Linux-x86_64-565.57.01.run --silent
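
Once the installer finishes, a quick sanity check is to ask the driver for its version (the script runs fuller GPU checks later on):

bash
# Should print 565.57.01 if the driver installed correctly
nvidia-smi --query-gpu=driver_version --format=csv,noheader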

Install and Configure the NVIDIA Container Toolkit

The NVIDIA Container Toolkit's main job is to make NVIDIA GPU devices available inside containers.

bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Next, configure the NVIDIA Container Runtime:

  • The first command makes sure Docker is configured to work with the NVIDIA toolkit and sets the nvidia runtime as the default.
  • The second command enables a toolkit feature flag that lets us inject GPU support into each Kind worker node.
bash
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
sudo systemctl restart docker
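
If you are curious what these two commands actually changed, you can inspect the resulting configuration. On a typical Ubuntu host the relevant files are /etc/docker/daemon.json and /etc/nvidia-container-runtime/config.toml (paths may differ depending on how Docker is installed):

bash
# The nvidia runtime should now be registered and set as the default runtime
cat /etc/docker/daemon.json

# The volume-mount based device injection flag should be set to true
grep accept-nvidia-visible-devices-as-volume-mounts /etc/nvidia-container-runtime/config.toml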

Verify GPU Availability

Verify that the GPU is usable in several ways:

  • Run nvidia-smi to check that the GPU is visible.
  • Run a Docker container with the NVIDIA runtime and verify that it can see the GPU.
  • Make sure the GPU devices are accessible from inside a container.
bash
# Run nvidia-smi to list GPU devices
nvidia-smi -L
if [ $? -ne 0 ]; then
    echo "nvidia-smi failed to execute."
    exit 1
fi

# Run a Docker container with NVIDIA runtime to list GPU devices
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu:20.04 nvidia-smi -L
if [ $? -ne 0 ]; then
    echo "Docker command with NVIDIA runtime failed to execute."
    exit 1
fi

# Run a Docker container with mounted /dev/null to check GPU accessibility
docker run -v /dev/null:/var/run/nvidia-container-devices/all ubuntu:20.04 nvidia-smi -L
if [ $? -ne 0 ]; then
    echo "Docker command with mounted /dev/null failed to execute."
    exit 1
fi

Create the Kind GPU Cluster

First, generate an nvkind configuration template that creates one worker node per GPU on the server, then run nvkind to create the GPU-enabled Kind cluster.

bash
cat << 'EOF' > one-worker-per-gpu.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
{{- range $gpu := until numGPUs }}
- role: worker
  extraMounts:
    # We inject all NVIDIA GPUs using the nvidia-container-runtime.
    # This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
    # in `/etc/nvidia-container-runtime/config.toml`
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/{{ $gpu }}
{{- end }}
EOF

nvkind cluster create --name gpu-cluster --config-template=one-worker-per-gpu.yaml
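
Once the cluster is up, a quick way to confirm that each worker really received a GPU is to list the nodes and run nvidia-smi inside one of the Kind node containers. The container name below follows Kind's usual <cluster-name>-worker naming and may differ in your environment:

bash
# Expect one control-plane node plus one worker per GPU
kubectl get nodes -o wide

# Kind nodes are Docker containers; check that a worker can see its GPU
docker exec gpu-cluster-worker nvidia-smi -L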

Install the GPU Operator

The GPU Operator is a Kubernetes Operator from NVIDIA that simplifies using GPUs in a Kubernetes cluster by automating components such as GPU driver installation, the NVIDIA Device Plugin, and the DCGM Exporter.

bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# The NVIDIA driver is already installed on the host, so disable driver installation in the GPU Operator
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false
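
After the Helm release is installed, you can check that the operator components are running and that the worker nodes now advertise nvidia.com/gpu as a schedulable resource, for example:

bash
# All GPU Operator pods should eventually reach Running or Completed
kubectl get pods -n gpu-operator

# The nodes should report nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"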

Install Cloud Provider Kind

Finally, the script installs Cloud Provider Kind, which allows services to be exposed as type LoadBalancer so they can be reached from outside the Kind cluster.

bash
curl -L ${KIND_CLOUD_PROVIDER_URL} -o cloud-provider-kind.tar.gz
tar -xvzf cloud-provider-kind.tar.gz
chmod +x cloud-provider-kind
sudo mv cloud-provider-kind /usr/local/bin/

# Run cloud-provider-kind in the background and forward logs
echo "Starting cloud-provider-kind in the background..."
LOG_FILE="/tmp/cloud-provider-kind.log"

nohup cloud-provider-kind > ${LOG_FILE} 2>&1 &

# Save the process ID
echo $! > /tmp/cloud-provider-kind.pid
echo "Cloud Provider Kind is running in the background. Logs are being written to ${LOG_FILE}."

echo "Setup complete. All components have been installed successfully."

Run a Large Model with vLLM

Next, we run the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model with vLLM to test that the Kind GPU cluster works correctly. vLLM is a fast and easy-to-use library for LLM inference and serving that integrates seamlessly with Hugging Face.

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  namespace: default
  labels:
    app: deepseek-r1-distill-qwen-1-5b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-distill-qwen-1-5b
  template:
    metadata:
      labels:
        app: deepseek-r1-distill-qwen-1-5b
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: deepseek-r1-distill-qwen-1-5b
      containers:
      - name: deepseek-r1-distill-qwen-1-5b
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5
        volumeMounts:
          - mountPath: /root/.cache/huggingface
            name: cache-volume
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-distill-qwen-1-5b
  namespace: default
spec:
  ports:
  - name: deepseek-r1-distill-qwen-1-5b
    port: 80
    protocol: TCP
    targetPort: 8000
  selector:
    app: deepseek-r1-distill-qwen-1-5b
  type: LoadBalancer
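
Save the manifest above to a file (the name vllm-deepseek.yaml below is just an example) and apply it, then watch the Pod come up:

bash
kubectl apply -f vllm-deepseek.yaml

# The Pod stays unready while the image and the model weights are downloaded
kubectl get pods -w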

The vllm/vllm-openai:latest image is roughly 8GB, so pulling it takes a while, and once vLLM starts it still has to download the model weights from Hugging Face. On the first Pod start, downloading the weights took 508 seconds. Because the weights are cached in the PVC mounted at /root/.cache/huggingface, subsequent Pod restarts skip the download and start much faster.

bash
> kubectl logs deepseek-r1-distill-qwen-1-5b-66dc67d4dc-86b2v -f
INFO 03-22 12:14:52 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 12:14:56 [api_server.py:977] vLLM API server version 0.8.1
INFO 03-22 12:14:56 [api_server.py:978] args: Namespace(subparser='serve', model_tag='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=1024, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=True, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', 
worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x701e1ce19a80>)
INFO 03-22 12:15:09 [config.py:583] This model supports multiple tasks: {'score', 'embed', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 03-22 12:15:09 [config.py:1693] Chunked prefill is enabled with max_num_batched_tokens=1024.
INFO 03-22 12:15:16 [__init__.py:256] Automatically detected platform cuda.
INFO 03-22 12:15:18 [core.py:53] Initializing a V1 LLM engine (v0.8.1) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-22 12:15:20 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x741a9f769130>
INFO 03-22 12:15:20 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-22 12:15:20 [cuda.py:215] Using Flash Attention backend on V1 engine.
INFO 03-22 12:15:20 [gpu_model_runner.py:1164] Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
INFO 03-22 12:15:21 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
INFO 03-22 12:15:21 [weight_utils.py:257] Using model weights format ['*.safetensors']

# Downloading the model weights took 508 seconds
INFO 03-22 12:23:49 [weight_utils.py:273] Time spent downloading weights for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B: 508.576789 seconds
INFO 03-22 12:23:49 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.02it/s]

INFO 03-22 12:23:50 [loader.py:429] Loading weights took 0.56 seconds
INFO 03-22 12:23:50 [gpu_model_runner.py:1176] Model loading took 3.3465 GB and 509.679959 seconds
INFO 03-22 12:23:56 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/b4fe10a8ad/rank_0_0 for vLLM's torch.compile
INFO 03-22 12:23:56 [backends.py:419] Dynamo bytecode transform time: 6.04 s
INFO 03-22 12:23:59 [backends.py:132] Cache the graph of shape None for later use
INFO 03-22 12:24:19 [backends.py:144] Compiling a graph for general shape takes 22.07 s
INFO 03-22 12:24:27 [monitor.py:33] torch.compile takes 28.11 s in total
INFO 03-22 12:24:28 [kv_cache_utils.py:537] GPU KV cache size: 525,520 tokens
INFO 03-22 12:24:28 [kv_cache_utils.py:540] Maximum concurrency for 131,072 tokens per request: 4.01x
INFO 03-22 12:24:49 [gpu_model_runner.py:1499] Graph capturing finished in 21 secs, took 0.45 GiB
INFO 03-22 12:24:50 [core.py:138] init engine (profile, create kv cache, warmup model) took 59.57 seconds
INFO 03-22 12:24:50 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 03-22 12:24:50 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 03-22 12:24:50 [api_server.py:1024] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-22 12:24:50 [launcher.py:26] Available routes are:
INFO 03-22 12:24:50 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 03-22 12:24:50 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 03-22 12:24:50 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 03-22 12:24:50 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 03-22 12:24:50 [launcher.py:34] Route: /health, Methods: GET
INFO 03-22 12:24:50 [launcher.py:34] Route: /load, Methods: GET
INFO 03-22 12:24:50 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 03-22 12:24:50 [launcher.py:34] Route: /version, Methods: GET
INFO 03-22 12:24:50 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /pooling, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /score, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /rerank, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 03-22 12:24:50 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Confirm that the Pod is Ready.

bash
> kubectl get pod 
NAME                                             READY   STATUS    RESTARTS   AGE
deepseek-r1-distill-qwen-1-5b-66dc67d4dc-86b2v   1/1     Running   0          26m

> kubectl get svc
NAME                            TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
deepseek-r1-distill-qwen-1-5b   LoadBalancer   10.96.11.243   172.18.0.4    80:31629/TCP   26m

Access the model through the EXTERNAL-IP exposed by the LoadBalancer.

bash
curl --location 'http://172.18.0.4/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
      {
        "role": "user",
        "content": "你是谁?"
      }
    ]
}'
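
You can also query the /v1/models endpoint that the vLLM server exposes (see the route list in the startup logs above) to confirm which models are being served:

bash
curl http://172.18.0.4/v1/models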

The response is shown below; the model handled our request successfully.

json
{
  "id": "chatcmpl-06fde7b60b5642b5a2100b6146d351fa",
  "object": "chat.completion",
  "created": 1742646615,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "我是DeepSeek-R1,一个由深度求索公司开发的智能助手,我擅长通过思考来帮您解答复杂的数学,代码和逻辑推理等理工类问题。\n</think>\n\n我是DeepSeek-R1,一个由深度求索公司开发的智能助手,我擅长通过思考来帮您解答复杂的数学,代码和逻辑推理等理工类问题。",
        "tool_calls": [
          
        ]
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 85,
    "completion_tokens": 77,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

If we restart the Pod, loading the model weights now takes only 0.55 seconds, which confirms that the weights have been cached.

bash
INFO 03-22 12:51:31 [weight_utils.py:257] Using model weights format ['*.safetensors']
INFO 03-22 12:51:31 [weight_utils.py:307] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.06it/s]

INFO 03-22 12:51:31 [loader.py:429] Loading weights took 0.55 seconds
INFO 03-22 12:51:31 [gpu_model_runner.py:1176] Model loading took 3.3465 GB and 0.912167 seconds
INFO 03-22 12:51:37 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/b4fe10a8ad/rank_0_0 for vLLM's torch.compile
INFO 03-22 12:51:37 [backends.py:419] Dynamo bytecode transform time: 5.92 s
INFO 03-22 12:51:40 [backends.py:132] Cache the graph of shape None for later use
INFO 03-22 12:52:00 [backends.py:144] Compiling a graph for general shape takes 22.21 s
INFO 03-22 12:52:09 [monitor.py:33] torch.compile takes 28.13 s in total
INFO 03-22 12:52:10 [kv_cache_utils.py:537] GPU KV cache size: 525,520 tokens
INFO 03-22 12:52:10 [kv_cache_utils.py:540] Maximum concurrency for 131,072 tokens per request: 4.01x
INFO 03-22 12:52:31 [gpu_model_runner.py:1499] Graph capturing finished in 21 secs, took 0.45 GiB
INFO 03-22 12:52:31 [core.py:138] init engine (profile, create kv cache, warmup model) took 59.32 seconds
INFO 03-22 12:52:31 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 03-22 12:52:31 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.95}
INFO 03-22 12:52:31 [api_server.py:1024] Starting vLLM API server on http://0.0.0.0:8000

Cleanup

When you are done testing and want to delete the cluster, run:

bash
bash cleanup.sh
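
The exact contents of cleanup.sh live in the repository linked above. Conceptually it needs to delete the Kind cluster and stop the background cloud-provider-kind process; a minimal sketch, assuming the names and paths used earlier in this post, might look like:

bash
# Delete the Kind cluster created earlier (nvkind clusters are regular Kind clusters)
kind delete cluster --name gpu-cluster

# Stop the background cloud-provider-kind process started by install.sh
kill "$(cat /tmp/cloud-provider-kind.pid)" && rm -f /tmp/cloud-provider-kind.pid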

Summary

This post showed how to quickly set up a GPU-enabled Kind cluster locally with a one-click script, which is well suited to large-model development and testing. Using nvkind, we distributed GPU resources across multiple nodes, and with vLLM we successfully deployed the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model.
