Installing the Qwen3 LLM on Ubuntu 22.04

Overview

This post records how I installed a single RTX 3090 in a Huawei server and deployed a quantized Qwen3 model on it.

Environment

  1. Server
  • Huawei 2288H V5
    • 2 CPUs
    • 256 GB RAM
    • System disks: 2 × 300 GB 15K (RAID 1)
    • Service and model storage: 2 × 1 TB SSD (RAID 1)
    • RTX 3090, 24 GB
    • 2 × 900 W power supplies (redundant)
  2. Operating system
  • Ubuntu 22.04
  3. Python
  • Python 3.12 (via pyenv)
  4. NVIDIA driver
  • NVIDIA-SMI 595.71.05
  5. CUDA
  • CUDA 13.0
  6. vLLM
  • vLLM 0.24
  7. Models
  • Qwen3-14B-AWQ (currently in use)
  • Qwen3-8B-AWQ
  • Qwen3-4B
  • Qwen3.5-4B

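Installing the GPU in the server

  1. Installing a 3090 in the Huawei server requires a riser card with a power connector, fed from the server's power supply.
  2. The riser's power connector is 8-pin, so prepare a matching GPU-to-riser power cable.
  3. The server takes at most two GPUs. Check the power budget of the server's supplies before adding cards.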
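
The power check in step 3 is simple arithmetic. The sketch below assumes that only one of the two redundant 900 W supplies counts toward the budget and takes 350 W per RTX 3090 from the power cap that nvidia-smi reports later in this post. The 400 W base draw for the rest of the server is a placeholder, not a measurement, and `psu_headroom` is a hypothetical helper name.

```shell
# psu_headroom PSU_WATTS BASE_WATTS GPU_WATTS GPU_COUNT
# Prints the remaining watts; a negative number means the budget is exceeded.
psu_headroom() {
  local psu=$1 base=$2 gpu=$3 count=$4
  echo $(( psu - base - gpu * count ))
}

# 900 W supply, 400 W assumed base draw (placeholder), 350 W per RTX 3090.
psu_headroom 900 400 350 1   # one GPU  -> 150
psu_headroom 900 400 350 2   # two GPUs -> -200
```

With these placeholder numbers a second 3090 would not fit, which is why the base draw should be replaced with a measured value before deciding.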

Installing the operating system

Install Ubuntu 22.04 (steps omitted).

Installing the NVIDIA driver

Checking the GPU driver

  1. The ubuntu-drivers command depends on aplay
# Install aplay
sudo apt install alsa-utils
  1. List the available drivers
sudo ubuntu-drivers devices
== /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 ==
modalias : pci:v000010DEd00002204sv00001458sd0000403Bbc03sc00i00
vendor   : NVIDIA Corporation
model    : GA102 [GeForce RTX 3090]
......
driver   : nvidia-driver-535 - third-party non-free
driver   : nvidia-driver-595 - distro non-free
driver   : nvidia-driver-580-server - distro non-free
driver   : nvidia-driver-535-server-open - distro non-free
driver   : nvidia-driver-595-open - distro non-free recommended
driver   : nvidia-driver-580-server-open - distro non-free
......
driver   : xserver-xorg-video-nouveau - distro free builtin
  1. Install the recommended driver (the one tagged "recommended")
sudo apt install nvidia-driver-595-open
  1. Hold the package to prevent automatic upgrades
sudo apt-mark hold nvidia-driver-595-open
  1. Reboot after installation
reboot
# init 6
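
The recommended package can also be picked out of the `ubuntu-drivers devices` output mechanically. This is a minimal sketch that assumes the output keeps the `driver : <package> - ... recommended` layout shown above; `recommended_driver` is a hypothetical helper name.

```shell
# recommended_driver: read `ubuntu-drivers devices` output on stdin and
# print the package name from the first line tagged "recommended".
recommended_driver() {
  awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Sample lines copied from the output above.
sample='driver   : nvidia-driver-595 - distro non-free
driver   : nvidia-driver-595-open - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin'

printf '%s\n' "$sample" | recommended_driver   # -> nvidia-driver-595-open
```

On a real machine the input would come from `sudo ubuntu-drivers devices | recommended_driver`, and the result is worth reading before passing it to `apt install`.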

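Installing nvidia-fabricmanager (optional)

  • nvidia-fabricmanager manages multiple NVIDIA GPUs that are interconnected through NVLink or NVSwitch.
  • On a single-GPU machine the service reports an error at startup, so in this setup it can be left disabled.

Checking the current driver version

  1. Driver Version: 595.71.05, which supports CUDA up to version 13.2
root@xunku:~# nvidia-smi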
Fri Jul  3 09:42:36 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:3B:00.0 Off |                  N/A |
| 30%   31C    P8             22W /  350W |   20878MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           35736      C   VLLM::EngineCore                      20868MiB |
+-----------------------------------------------------------------------------------------+

Downloading the matching fabricmanager

NVIDIA official download site

1. Download the fabricmanager package whose version matches the driver, here 595.71.05.

  • nvidia-fabricmanager_*.deb: the main package, required to run the service.

  • nvidia-fabricmanager-dev_*.deb: the development package (headers and so on), needed only when compiling software against this component. Ignore it here.

# Example
export DRIVER_VERSION=595.71.05
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-$(echo $DRIVER_VERSION | awk -F '.' '{print $1}')_${DRIVER_VERSION}-1_amd64.deb

wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager_595.71.05-1ubuntu1_amd64.deb

Installing fabricmanager

  1. Install fabricmanager
# Example
dpkg -i nvidia-fabricmanager-$(echo $DRIVER_VERSION | awk -F '.' '{print $1}')_${DRIVER_VERSION}-1_amd64.deb

dpkg -i nvidia-fabricmanager_595.71.05-1ubuntu1_amd64.deb
  1. Check that the service is running
systemctl status nvidia-fabricmanager
(base) root@Ubuntu22:~# systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2025-09-16 18:06:03 CST; 9 months 15 days ago
   Main PID: 3290 (nv-fabricmanage)
      Tasks: 19 (limit: 629145)
     Memory: 21.2M
        CPU: 3h 55min 53.453s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─3290 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Notice: journal has been rotated since unit was started, output may be incomplete.
  1. Check the installed Fabric Manager version
dpkg -l | grep nvidia-fabricmanager
(base) root@Ubuntu22:~# dpkg -l | grep nvidia-fabricmanager
ii  nvidia-fabricmanager-595               595.71.05-1ubuntu0.22.04.1             amd64        Fabric Manager for NVSwitch based systems.
  1. Hold nvidia-fabricmanager to prevent automatic upgrades
sudo apt-mark hold nvidia-fabricmanager-595
nvidia-fabricmanager-595 set on hold.
  1. List held packages; a package that appears in the output is on hold
sudo apt-mark showhold
nvidia-fabricmanager-595
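
Because fabricmanager has to match the driver version exactly, the comparison can be scripted. The sketch below assumes the Debian package version starts with the driver version followed by a `-` suffix, as in the dpkg output above; `fm_matches_driver` is a hypothetical helper name.

```shell
# fm_matches_driver DRIVER_VERSION PACKAGE_VERSION
# Succeeds when the part of PACKAGE_VERSION before the first "-" equals
# DRIVER_VERSION.
fm_matches_driver() {
  local driver=$1 pkg=$2
  [ "${pkg%%-*}" = "$driver" ]
}

# Versions taken from the nvidia-smi and dpkg output in this post.
if fm_matches_driver 595.71.05 595.71.05-1ubuntu0.22.04.1; then
  echo "fabricmanager matches the driver"
else
  echo "version mismatch"
fi
```

On a live system the two arguments could come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and `dpkg-query -W -f='${Version}' nvidia-fabricmanager-595`.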

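Installing the CUDA Toolkit

About the CUDA Toolkit

  1. The CUDA Toolkit is NVIDIA's software platform for developing and running GPU-accelerated applications. It includes a compiler, math libraries, debugging tools and more:
  • CUDA driver (nvidia-driver): already installed as nvidia-driver-595-open.
  • CUDA runtime and development tools (the nvcc compiler, libraries such as cuBLAS): the main body of the CUDA Toolkit package.
  • When it is needed: to compile or run C++/Python programs that call the GPU directly (for example building PyTorch/TensorFlow from source, or running a CUDA C++ project), the CUDA Toolkit must be installed.

Installing the CUDA Toolkit

  1. Check the currently installed version, if any
nvcc -V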
(base) root@ubuntu:/public/software# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
  1. Remove the installed CUDA Toolkit packages
apt remove --purge 'cuda-toolkit-*'
  1. Install a specific version of the CUDA Toolkit

On the download page select: Linux -> x86_64 -> Ubuntu -> 22.04 -> deb (network)

Follow the commands shown on that page exactly. The network install configures the repository automatically and makes sure the installed version is compatible with the system driver.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
  1. List the installable versions
apt-cache search cuda-toolkit

nvidia-cuda-toolkit - NVIDIA CUDA development toolkit
nvidia-cuda-toolkit-doc - NVIDIA CUDA and OpenCL documentation
nvidia-cuda-toolkit-gcc - NVIDIA CUDA development toolkit (GCC compatibility)
......
cuda-toolkit-12-9 - CUDA Toolkit 12.9 meta-package
cuda-toolkit-12-9-config-common - Common config package for CUDA Toolkit 12.9.
cuda-toolkit-13-0 - CUDA Toolkit 13.0 meta-package
cuda-toolkit-13-0-config-common - Common config package for CUDA Toolkit 13.0.
cuda-toolkit-13-config-common - Common config package for CUDA Toolkit 13.
cuda-toolkit-13-1 - CUDA Toolkit 13.1 meta-package
cuda-toolkit-13-1-config-common - Common config package for CUDA Toolkit 13.1.
cuda-toolkit-13-2 - CUDA Toolkit 13.2 meta-package
cuda-toolkit-13-2-config-common - Common config package for CUDA Toolkit 13.2.
cuda-toolkit-13 - CUDA Toolkit 13 meta-package
cuda-toolkit - CUDA Toolkit meta-package
cuda-toolkit-13-3 - CUDA Toolkit 13.3 meta-package
cuda-toolkit-13-3-config-common - Common config package for CUDA Toolkit 13.3.
apt list | grep -E "cuda-toolkit-[0-9]{2}-[0-9]{1,2}"

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
......
cuda-toolkit-12-8/unknown 12.8.2-1 amd64
cuda-toolkit-12-9-config-common/unknown 12.9.79-1 all
cuda-toolkit-12-9/unknown 12.9.2-1 amd64
cuda-toolkit-13-0-config-common/unknown 13.0.96-1 all
cuda-toolkit-13-0/unknown 13.0.3-1 amd64
cuda-toolkit-13-1-config-common/unknown 13.1.80-1 all
cuda-toolkit-13-1/unknown 13.1.2-1 amd64
cuda-toolkit-13-2-config-common/unknown 13.2.75-1 all
cuda-toolkit-13-2/unknown 13.2.1-1 amd64
cuda-toolkit-13-3-config-common/unknown,now 13.3.29-1 all [installed,auto-removable]
cuda-toolkit-13-3/unknown 13.3.1-1 amd64
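  1. Install a specific version
  • Note: the newest version is not necessarily the right one.
  • Install the CUDA version that vLLM's default prebuilt packages target (13.0 at the time of writing; this will change as vLLM evolves).
  • The top-right corner of the nvidia-smi output shows the highest CUDA version the driver supports (13.2 at the time of writing; this will also change).
apt install cuda-toolkit-13-0 cuda-toolkit-13-0-config-common
  1. Add the environment variables
vim /etc/profile
# Add these two lines to /etc/profile
export PATH="/usr/local/cuda-13.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"
source /etc/profile
  1. Check the result
nvcc -V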
(vllm-env) root@xunku:/public/vLLM# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
  1. Hold the packages to prevent automatic upgrades
sudo apt-mark hold cuda-toolkit-13-0 cuda-toolkit-13-0-config-common
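
The environment variables from the "Add the environment variables" step can also be written without opening an editor. The following is a sketch of one idempotent way to do it, not the method used above; `add_cuda_env` is a hypothetical helper, and the demonstration writes to a scratch file rather than the real /etc/profile.

```shell
# add_cuda_env FILE CUDA_VERSION
# Appends the two export lines to FILE unless a PATH entry for that CUDA
# version is already present, so running it twice does not duplicate them.
add_cuda_env() {
  local file=$1 ver=$2
  if grep -qF "/usr/local/cuda-${ver}/bin" "$file" 2>/dev/null; then
    return 0
  fi
  {
    echo "export PATH=\"/usr/local/cuda-${ver}/bin:\$PATH\""
    echo "export LD_LIBRARY_PATH=\"/usr/local/cuda-${ver}/lib64:\$LD_LIBRARY_PATH\""
  } >> "$file"
}

# Demonstration against a scratch file.
demo=$(mktemp)
add_cuda_env "$demo" 13.0
add_cuda_env "$demo" 13.0   # second call changes nothing
cat "$demo"
rm -f "$demo"
```

To apply it for real, the target could be /etc/profile or, as an assumption about a tidier layout, a dedicated file such as /etc/profile.d/cuda.sh.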

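About Docker GPU support

Docker GPU support (the NVIDIA Container Toolkit) is a set of tools that lets Docker containers access and use the host's NVIDIA GPUs. In effect it adds a compatibility layer that exposes the host's GPU driver inside the container:

  • The main package is nvidia-container-toolkit, which modifies Docker's configuration.
  • When it is needed: to run any GPU image in a Docker container (for example docker run --gpus all nvidia/cuda:13.0.1-runtime-ubuntu22.04, or the official PyTorch/TensorFlow Docker images), this toolkit must be installed.

Installing Docker GPU support

  1. Install the prerequisites
sudo apt-get update && sudo apt-get install -y --no-install-recommends curl gnupg2
  1. Configure the repository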
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  1. Refresh the package index
sudo apt-get update
  1. Install the toolkit packages
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
  1. Configure the Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json  
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that docker daemon be restarted. 
  1. Restart Docker
sudo systemctl restart docker
  1. Run a test container
  • After a long search, this is an image tag that could actually be pulled
docker pull nvidia/cuda:13.0.1-runtime-ubuntu22.04
13.0.1-runtime-ubuntu22.04: Pulling from nvidia/cuda
60d98d907669: Pulling fs layer 
......
f1e29f967bcf: Pull complete 
48feaf8fd5bd: Pull complete 
8006ce821e80: Pull complete 
Digest: sha256:e4511e846c49e5495ef3d80c82b8f5dd597c6ef5c7f355601ead776ae3e96c67
Status: Downloaded newer image for nvidia/cuda:13.0.1-runtime-ubuntu22.04
docker.io/nvidia/cuda:13.0.1-runtime-ubuntu22.04
docker run --rm --gpus all nvidia/cuda:13.0.1-runtime-ubuntu22.04 nvidia-smi
==========
== CUDA ==
==========

CUDA Version 13.0.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Jan 20 13:37:44 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
| 30%   26C    P8             10W /  350W |       4MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
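
If the test container cannot see the GPU, one quick thing to look at is whether `nvidia-ctk runtime configure` actually registered its runtime in /etc/docker/daemon.json. The sketch below assumes the file contains a `"nvidia"` runtime entry of the shape shown in the scratch file; that shape is an assumption based on typical output, and `has_nvidia_runtime` is a hypothetical helper name.

```shell
# has_nvidia_runtime FILE
# Succeeds if FILE mentions an "nvidia" entry. This is a plain text match,
# not a JSON parse, so treat it as a quick sanity check only.
has_nvidia_runtime() {
  grep -q '"nvidia"' "$1" 2>/dev/null
}

# Demonstration with a scratch file shaped like the assumed config.
demo=$(mktemp)
cat > "$demo" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
EOF
has_nvidia_runtime "$demo" && echo "nvidia runtime configured"
rm -f "$demo"
```

On the host the call would be `has_nvidia_runtime /etc/docker/daemon.json`.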

Downloading the models

  1. ModelScope community
  2. Model detail page

Installing modelscope

  1. Use the system Python, 3.10.12
root@xunku:/public# which python3
/usr/bin/python3
root@xunku:/public# python3 --version
Python 3.10.12
apt install -y python3-pip
root@xunku:/public# pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
  1. Install modelscope
pip3 install modelscope -i https://mirrors.aliyun.com/pypi/simple/
root@xunku:/public/model# modelscope --version

 _   .-')                _ .-') _     ('-.             .-')                              _ (`-.    ('-.
( '.( OO )_             ( (  OO) )  _(  OO)           ( OO ).                           ( (OO  ) _(  OO)
 ,--.   ,--.).-'),-----. \     .'_ (,------.,--.     (_)---\_)   .-----.  .-'),-----.  _.`     \(,------.
 |   `.'   |( OO'  .-.  ',`'--..._) |  .---'|  |.-') /    _ |   '  .--./ ( OO'  .-.  '(__...--'' |  .---'
 |         |/   |  | |  ||  |  \  ' |  |    |  | OO )\  :` `.   |  |('-. /   |  | |  | |  /  | | |  |
 |  |'.'|  |\_) |  |\|  ||  |   ' |(|  '--. |  |`-' | '..`''.) /_) |OO  )\_) |  |\|  | |  |_.' |(|  '--.
 |  |   |  |  \ |  | |  ||  |   / : |  .--'(|  '---.'.-._)   \ ||  |`-'|   \ |  | |  | |  .___.' |  .--'
 |  |   |  |   `'  '-'  '|  '--'  / |  `---.|      | \       /(_'  '--'\    `'  '-'  ' |  |      |  `---.
 `--'   `--'     `-----' `-------'  `------'`------'  `-----'    `-----'      `-----'  `--'      `------'

modelscope-hub 0.1.5

Downloading the model files

mkdir -p /public/model && cd /public/model

modelscope download --model Qwen/Qwen3-14B-AWQ --local_dir ./models/Qwen3-14B-AWQ
modelscope download --model Qwen/Qwen3-8B-AWQ --local_dir ./models/Qwen3-8B-AWQ
modelscope download --model Qwen/Qwen3-4B --local_dir ./models/Qwen3-4B
modelscope download --model Qwen/Qwen3.5-4B --local_dir ./models/Qwen3.5-4B
root@xunku:/public/model# du -sh ./models/*
9.4G    ./models/Qwen3-14B-AWQ
7.6G    ./models/Qwen3-4B
5.7G    ./models/Qwen3-8B-AWQ
8.8G    ./models/Qwen3.5-4B
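
The on-disk sizes give a rough idea of how much of the 24 GB card is left for the KV cache once the weights are loaded. The sketch below is a back-of-the-envelope estimate only: it assumes the loaded weights take about as much memory as the files on disk, uses the 0.9 memory utilization from the launch command later in this post, and ignores activations and other runtime overhead. `kv_budget_mib` is a hypothetical helper name.

```shell
# kv_budget_mib TOTAL_MIB UTIL_PERCENT WEIGHTS_MIB
# Integer estimate of the memory left after the weights, within the share
# of GPU memory the server is allowed to use.
kv_budget_mib() {
  local total=$1 util=$2 weights=$3
  echo $(( total * util / 100 - weights ))
}

# 24576 MiB card, 90% utilization, about 9626 MiB of weights (9.4 GiB on disk).
kv_budget_mib 24576 90 9626   # -> 12492
```

By this estimate roughly 12 GiB remains for the KV cache with Qwen3-14B-AWQ; the real figure reported in the vLLM startup log will differ.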

Setting up the Python virtual environment

Installing pyenv

  1. Install the build dependencies
sudo apt update
sudo apt install -y make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev
  1. Install pyenv
curl https://pyenv.run | bash
  1. Append the following to ~/.bashrc
echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init --path)"' >> ~/.bashrc
echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc
source ~/.bashrc

Installing Python 3.12.2 (a stable version compatible with vLLM)

  1. Install
pyenv install 3.12.2
Downloading Python-3.12.2.tar.xz...
-> https://www.python.org/ftp/python/3.12.2/Python-3.12.2.tar.xz
Installing Python-3.12.2...
Installed Python-3.12.2 to /root/.pyenv/versions/3.12.2
  1. Verify
pyenv versions  # should list * system and 3.12.2
* system (set by /root/.pyenv/version)
  3.12.2

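Creating a dedicated vLLM virtual environment

Creating the project virtual environment

# Create the project directory
mkdir -p /public/vLLM && cd /public/vLLM

# Create a dedicated virtualenv (any name works)
pyenv virtualenv 3.12.2 vllm-env

# Pin the environment to this directory only, leaving the global Python untouched (run this from /public/vLLM)
pyenv local vllm-env

# Verify the Python version
python --version  # should print Python 3.12.2

Entering the project directory activates the environment

The project lives in /public/vLLM and pyenv's global default is left unchanged. After you open a new terminal or run source ~/.bashrc, the virtualenv activates automatically every time you enter the project directory.

cd /public/vLLM
python --version  # prints Python 3.12.2
# Enter the project directory
root@xunku:/public# cd /public/vLLM/
# Verify the Python version
(vllm-env) root@xunku:/public/vLLM# python --version
Python 3.12.2
# Verify the pip version
(vllm-env) root@xunku:/public/vLLM# pip --version
pip 24.0 from /root/.pyenv/versions/3.12.2/envs/vllm-env/lib/python3.12/site-packages/pip (python 3.12)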
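
The automatic activation works because `pyenv local` records the environment name in a `.python-version` file in the current directory, which pyenv looks up when you enter that directory or one beneath it. A small sketch for checking which environment a directory is pinned to follows; `pyenv_local_name` is a hypothetical helper, and the demonstration uses a scratch directory instead of /public/vLLM.

```shell
# pyenv_local_name DIR
# Prints the environment name pinned for DIR, or nothing if DIR has no
# .python-version file of its own.
pyenv_local_name() {
  local file="$1/.python-version"
  [ -f "$file" ] && head -n 1 "$file"
}

# Demonstration in a scratch directory; `pyenv local vllm-env` creates the
# same kind of file in the directory where it is run.
demo=$(mktemp -d)
echo "vllm-env" > "$demo/.python-version"
pyenv_local_name "$demo"   # -> vllm-env
rm -rf "$demo"
```

Running `pyenv_local_name /public/vLLM` after the steps above should print vllm-env. If it prints nothing, `pyenv local` was probably run from a different directory.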

Installing vLLM

Confirming the CUDA environment variables

(vllm-env) root@xunku:/public/vLLM# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
(vllm-env) root@xunku:/public/vLLM# echo $LD_LIBRARY_PATH | grep cuda-13.0
/usr/local/cuda-13.0/lib64:

Verifying that CUDA 13.0 is usable (it must be ≤ 13.2)

(vllm-env) root@xunku:/public/vLLM# nvidia-smi
Thu Jul  2 16:54:08 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:3B:00.0 Off |                  N/A |
| 30%   26C    P8             11W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
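
The rule being checked here is that the installed toolkit version must not exceed the "CUDA Version" the driver reports. A minimal sketch of that comparison follows, assuming GNU `sort -V` is available (it is on Ubuntu); `smi_cuda_version` and `version_le` are hypothetical helper names, and the header line is copied from the output above.

```shell
# smi_cuda_version: read nvidia-smi output on stdin and print the value
# that follows "CUDA Version:".
smi_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# version_le A B: succeeds when version A <= version B (uses GNU sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

header='| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |'
max=$(printf '%s\n' "$header" | smi_cuda_version)
if version_le 13.0 "$max"; then
  echo "toolkit 13.0 is within the driver limit $max"
fi
```

On the server itself the limit could be read with `nvidia-smi | smi_cuda_version` instead of the pasted header line.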

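Removing earlier broken installs (optional)

  1. Uninstall any previously installed packages, if present
(vllm-env) root@xunku:/public/vLLM# pip uninstall -y torch torchvision torchaudio vllm
  1. Purge the pip cache
pip cache purge

Installing vLLM (it picks the CUDA 13.0 build automatically)

  • Note: install only vLLM. It pulls in its own dependencies, including PyTorch, so do not install PyTorch by hand.
  • vLLM installation documentation
pip install vllm --no-cache-dir
# If the download is slow or fails, use a mirror inside China
#pip install vllm -i https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir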
......

Successfully installed aiohappyeyeballs-2.7.1 aiohttp-3.14.1 aiosignal-1.4.0 annotated-doc-0.0.4 annotated-types-0.7.0 anthropic-0.115.1 anyio-4.14.1 apache-tvm-ffi-0.1.9 astor-0.8.1 attrs-26.1.0 blake3-1.0.9 cachetools-7.1.4 cbor2-6.1.2 certifi-2026.6.17 cffi-2.0.0 charset_normalizer-3.4.7 click-8.4.2 cloudpickle-3.1.2 compressed-tensors-0.17.0 cryptography-49.0.0 cuda-bindings-13.3.1 cuda-core-1.0.1 cuda-pathfinder-1.5.6 cuda-python-13.3.1 cuda-tile-1.3.0 cuda-toolkit-13.0.2 depyf-0.20.0 detect-installer-0.1.0 dill-0.4.1 diskcache-5.6.3 distro-1.9.0 dnspython-2.8.0 docstring-parser-0.18.0 einops-0.8.2 email-validator-2.3.0 fastapi-0.136.3 fastapi-cli-0.0.27 fastapi-cloud-cli-0.22.1 fastar-0.11.0 fastsafetensors-0.3.2 flashinfer-cubin-0.6.12 flashinfer-python-0.6.12 frozenlist-1.8.0 googleapis-common-protos-1.75.0 grpcio-1.81.1 h11-0.16.0 hf-xet-1.5.1 httpcore-1.0.9 httptools-0.8.0 httpx-0.28.1 httpx-sse-0.4.3 huggingface-hub-1.21.0 humming-kernels-0.1.6 idna-3.18 ijson-3.5.0 interegular-0.3.3 jiter-0.16.0 jmespath-1.1.0 jsonschema-4.26.0 jsonschema-specifications-2025.9.1 lark-1.2.2 llguidance-1.7.6 llvmlite-0.47.0 lm-format-enforcer-0.11.3 loguru-0.7.3 markdown-it-py-4.2.0 mcp-1.28.1 mdurl-0.1.2 mistral_common-1.11.5 ml-dtypes-0.5.4 model-hosting-container-standards-0.1.16 msgspec-0.21.1 multidict-6.7.1 ninja-1.13.0 numba-0.65.0 numpy-2.3.5 nvidia-cublas-13.1.0.3 nvidia-cuda-cccl-13.3.3.4.1 nvidia-cuda-crt-13.3.73 nvidia-cuda-cupti-13.0.85 nvidia-cuda-nvcc-13.2.78 nvidia-cuda-nvrtc-13.0.88 nvidia-cuda-runtime-13.0.96 nvidia-cuda-tileiras-13.2.78 nvidia-cudnn-cu13-9.19.0.56 nvidia-cudnn-frontend-1.25.0 nvidia-cufft-12.0.0.61 nvidia-cufile-1.15.1.6 nvidia-curand-10.4.0.35 nvidia-cusolver-12.0.4.66 nvidia-cusparse-12.6.3.3 nvidia-cusparselt-cu13-0.8.0 nvidia-cutlass-dsl-4.5.2 nvidia-cutlass-dsl-libs-base-4.5.2 nvidia-cutlass-dsl-libs-cu13-4.5.2 nvidia-ml-py-13.610.43 nvidia-nccl-cu13-2.28.9 nvidia-nvjitlink-13.0.88 nvidia-nvshmem-cu13-3.4.5 nvidia-nvtx-13.0.85 
nvidia-nvvm-13.2.78 openai-2.44.0 openai-harmony-0.0.8 opencv-python-headless-5.0.0.93 opentelemetry-api-1.43.0 opentelemetry-exporter-otlp-1.43.0 opentelemetry-exporter-otlp-proto-common-1.43.0 opentelemetry-exporter-otlp-proto-grpc-1.43.0 opentelemetry-exporter-otlp-proto-http-1.43.0 opentelemetry-proto-1.43.0 opentelemetry-sdk-1.43.0 opentelemetry-semantic-conventions-0.64b0 opentelemetry-semantic-conventions-ai-0.5.1 outlines_core-0.2.14 packaging-26.2 partial-json-parser-0.2.1.1.post7 prometheus-fastapi-instrumentator-8.0.2 prometheus_client-0.25.0 propcache-0.5.2 protobuf-7.35.1 psutil-7.2.2 py-cpuinfo-9.0.0 pybase64-1.4.3 pycountry-26.2.16 pycparser-3.0 pydantic-2.13.4 pydantic-core-2.46.4 pydantic-extra-types-2.11.1 pydantic-settings-2.14.2 pyelftools-0.33 pygments-2.20.0 pyjwt-2.13.0 python-dotenv-1.2.2 python-json-logger-4.1.0 python-multipart-0.0.32 pyyaml-6.0.3 pyzmq-27.1.0 quack-kernels-0.5.0 referencing-0.37.0 regex-2026.6.28 requests-2.34.2 rich-15.0.0 rich-toolkit-0.20.1 rignore-0.7.6 rpds-py-2026.6.3 safetensors-0.8.0 sentencepiece-0.2.1 sentry-sdk-2.64.0 setproctitle-1.3.7 setuptools-80.10.2 shellingham-1.5.4 six-1.17.0 sniffio-1.3.1 sse-starlette-3.4.5 starlette-1.3.1 supervisor-4.3.0 sympy-1.14.0 tabulate-0.10.0 tiktoken-0.13.0 tilelang-0.1.9 tokenizers-0.22.2 tokenspeed-mla-0.1.2 tokenspeed-triton-3.7.10.post20260531 torch-2.11.0 torch-c-dlpack-ext-0.1.5 torchaudio-2.11.0 torchvision-0.26.0 tqdm-4.68.3 transformers-5.12.1 triton-3.6.0 typer-0.25.1 typing-inspection-0.4.2 urllib3-2.7.0 uvicorn-0.49.0 uvloop-0.22.1 vllm-0.24.0 watchfiles-1.2.0 websockets-16.0 xgrammar-0.2.3 yarl-1.24.2 z3-solver-4.15.4.0

[notice] A new release of pip is available: 24.0 -> 26.1.2
[notice] To update, run: python -m pip install --upgrade pip

Verifying the installation

  1. Verify PyTorch
(vllm-env) root@xunku:/public/vLLM# python -c "import torch; assert torch.cuda.is_available(); print(f'PyTorch CUDA {torch.version.cuda} OK')"
PyTorch CUDA 13.0 OK
(vllm-env) root@xunku:/public/vLLM# python -c "import torch; \
  print(f'Compiled CUDA version: {torch._C._cuda_getCompiledVersion()}'); \
  print(f'Reported version: {torch.version.cuda}'); \
  print(f'Driver compatible: {torch.cuda.is_available()}')"
Compiled CUDA version: 13000
Reported version: 13.0
Driver compatible: True
  1. Verify vLLM
(vllm-env) root@xunku:/public/vLLM# python -c "import torch, vllm; print(f'vLLM {vllm.__version__} + CUDA {torch.version.cuda} OK')"
vLLM 0.24.0 + CUDA 13.0 OK
  1. Verify that the GPU is available
  • True means the GPU is usable
(vllm-env) root@xunku:/public/vLLM# python -c "import torch; print(torch.cuda.is_available())"
True

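Starting the service

  1. The command, annotated
# Options are kept in an array so that each one can carry its own comment
args=(
  # Listening port
  --port 8000
  # Listening address
  --host 0.0.0.0
  # State the quantization method explicitly
  --quantization awq
  # Maximum number of requests processed concurrently in one batch (default 1024); tune it by watching KV cache utilization
  --max-num-seqs 64
  # Upper bound on the total tokens across all requests in one iteration (default 2048)
  --max-num-batched-tokens 8192
  # Number of tokens each KV cache block holds under PagedAttention (default 16)
  --block-size 32
  # Split long prompts (prefill) into chunks and batch them with decode requests; on by default in recent versions
  --enable-chunked-prefill
  # Use 90% of GPU memory, a safe threshold for a 24 GB card (default 0.9)
  --gpu-memory-utilization 0.9
  # The quantized Qwen3 natively supports a 40K context. To go beyond that (Hermes, for example, asks for 64K), use YaRN.
  # If this option is unset, vLLM derives the limit from the model config
  --max-model-len 32768
  # Parse Qwen3 thinking-mode output
  --reasoning-parser qwen3
  # Tool-call parser: qwen3_xml
  --tool-call-parser qwen3_xml
  # Enable automatic tool choice
  --enable-auto-tool-choice
  # Model name exposed to clients; it must match the model name in the Hermes configuration exactly
  --served-model-name qwen3-14b-awq
  # Enable prefix caching (for shared prefixes such as the system prompt)
  --enable-prefix-caching
  # Optional: require an API key
  #--api-key sk-env-test-20260703
  # Required when the model ships custom code
  --trust-remote-code
)
# Start the model with vLLM
vllm serve /public/model/models/Qwen3-14B-AWQ "${args[@]}"
# On a single GPU, leave --tensor-parallel-size unset (it defaults to 1)
# With multiple GPUs, set it to the GPU count, for example --tensor-parallel-size 8 for 8 GPUs
  1. The launch script
cd /public/vLLM/ && vim restart.sh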
#!/bin/bash

# ===== What this script does =====
# 1. Finds and terminates any running vLLM processes
# 2. Starts the service in the background, writing a timestamped log file per start
# 3. Keeps the command free of inline comments, which would break the line continuations
# 4. Reports the startup status as it happens

LOG_DIR="/public/vLLM/logs"
PID_FILE="/public/vLLM/vllm.pid"
mkdir -p "$LOG_DIR"

# ===== Step 1: terminate existing processes =====
echo "[$(date +'%Y-%m-%d %H:%M:%S')] Checking for running vLLM processes..."
EXISTING_PIDS=$(ps aux | grep -E 'vllm|EngineCore|APIServer' | grep -v 'grep' | awk '{print $2}')

if [ -n "$EXISTING_PIDS" ]; then
    echo "Processes to terminate: $EXISTING_PIDS"
    echo "$EXISTING_PIDS" | xargs kill -9 2>/dev/null
    sleep 2  # give the processes time to exit
    # Check again
    REMAINING=$(ps aux | grep -E 'vllm|EngineCore|APIServer' | grep -v 'grep' | wc -l)
    if [ "$REMAINING" -gt 0 ]; then
        echo "[$(date +'%Y-%m-%d %H:%M:%S')] Error: could not terminate all old processes, please check manually!" >&2
        exit 1
    fi
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] Old processes terminated"
else
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] No conflicting processes, starting the service"
fi

# ===== Step 2: start the new service =====
nohup vllm serve /public/model/models/Qwen3-14B-AWQ \
  --port 8000 \
  --host 0.0.0.0 \
  --quantization awq \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --block-size 32 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32678 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_xml \
  --enable-auto-tool-choice \
  --served-model-name qwen3-14b-awq \
  --trust-remote-code > "$LOG_DIR/vllm_$(date +'%Y%m%d_%H%M%S').log" 2>&1 &

NEW_PID=$!
echo $NEW_PID > "$PID_FILE"

# ===== Step 3: verify startup =====
echo "[$(date +'%Y-%m-%d %H:%M:%S')] Starting... PID=$NEW_PID, waiting for the service to become ready (up to 300 seconds)"
START_TIME=$(date +%s)
while true; do
    if curl -s http://localhost:8000/health > /dev/null; then
        echo "[$(date +'%Y-%m-%d %H:%M:%S')] Service started. Endpoint: http://ip:8000/v1/models"
        exit 0
    fi

    ELAPSED=$(( $(date +%s) - START_TIME ))
    if [ $ELAPSED -gt 300 ]; then
        echo "[$(date +'%Y-%m-%d %H:%M:%S')] Error: startup timed out, check the logs in $LOG_DIR" >&2
        kill -9 $NEW_PID 2>/dev/null
        exit 1
    fi
    sleep 2
done
  1. Start the service
    Note: cd into the project directory first so that the virtualenv is active
root@xunku:~# cd /public/vLLM/
(vllm-env) root@xunku:/public/vLLM# ls
logs  restart.sh  start.sh.orig  vllm.pid
(vllm-env) root@xunku:/public/vLLM# ./restart.sh 
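
restart.sh records the server's PID in vllm.pid but stops old processes by pattern-matching `ps` output and sending `kill -9`. As a possible gentler alternative, the sketch below stops only the recorded PID, tries SIGTERM first and falls back to SIGKILL after a timeout. It assumes the server shuts down its worker processes when it receives SIGTERM, which is worth confirming on your version; `stop_by_pidfile` is a hypothetical helper name.

```shell
# stop_by_pidfile PIDFILE [TIMEOUT_SECONDS]
# Sends SIGTERM to the PID stored in PIDFILE, waits up to TIMEOUT seconds
# (default 10), and uses SIGKILL only if the process is still alive.
# The PID file is removed afterwards.
stop_by_pidfile() {
  local pidfile=$1 timeout=${2:-10} pid waited=0
  [ -f "$pidfile" ] || return 0
  pid=$(cat "$pidfile")
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    kill -TERM "$pid" 2>/dev/null
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
      sleep 1
      waited=$(( waited + 1 ))
    done
    # Still alive after the timeout: force it.
    if kill -0 "$pid" 2>/dev/null; then
      kill -KILL "$pid" 2>/dev/null
    fi
  fi
  rm -f "$pidfile"
}

# Example (not run here): stop_by_pidfile /public/vLLM/vllm.pid 30
```

A stale PID file is a risk with this approach: if the recorded PID has been reused by an unrelated process, that process would be signalled, so a stricter version would also check the command name.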

Verification

  1. Get the model name
  • "id":"qwen3-14b-awq"
(vllm-env) root@xunku:/public/vLLM# curl http://localhost:8000/v1/models

{"object":"list","data":[{"id":"qwen3-14b-awq","object":"model","created":1783058807,"owned_by":"vllm","root":"/public/model/models/Qwen3-14B-AWQ","parent":null,"max_model_len":32678 ,"permission":[{"id":"modelperm-ab87467a8a23f2c5","object":"model_permission","created":1783058807,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
  1. Send a test request
(vllm-env) root@xunku:/public/vLLM# curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "qwen3-14b-awq",
    "messages": [{"role": "user", "content": "你好"}]
  }'
  
{"id":"chatcmpl-bcef4b2c5c3669ba","object":"chat.completion","created":1783001363,"model":"qwen3-14b-awq","choices":[{"index":0,"message":{"role":"assistant","content":"\n\n你好呀!😊 有什么我可以帮你的吗?","refusal":null,"annotations":null,"audio":null,"function_call":null,"reasoning":"\n好的,用户打招呼说"你好",我需要友好回应。首先,确认用户的需求,可能只是普通问候,也可能有后续问题。保持开放态度,用中文回复,简洁自然。可以加上表情符号增加亲切感,但不过度。确保没有格式错误,直接回应即可。\n"},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null,"routed_experts":null}],"service_tier":null,"system_fingerprint":"vllm-0.24.0-2702c72b","usage":{"prompt_tokens":9,"total_tokens":86,"completion_tokens":77,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"prompt_text":null,"kv_transfer_params":null}
  1. Calling from Postman
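
Whichever client sends the request, curl or Postman, the body has the same shape as in the test call above. The sketch below builds that body from a model name and a prompt. It escapes only backslashes and double quotes, so prompts containing newlines or control characters need a real JSON encoder; `chat_payload` is a hypothetical helper name.

```shell
# chat_payload MODEL PROMPT
# Prints a minimal chat-completions request body. Only backslashes and
# double quotes in PROMPT are escaped.
chat_payload() {
  local model=$1 prompt
  prompt=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' "$model" "$prompt"
}

chat_payload qwen3-14b-awq 'Say "hello" in one word'
```

The result can be passed to curl with `-d "$(chat_payload qwen3-14b-awq '...')"`. If the server was started with `--api-key`, the request also needs an `Authorization: Bearer <key>` header.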