Reference: Part 79 – E5-2680V4 + V100-32G + llama-cpp build and run + Qwen3-Next-80B (CSDN blog)
Version
```bash
./bin/llama-server --version
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla PG503-216, compute capability 7.0, VMM: yes
version: 7937 (423bee4)
```
Run
```bash
./bin/llama-server -m /models/GGUF_LIST/Qwen3-Coder-Next-UD-Q2_K_XL.gguf \
  --host 0.0.0.0 --port 28000 \
  --gpu-layers 999 --ctx-size 128000 --threads 28
```
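Once the server is up, llama-server exposes an OpenAI-compatible HTTP API, so it can be queried at `/v1/chat/completions` on the port given above. A minimal sketch of such a request (the prompt text and sampling parameters here are illustrative, not from the original post):

```python
import json
import urllib.request

# llama-server's OpenAI-compatible chat endpoint on the port from the launch command.
url = "http://127.0.0.1:28000/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Write a hello-world in C."}],
    "max_tokens": 256,      # illustrative sampling parameters
    "temperature": 0.7,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server launched above is actually running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

The model name field can be omitted because llama-server serves the single model fixed at launch.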
Resource usage
GPU: 32076 MiB
```bash
Tue Feb 10 22:52:37 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla PG503-216                On  | 00000000:04:00.0 Off |                    0 |
| N/A   28C    P0             137W / 250W |  32076MiB / 32768MiB |     96%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
CPU: 1 core busy
RAM: 1 GB
Speed: 59.99 tokens/s