Environment
System: CentOS 7
CPU: E5-2680 v4, 14 cores / 28 threads
Memory: DDR4-2133 32G * 2
GPU: Tesla V100 32G [PG503] (water-cooled)
Driver: 535
CUDA: 12.2
Environment setup
The CUDA-enabled llama-cpp-python environment follows the previous post:
[Part 76: V100 + llama-cpp-python + Qwen3-30B + GGUF - CSDN Blog](https://blog.csdn.net/hai4321/article/details/157739271)
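Before moving on, it is worth verifying that the CUDA build from that post is the one actually installed. A minimal check (assuming a recent llama-cpp-python, which exposes `llama_supports_gpu_offload`):

```python
# check_build.py - confirm the installed llama-cpp-python supports GPU offload
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
# False on a CPU-only wheel; should be True for the CUDA build
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```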
Install dependencies
```bash
pip install sentencepiece -i https://mirrors.cloud.tencent.com/pypi/simple
pip install uvicorn -i https://mirrors.cloud.tencent.com/pypi/simple
pip install starlette -i https://mirrors.cloud.tencent.com/pypi/simple
pip install fastapi -i https://mirrors.cloud.tencent.com/pypi/simple
pip install sse_starlette -i https://mirrors.cloud.tencent.com/pypi/simple
pip install starlette_context -i https://mirrors.cloud.tencent.com/pypi/simple
pip install pydantic_settings -i https://mirrors.cloud.tencent.com/pypi/simple
```
Install any other packages yourself as needed.
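A quick way to confirm everything installed cleanly is to import each package once; a minimal sketch:

```python
# check_deps.py - fail fast if any of the server's dependencies is missing
import fastapi
import pydantic_settings
import sentencepiece
import sse_starlette
import starlette
import starlette_context
import uvicorn

print("all dependencies imported OK")
```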
Code
```python
#!/usr/bin/env python3
# server.py
from llama_cpp.server.app import create_app
from llama_cpp.server.settings import Settings
import uvicorn

MODEL_PATH = "/models/GGUF_LIST/Qwen3-30B-A3B-Thinking-2507-Q4_K_M.gguf"

settings = Settings(
    model=MODEL_PATH,
    n_ctx=32768,
    n_gpu_layers=65,     # V100 32GB: high enough to offload every layer
    n_threads=28,
    n_batch=512,
    chat_format="qwen",  # chat template registered for the Qwen models
    host="0.0.0.0",
    port=8000,
    verbose=False,
)

app = create_app(settings)

if __name__ == "__main__":
    uvicorn.run(app, host=settings.host, port=settings.port)
```
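Before wiring up the HTTP server, a quick smoke test with the `Llama` class directly can confirm that the GGUF loads and generates. A minimal sketch; the small `n_ctx` here is only to speed up the check:

```python
# smoke_test.py - load the GGUF directly and run one short generation
from llama_cpp import Llama

llm = Llama(
    model_path="/models/GGUF_LIST/Qwen3-30B-A3B-Thinking-2507-Q4_K_M.gguf",
    n_ctx=4096,         # a small context is enough for a smoke test
    n_gpu_layers=65,    # same offload setting as the server
    chat_format="qwen",
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 1+1?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```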
Run
```bash
python server.py
```
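Loading a 30B Q4_K_M file takes a while, so the API is not usable the instant the process starts. A small poll loop (a sketch assuming the `requests` package is installed) waits until the model is listed:

```python
# wait_ready.py - poll /v1/models until the server has finished loading
import time
import requests

for _ in range(60):
    try:
        r = requests.get("http://localhost:8000/v1/models", timeout=2)
        if r.ok:
            print("server ready, models:", [m["id"] for m in r.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
else:
    print("server did not come up within 5 minutes")
```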
Access
```bash
# 1. List the available models
curl http://localhost:8000/v1/models

# 2. Non-streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-30B-A3B-Thinking-2507-Q4_K_M.gguf",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant"},
      {"role": "user", "content": "What is 1+1?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }' | jq .

# 3. Streaming chat (recommended for long outputs)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-30B-A3B-Thinking-2507-Q4_K_M.gguf",
    "messages": [{"role": "user", "content": "Please write a poem about spring"}],
    "stream": true
  }'
```
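Since the server speaks the OpenAI API, any OpenAI-compatible client works as well. A minimal sketch with the official `openai` package (`pip install openai`; the API key is a placeholder, as the server does not require one by default):

```python
# client.py - call the server through the openai package
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# non-streaming request
resp = client.chat.completions.create(
    model="Qwen3-30B-A3B-Thinking-2507-Q4_K_M.gguf",
    messages=[{"role": "user", "content": "What is 1+1?"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)

# streaming request: print tokens as they arrive
stream = client.chat.completions.create(
    model="Qwen3-30B-A3B-Thinking-2507-Q4_K_M.gguf",
    messages=[{"role": "user", "content": "Please write a poem about spring"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```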