Since the model and weight files were already downloaded, those steps are skipped here. open-webui was also installed yesterday, so it is skipped as well. A no-nonsense walkthrough.
Test environment
Tests were run on a rented AutoDL GPU server.
• Software: PyTorch 2.5.1, Python 3.12 (Ubuntu 22.04), CUDA 12.4
• Hardware:
○ GPU: RTX 4090 (24 GB) × 4 (only one GPU actually used)
○ CPU: 64 vCPU Intel(R) Xeon(R) Gold 6430
○ RAM: 480 GB (at least 382 GB required)
○ Disk: 1.8 TB (about 380 GB actually used)
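As a quick sanity check that a rented instance actually meets these numbers, the standard Linux tools below are enough (the path follows the AutoDL layout used later in this post):
shell
nvidia-smi              # GPUs and available VRAM
free -g                 # RAM in GB (needs roughly 382 GB free)
df -h /root/autodl-tmp  # disk space (the weights take about 380 GB)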
1. Create the environment
Create a virtual environment
shell
conda create --prefix=/root/autodl-tmp/jacky/envs/deepseekr1-671b python==3.12.3
conda activate /root/autodl-tmp/jacky/envs/deepseekr1-671b
Install PyTorch, packaging, ninja, cpufeature, and numpy
shell
pip install torch packaging ninja cpufeature numpy
Install flash-attn
shell
pip install flash-attn
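If the flash-attn wheel has to be compiled and the build is slow or runs out of memory, the flash-attn project suggests capping parallel jobs and disabling build isolation so the torch/ninja/packaging installed above are reused; a hedged alternative:
shell
# Alternative install if the default one struggles (the MAX_JOBS value is only an example)
MAX_JOBS=16 pip install flash-attn --no-build-isolation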
Install libstdcxx-ng
shell
conda install -c conda-forge libstdcxx-ng
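libstdcxx-ng from conda-forge is typically installed to provide newer GLIBCXX symbols than the system libstdc++ ships. Assuming binutils' strings is available, a quick way to confirm the env now has them:
shell
# CONDA_PREFIX points at the env activated above
strings "$CONDA_PREFIX/lib/libstdc++.so.6" | grep GLIBCXX | tail -n 3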
2. Build and install ktransformers
shell
cd /root/autodl-tmp/jacky
cp -r ktransformers ktransformers-new
cd ktransformers-new
export TORCH_CUDA_ARCH_LIST="8.9"
pip install -r requirements-local_chat.txt
pip install setuptools wheel packaging
Edit ./install.sh and add:
export MAX_JOBS=64
export CMAKE_BUILD_PARALLEL_LEVEL=64
shell
sh install.sh
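After install.sh finishes, a minimal check that the package actually landed in this environment:
shell
pip show ktransformers            # should print the installed version
python -c "import ktransformers"  # should exit silently with no ImportError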
3. Run
Run ktransformers
shell
export TORCH_CUDA_ARCH_LIST="8.9"
Start a command-line chat
shell
export TORCH_CUDA_ARCH_LIST="8.9"
python ./ktransformers/local_chat.py --model_path /root/autodl-tmp/DeepSeek-R1 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF --cpu_infer 64 --max_new_tokens 1000 --force_think true | tee runlog1.log
Start the local chat API endpoint
shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
--gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ \
--model_path /root/autodl-tmp/DeepSeek-R1 \
--model_name deepseek-r1-new \
--cpu_infer 64 \
--max_new_tokens 8192 \
--cache_lens 32768 \
--total_context 32768 \
--cache_q4 true \
--temperature 0.6 \
--top_p 0.95 \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
--force_think \
--use_cuda_graph \
--host 127.0.0.1 \
--port 12345
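Before wiring up open-webui, the endpoint can be smoke-tested from the server itself. A minimal sketch, assuming the ktransformers server exposes an OpenAI-compatible /v1/chat/completions route, using the --model_name set above:
shell
curl http://127.0.0.1:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1-new", "messages": [{"role": "user", "content": "hello"}], "stream": false}'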
Run open-webui
shell
cd /root/autodl-tmp/jacky/open-webui
sh start.sh
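If open-webui has not yet been pointed at the local endpoint, it can also be configured through environment variables before launch (variable names per open-webui's documentation; the key value is arbitrary because the local server does not validate it). A sketch, assuming start.sh launches open-webui directly:
shell
cd /root/autodl-tmp/jacky/open-webui
export OPENAI_API_BASE_URL=http://127.0.0.1:12345/v1  # the ktransformers API endpoint
export OPENAI_API_KEY=dummy-key                       # placeholder, not checked locally
sh start.sh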
Set up SSH forwarding
Once the webui and the API endpoint are both up on the server, create an SSH forwarding rule on the local PC:
shell
ssh -CNg -L 3000:127.0.0.1:3000 [email protected] -p 22305
Open a browser and test.
4. Parameter tuning
Lower cpu_infer and watch how TPS changes
Conclusions first; the data follows:
- Guess before testing: in theory, the smaller cpu_infer is, the more work is effectively pushed to the GPU, so with GPU headroom available there should be an improvement (GPU utilization was under 50% in the earlier tests).
- After testing and checking the ktransformers documentation, that guess was proven wrong. cpu_infer usually works better when it is fairly large (while staying below the actual number of CPU cores). It does not work by moving more tasks to the GPU as cpu_infer shrinks; which task runs on which device is fixed by the YAML optimize-rules file and is independent of the cpu_infer value.
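For reference, that placement lives in the optimize rules. The stock DeepSeek-V3-Chat.yaml keeps the experts on CPU during decode roughly as below (paraphrased from memory; verify against the actual file in the repo). cpu_infer only controls how many CPU threads that CPU-side expert path gets:
yaml
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      prefill_device: "cuda"
      prefill_op: "KExpertsTorch"
      generate_device: "cpu"     # decode-time experts stay on the CPU
      generate_op: "KExpertsCPU"
      out_device: "cuda"
  recursive: False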
cpu_infer = 64
- Command
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 64 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344
- Output
Performance(T/s): prefill 47.41532433167817, decode 5.593497795879592. Time(s): tokenize 0.08410024642944336, prefill 21.40658140182495, decode 84.56247186660767
Performance(T/s): prefill 44.39721927042498, decode 5.727537880501856. Time(s): tokenize 0.015021562576293945, prefill 15.856848955154419, decode 290.0024468898773
cpu_infer = 32
- Command
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 32 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344
- Output
Performance(T/s): prefill 53.283252444225695, decode 7.717662862281071. Time(s): tokenize 0.07866573333740234, prefill 34.41982078552246, decode 69.19193148612976
Performance(T/s): prefill 46.742691185571395, decode 7.326065169900766. Time(s): tokenize 0.02002429962158203, prefill 39.407230377197266, decode 285.0097496509552
cpu_infer = 16
- Command
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 16 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344
- Output
Enable multiple GPUs
kvcache-ai.github.io/ktransforme... Do not use ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml; use DeepSeek-V3-Chat-multi-gpu-4.yaml instead, or write a custom rule file, as in the sketch below.
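A sketch of the corresponding launch, identical to the earlier server command except for --optimize_config_path (not re-benchmarked here):
shell
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 64 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12345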
Increase the context window
Command-line mode: local_chat.py
You can increase the context window size by setting --max_new_tokens to a larger value.
Server mode: server
Increase `--cache_lens` to a larger value.
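A sketch of both variants, reusing the commands from section 3 with only the context-related flags raised; the values here are illustrative, not benchmarked:
shell
# Command-line mode: raise --max_new_tokens
python ./ktransformers/local_chat.py --model_path /root/autodl-tmp/DeepSeek-R1 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF --cpu_infer 64 --max_new_tokens 4096 --force_think true

# Server mode: raise --cache_lens (and --total_context to match)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 64 --max_new_tokens 8192 --cache_lens 65536 --total_context 65536 --cache_q4 true --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12345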
Move more weights to the GPU
Refer to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml:
yaml
- match:
    name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layer 4~10 as marlin expert
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0" # run in cuda:0; marlin only support GPU
      generate_op: "KExpertsMarlin" # use marlin expert
  recursive: False
You can modify the layer range as you want, e.g. change name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" to name: "^model\\.layers\\.([4-12])\\.mlp\\.experts$" to move more weights to the GPU.
Note: The first matched rule in the yaml will be applied. For example, if two rules match the same layer, only the first rule's replacement takes effect.
Note: Currently, executing experts on the GPU conflicts with CUDA Graph, and without CUDA Graph there is a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6 GB of VRAM), enabling this feature is not recommended; optimization is still being worked on upstream.
Note: KExpertsTorch is untested.
Upgrade to ktransformers 0.3.0
ktransformers v0.2.2 does not enable AMX; with AMX enabled, performance should improve by about 40%. However, the previously released 0.3 preview build has installation problems, so this is shelved for now. github.com/kvcache-ai/...
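To check whether the current CPU actually has AMX before bothering with the preview build, the standard CPU flags are enough:
shell
# Prints amx_bf16 / amx_int8 / amx_tile if the CPU supports AMX, nothing otherwise
grep -o 'amx[_a-z0-9]*' /proc/cpuinfo | sort -u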
5. Issues
Web search does not work
shell
File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/routers/retrieval.py", line 1400, in process_web_search
web_results = search_web(
└ <function search_web at 0x7f3336425bc0>
File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/routers/retrieval.py", line 1329, in search_web
return search_duckduckgo(
└ <function search_duckduckgo at 0x7f3337487060>
File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/retrieval/web/duckduckgo.py", line 27, in search_duckduckgo
ddgs_gen = ddgs.text(
│ └ <function DDGS.text at 0x7f3336be84a0>
└ <duckduckgo_search.duckduckgo_search.DDGS object at 0x7f32a0cf21e0>
File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/duckduckgo_search/duckduckgo_search.py", line 252, in text
raise DuckDuckGoSearchException(err)
│ └ TimeoutException('https://lite.duckduckgo.com/lite/ RuntimeError: error sending request for url (https://lite.duckduckgo.com/...
└ <class 'duckduckgo_search.exceptions.DuckDuckGoSearchException'>
duckduckgo_search.exceptions.DuckDuckGoSearchException: https://lite.duckduckgo.com/lite/ RuntimeError: error sending request for url (https://lite.duckduckgo.com/lite/): operation timed out
Caused by:
operation timed out
6. Preliminary conclusions
Combining the conclusions of earlier adopters, on hardware configuration for deploying KTransformers:
Effect of single-GPU vs. multi-GPU on actual throughput
The GPU has little effect on actual throughput; a single 3090, a single 4090, or a multi-GPU server makes little difference. It is enough to reserve 20+ GB of VRAM (a minimum-viability experiment needs only 14 GB);
Effect of weight-offload rules on throughput (TPS)
On a multi-GPU server you can go further and hand-write model weight offload rules so that more GPUs take part in inference; this reduces the RAM requirement to some extent but does not improve actual throughput much. The most economical setup is still a single GPU plus a large amount of RAM;
Effect of the KTransformers version on performance
KTransformers currently has V0.2.0, V0.2.1, V0.2.2 and V0.3.0. V0.3.0 is only available as a preview, installable only from a binary download, while V0.2.0 and V0.2.1 support all kinds of CPUs. Starting with V0.3.0, only AMX-capable CPUs are supported, i.e. the latest few generations of Intel CPUs. The actual deployment steps and invocation commands are identical across these versions; if your CPU supports AMX, it is worth experimenting with V0.3.0, which should speed up inference by roughly 40%.