【Ktransformers+Deepseek R1】再来一遍，以验证安装流程

由于模型、权重文件已经下载好了，所有跳过这些步骤。open-webui也在昨天已经安装好，同样跳过。无废话流程

硬件环境

租的AutoDL的GPU服务器做的测试

•软件环境 PyTorch 2.5.1、Python 3.12(ubuntu22.04)、Cuda 12.4

•硬件环境￮GPU：RTX 4090(24GB) * 4（实际只使用一张GPU）￮CPU：64 vCPU Intel(R) Xeon(R) Gold 6430 ￮内存：480G（至少需要382G）￮硬盘：1.8T（实际使用需要380G左右）

一、创建环境

创建虚拟环境

shell 复制代码

conda create --prefix=/root/autodl-tmp/jacky/envs/deepseekr1-671b python==3.12.3
conda activate /root/autodl-tmp/jacky/envs/deepseekr1-671b

安装 PyTorch、packaging、ninja

shell 复制代码

pip install torch packaging ninja cpufeature numpy

安装flash-attn

shell 复制代码

pip install flash-attn

安装libstdcxx-ng

shell 复制代码

conda install -c conda-forge libstdcxx-ng

二、编译安装ktransformers

shell 复制代码

cd /root/autodl-tmp/jacky
cp -r ktransformers ktransformers-new
cd ktransformers-new

export TORCH_CUDA_ARCH_LIST="8.9"

pip install -r requirements-local_chat.txt
pip install setuptools wheel packaging

修改./install.sh，加入：

export MAX_JOBS=64 export CMAKE_BUILD_PARALLEL_LEVEL=64

shell 复制代码

sh install.sh

三、运行

运行ktransformer

shell 复制代码

export TORCH_CUDA_ARCH_LIST="8.9"

启动命令行聊天

shell 复制代码

export TORCH_CUDA_ARCH_LIST="8.9"
python ./ktransformers/local_chat.py --model_path /root/autodl-tmp/DeepSeek-R1 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF --cpu_infer 64 --max_new_tokens 1000 --force_think true | tee runlog1.log

启动本地聊天API端点

shell 复制代码

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
--gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ \
--model_path /root/autodl-tmp/DeepSeek-R1 \
--model_name deepseek-r1-new \
--cpu_infer 64 \
--max_new_tokens 8192 \
--cache_lens 32768 \
--total_context 32768 \
--cache_q4 true \
--temperature 0.6 \
--top_p 0.95 \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
--force_think \
--use_cuda_graph \
--host 127.0.0.1 \
--port 12345

运行open-webui

shell 复制代码

cd /root/autodl-tmp/jacky/open-webui
sh start.sh

建立 ssh转发

等服务器上webui和api端点都起来后，在本地PC上，建一个ssh转发规则

shell 复制代码

ssh -CNg -L 3000:127.0.0.1:3000 root@connect.nmb1.seetacloud.com -p 22305

打开浏览器进行测试

http://localhost:3000

四、参数调整

将cpu_info降低，观察tps变化

直接上结论，数据看后面：

测试之前的猜想：理论上讲cpu_infer越小就相不于把更多的事情放到gpu去做，在GPU够的情况下，应该是会有改善的（前面测试下来GPU使用率才50%不到）。
测试并在ktransformers文档中确认后，被打脸了。这个cpu_infer通常是大一点（比实际的CPU核数小）会比较好。其工作原理并不是cpu_infer越小就会放越多的任务到gpu，什么任务在什么地方跑是由yaml配置文件里指定好的，跟cpu_infer的值无关。

cpu_info = 64

命令

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 64 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344

输出

Performance(T/s): prefill 47.41532433167817, decode 5.593497795879592. Time(s): tokenize 0.08410024642944336, prefill 21.40658140182495, decode 84.56247186660767 Performance(T/s): prefill 44.39721927042498, decode 5.727537880501856. Time(s): tokenize 0.015021562576293945, prefill 15.856848955154419, decode 290.0024468898773

cpu_infer = 32

命令

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 32 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344

输出

Performance(T/s): prefill 53.283252444225695, decode 7.717662862281071. Time(s): tokenize 0.07866573333740234, prefill 34.41982078552246, decode 69.19193148612976 Performance(T/s): prefill 46.742691185571395, decode 7.326065169900766. Time(s): tokenize 0.02002429962158203, prefill 39.407230377197266, decode 285.0097496509552

cpu_infer = 16

命令

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 16 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344

输出

启用多GPU

kvcache-ai.github.io/ktransforme... 不要用ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml，改用DeepSeek-V3-Chat-multi-gpu-4.yaml，或者自定义。

放大context上下文

命令行模式：local_chat.py

You can increase the context window size by setting --max_new_tokens to a larger value.

服务模式：server

Increase the `--cache_lens' to a larger value.

将更多权重移动到GPU

Refer to the ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml

yaml 复制代码

- match:
   name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$" # inject experts in layer 4~10 as marlin expert
 replace:
   class: ktransformers.operators.experts.KTransformersExperts  
   kwargs:
     generate_device: "cuda:0" # run in cuda:0; marlin only support GPU
     generate_op:  "KExpertsMarlin" # use marlin expert
 recursive: False

You can modify layer as you want, eg. name: "^model\.layers\.([4-10])\.mlp\.experts <math xmlns="http://www.w3.org/1998/Math/MathML"> " t o n a m e : " m o d e l . l a y e r s . ( [ 4 − 12 ] ) . m l p . e x p e r t s " to name: "^model\\.layers\\.([4-12])\\.mlp\\.experts </math>"toname:"model.layers.([4−12]).mlp.experts" to move more weights to the GPU.
Note: The first matched rule in yaml will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement will be valid. Note：Currently, executing experts on the GPU will conflict with CUDA Graph. Without CUDA Graph, there will be a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization. Note KExpertsTorch is untested.

升级到ktransformers 0.3.0

ktransformers v0.2.2 没有开启amx，开启了amx，应该可以增加40%性能，但是前面放出来的那个0.3 preview版本安装有问题，暂放弃。 github.com/kvcache-ai/...

五、问题

联网搜索功能异常

shell 复制代码

  File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/routers/retrieval.py", line 1400, in process_web_search
    web_results = search_web(
                  └ <function search_web at 0x7f3336425bc0>
  File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/routers/retrieval.py", line 1329, in search_web
    return search_duckduckgo(
           └ <function search_duckduckgo at 0x7f3337487060>
  File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/retrieval/web/duckduckgo.py", line 27, in search_duckduckgo
    ddgs_gen = ddgs.text(
               │    └ <function DDGS.text at 0x7f3336be84a0>
               └ <duckduckgo_search.duckduckgo_search.DDGS object at 0x7f32a0cf21e0>
  File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/duckduckgo_search/duckduckgo_search.py", line 252, in text
    raise DuckDuckGoSearchException(err)
          │                         └ TimeoutException('https://lite.duckduckgo.com/lite/ RuntimeError: error sending request for url (https://lite.duckduckgo.com/...
          └ <class 'duckduckgo_search.exceptions.DuckDuckGoSearchException'>

duckduckgo_search.exceptions.DuckDuckGoSearchException: https://lite.duckduckgo.com/lite/ RuntimeError: error sending request for url (https://lite.duckduckgo.com/lite/): operation timed out

Caused by:
    operation timed out

六、初步结论

结合前辈们的结论，KTransformer项目部署硬件配置：

GPU单卡或者多卡对实际运行效率的影响

GPU对实际运行效率提升不大，单卡3090、单卡4090、或者是多卡GPU服务器都没有太大影响，只需要留足20G以上显存（最小可行性实验的话只需要14G显存）即可；

权重规则的运行效率tps的影响

若是多卡服务器，则可以进一步尝试手动编写模型权重卸载规则，使用更多的GPU进行推理，可以一定程度减少内存需求，但对于实际运行效率提升不大。最省钱的方案仍然是单卡GPU+大内存配置；

KTransformers版本对性能的影响

KTransformer目前有V0.2.0、V0.2.1、V0.2.2和V0.3.0，其中V0.3.0目前只有预览版，只支持二进制文件下载和安装，而V0.2.0和V0.2.1支持各类CPU。从V0.3.0开始，只支持AMX CPU，也就是最新几代的Intel CPU。这几个版本实际部署流程和调用指令没有任何区别，若当前CPU支持AMX，则可以考虑使用V3.0进行实验，推理速度会提升40%左右。