KT Qwen3.5-35B-A3B Notes

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 4 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 8 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

This configuration keeps VRAM usage under 10 GB.
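The weight-load lines later in this log report mem usage=11.35 GB with --kt-num-gpu-experts 24 and 13.24 GB with 32, which allows a back-of-envelope estimate of the VRAM cost per GPU-resident expert (a rough linear fit over this log's numbers only, not a general rule):

```python
# Rough per-expert VRAM cost, from the "Load weight end ... mem usage"
# figures captured later in this log.
usage_24 = 11.35  # GB reported with --kt-num-gpu-experts 24
usage_32 = 13.24  # GB reported with --kt-num-gpu-experts 32

per_expert = (usage_32 - usage_24) / (32 - 24)  # GB per GPU-resident expert
base = usage_24 - 24 * per_expert               # non-expert weights + overhead

print(f"~{per_expert:.2f} GB per GPU expert, ~{base:.2f} GB base")
print(f"predicted weight memory @ 8 experts: ~{base + 8 * per_expert:.2f} GB")
```

The predicted ~7.6 GB of weight memory at 8 GPU experts is consistent with the "under 10 GB" observation above once KV/Mamba caches and activations are added.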

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 11 --kt-threadpool-count 1 --kt-num-gpu-experts 16 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.88 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU utilization 98%, RAM usage 20 GB

VRAM usage 12.3 GB, GPU utilization 61%
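Once a launch like the above reports it is ready, the server can be smoke-tested over SGLang's OpenAI-compatible endpoint. A minimal sketch using only the standard library (the prompt is a placeholder, and the `model` field is nominal since SGLang serves the single launched model):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "default",  # nominal; the server hosts one model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, prompt: str) -> str:
    """POST a chat request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("http://127.0.0.1:30000", "Say hi in one word."))
```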

[2026-03-31 18:47:20] Prefill batch, #new-seq: 1, #new-token: 74, #cached-token: 0, full token usage: 0.01, mamba usage: 0.07, #running-req: 0, #queue-req: 1, input throughput (token/s): 0.00, cuda graph: False
[2026-03-31 18:47:34] Prefill batch, #new-seq: 1, #new-token: 95, #cached-token: 0, full token usage: 0.02, mamba usage: 0.14, #running-req: 1, #queue-req: 0, input throughput (token/s): 5.28, cuda graph: False
[2026-03-31 18:48:23] Decode batch, #running-req: 2, #full token: 249, full token usage: 0.03, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 0.20, #queue-req: 0
[2026-03-31 18:49:06] Decode batch, #running-req: 2, #full token: 329, full token usage: 0.04, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 1.84, #queue-req: 0
[2026-03-31 18:49:44] Decode batch, #running-req: 2, #full token: 409, full token usage: 0.05, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 2.15, #queue-req: 0

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 24 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.85 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU utilization 75%, RAM usage 20.2 GB

VRAM usage 12.9 GB, GPU utilization 42%

[2026-03-31 18:58:51] Load weight end. elapsed=255.03 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=3.40 GB, mem usage=11.35 GB.
[2026-03-31 18:58:51] Using KV cache dtype: torch.bfloat16
[2026-03-31 18:58:51] Mamba Cache is allocated. max_mamba_cache_size: 9, conv_state size: 0.01GB, ssm_state size: 0.59GB
[2026-03-31 18:58:51] KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB
[2026-03-31 18:58:51] Memory pool end. avail mem=2.66 GB
[2026-03-31 18:58:53] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-31 18:58:53] CuTe DSL GDN decode enabled: False
[2026-03-31 18:58:54] max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=3, context_len=262144, available_gpu_mem=2.63 GB
[2026-03-31 18:59:00] INFO: Started server process [7941]

[2026-03-31 18:59:00] INFO: Waiting for application startup.
[2026-03-31 18:59:00] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
[2026-03-31 18:59:01] INFO: Application startup complete.
[2026-03-31 18:59:01] The server is fired up and ready to roll!
[2026-03-31 18:59:01] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
...
[2026-03-31 19:01:36] Decode batch, #running-req: 1, #full token: 516, full token usage: 0.13, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.79, #queue-req: 1
[2026-03-31 19:01:41] Decode batch, #running-req: 1, #full token: 556, full token usage: 0.14, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1
[2026-03-31 19:01:46] Decode batch, #running-req: 1, #full token: 596, full token usage: 0.15, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.23, #queue-req: 1
[2026-03-31 19:01:50] Decode batch, #running-req: 1, #full token: 636, full token usage: 0.16, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1
[2026-03-31 19:01:55] Decode batch, #running-req: 1, #full token: 676, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.27, #queue-req: 1
[2026-03-31 19:01:59] Decode batch, #running-req: 1, #full token: 716, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.92, #queue-req: 1
[2026-03-31 19:02:03] Decode batch, #running-req: 1, #full token: 756, full token usage: 0.18, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.22, #queue-req: 1
[2026-03-31 19:02:08] Decode batch, #running-req: 1, #full token: 796, full token usage: 0.19, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.81, #queue-req: 1

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 32 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.96 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU utilization 70%, RAM usage 20.6 GB

VRAM usage 14.6 GB, GPU utilization 47%

[2026-03-31 19:18:13] Load weight end. elapsed=280.22 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=1.50 GB, mem usage=13.24 GB.
[2026-03-31 19:18:13] Using KV cache dtype: torch.bfloat16
[2026-03-31 19:18:13] Mamba Cache is allocated. max_mamba_cache_size: 7, conv_state size: 0.01GB, ssm_state size: 0.47GB
[2026-03-31 19:18:13] KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB
[2026-03-31 19:18:14] Memory pool end. avail mem=1.00 GB
[2026-03-31 19:18:15] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-31 19:18:15] CuTe DSL GDN decode enabled: False
[2026-03-31 19:18:17] max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=2, context_len=262144, available_gpu_mem=1.01 GB
[2026-03-31 19:18:22] INFO: Started server process [8142]

[2026-03-31 19:18:22] INFO: Waiting for application startup.
[2026-03-31 19:18:22] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
[2026-03-31 19:18:24] INFO: Application startup complete.
[2026-03-31 19:18:24] The server is fired up and ready to roll!
[2026-03-31 19:18:24] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
nvcc warning : incompatible redefinition for option 'std', the last value of this option was used
nvcc warning : incompatible redefinition for option 'optimize', the last value of this option was used
nvcc fatal : Unsupported gpu architecture 'compute_120'
ninja: build stopped: subcommand failed.
[2026-03-31 19:21:34] Decode batch, #running-req: 1, #full token: 241, full token usage: 0.06, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.26, #queue-req: 0
[2026-03-31 19:21:39] Decode batch, #running-req: 1, #full token: 281, full token usage: 0.07, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.17, #queue-req: 0
[2026-03-31 19:21:43] Decode batch, #running-req: 1, #full token: 321, full token usage: 0.08, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.87, #queue-req: 0
[2026-03-31 19:21:48] Decode batch, #running-req: 1, #full token: 361, full token usage: 0.09, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.17, #queue-req: 0
[2026-03-31 19:21:53] Decode batch, #running-req: 1, #full token: 401, full token usage: 0.10, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.47, #queue-req: 0
[2026-03-31 19:21:58] Decode batch, #running-req: 1, #full token: 441, full token usage: 0.11, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.10, #queue-req: 0
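The decode throughput figures scattered through the logs above can be summarized with a short parser (a sketch matching the "Decode batch" line format captured in this session):

```python
import re
from statistics import mean

# Matches the "gen throughput (token/s): X" field in SGLang decode-batch lines.
THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([\d.]+)")

def gen_throughputs(log_text: str) -> list[float]:
    """Extract every decode-batch generation throughput from a log dump."""
    return [float(m) for m in THROUGHPUT_RE.findall(log_text)]

# Two sample lines in the same format as the logs above.
sample = """\
[2026-03-31 19:21:34] Decode batch, #running-req: 1, gen throughput (token/s): 9.26, #queue-req: 0
[2026-03-31 19:21:39] Decode batch, #running-req: 1, gen throughput (token/s): 9.17, #queue-req: 0
"""
values = gen_throughputs(sample)
print(f"n={len(values)}, mean={mean(values):.2f} tok/s")
```

Pasting a whole session into `sample` gives the per-run average directly, e.g. roughly 8.7 tok/s for the 24-expert run versus roughly 2 tok/s for the earlier 16-expert run with --kt-cpuinfer 11.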
