KT Qwen3.5-35B-A3B 记录

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 4 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 8 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

这个显存占用不超过10G

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 11 --kt-threadpool-count 1 --kt-num-gpu-experts 16 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.88 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU利用率98%、内存使用20G

显存利用12.3G、利用率61%

2026-03-31 18:47:20 Prefill batch, #new-seq: 1, #new-token: 74, #cached-token: 0, full token usage: 0.01, mamba usage: 0.07, #running-req: 0, #queue-req: 1, input throughput (token/s): 0.00, cuda graph: False

2026-03-31 18:47:34 Prefill batch, #new-seq: 1, #new-token: 95, #cached-token: 0, full token usage: 0.02, mamba usage: 0.14, #running-req: 1, #queue-req: 0, input throughput (token/s): 5.28, cuda graph: False

2026-03-31 18:48:23 Decode batch, #running-req: 2, #full token: 249, full token usage: 0.03, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 0.20, #queue-req: 0

2026-03-31 18:49:06 Decode batch, #running-req: 2, #full token: 329, full token usage: 0.04, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 1.84, #queue-req: 0

2026-03-31 18:49:44 Decode batch, #running-req: 2, #full token: 409, full token usage: 0.05, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 2.15, #queue-req: 0

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 24 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.85 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU利用率75%、内存使用20.2G

显存利用12.9G、利用率42%

2026-03-31 18:58:51 Load weight end. elapsed=255.03 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=3.40 GB, mem usage=11.35 GB.

2026-03-31 18:58:51 Using KV cache dtype: torch.bfloat16

2026-03-31 18:58:51 Mamba Cache is allocated. max_mamba_cache_size: 9, conv_state size: 0.01GB, ssm_state size: 0.59GB

2026-03-31 18:58:51 KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB

2026-03-31 18:58:51 Memory pool end. avail mem=2.66 GB

2026-03-31 18:58:53 Using hybrid linear attention backend for hybrid GDN models.

2026-03-31 18:58:53 CuTe DSL GDN decode enabled: False

2026-03-31 18:58:54 max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=3, context_len=262144, available_gpu_mem=2.63 GB

2026-03-31 18:59:00\] INFO: Started server process \[7941

2026-03-31 18:59:00 INFO: Waiting for application startup.

2026-03-31 18:59:00 Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}

2026-03-31 18:59:01 INFO: Application startup complete.

2026-03-31 18:59:01 The server is fired up and ready to roll!

2026-03-31 18:59:01 INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)

...

2026-03-31 19:01:36 Decode batch, #running-req: 1, #full token: 516, full token usage: 0.13, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.79, #queue-req: 1

2026-03-31 19:01:41 Decode batch, #running-req: 1, #full token: 556, full token usage: 0.14, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1

2026-03-31 19:01:46 Decode batch, #running-req: 1, #full token: 596, full token usage: 0.15, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.23, #queue-req: 1

2026-03-31 19:01:50 Decode batch, #running-req: 1, #full token: 636, full token usage: 0.16, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1

2026-03-31 19:01:55 Decode batch, #running-req: 1, #full token: 676, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.27, #queue-req: 1

2026-03-31 19:01:59 Decode batch, #running-req: 1, #full token: 716, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.92, #queue-req: 1

2026-03-31 19:02:03 Decode batch, #running-req: 1, #full token: 756, full token usage: 0.18, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.22, #queue-req: 1

2026-03-31 19:02:08 Decode batch, #running-req: 1, #full token: 796, full token usage: 0.19, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.81, #queue-req: 1

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 32 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.96 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU利用率70%、内存使用20.6G

显存利用14.6G、利用率47%

2026-03-31 19:18:13 Load weight end. elapsed=280.22 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=1.50 GB, mem usage=13.24 GB.

2026-03-31 19:18:13 Using KV cache dtype: torch.bfloat16

2026-03-31 19:18:13 Mamba Cache is allocated. max_mamba_cache_size: 7, conv_state size: 0.01GB, ssm_state size: 0.47GB

2026-03-31 19:18:13 KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB

2026-03-31 19:18:14 Memory pool end. avail mem=1.00 GB

2026-03-31 19:18:15 Using hybrid linear attention backend for hybrid GDN models.

2026-03-31 19:18:15 CuTe DSL GDN decode enabled: False

2026-03-31 19:18:17 max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=2, context_len=262144, available_gpu_mem=1.01 GB

2026-03-31 19:18:22\] INFO: Started server process \[8142

2026-03-31 19:18:22 INFO: Waiting for application startup.

2026-03-31 19:18:22 Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}

2026-03-31 19:18:24 INFO: Application startup complete.

2026-03-31 19:18:24 The server is fired up and ready to roll!

2026-03-31 19:18:24 INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)

nvcc warning : incompatible redefinition for option 'std', the last value of this option was used

nvcc warning : incompatible redefinition for option 'optimize', the last value of this option was used

nvcc fatal : Unsupported gpu architecture 'compute_120'

ninja: build stopped: subcommand failed.

2026-03-31 19:21:34 Decode batch, #running-req: 1, #full token: 241, full token usage: 0.06, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.26, #queue-req: 0

2026-03-31 19:21:39 Decode batch, #running-req: 1, #full token: 281, full token usage: 0.07, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.17, #queue-req: 0

2026-03-31 19:21:43 Decode batch, #running-req: 1, #full token: 321, full token usage: 0.08, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.87, #queue-req: 0

2026-03-31 19:21:48 Decode batch, #running-req: 1, #full token: 361, full token usage: 0.09, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.17, #queue-req: 0

2026-03-31 19:21:53 Decode batch, #running-req: 1, #full token: 401, full token usage: 0.10, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.47, #queue-req: 0

2026-03-31 19:21:58 Decode batch, #running-req: 1, #full token: 441, full token usage: 0.11, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.10, #queue-req: 0

相关推荐
夏天测几秒前
微信小程序自动化漏洞挖掘流水线:从缓存提取到密钥验证全流程实战
python·网络安全·微信小程序·漏洞挖掘
是苏浙14 分钟前
Java实现链表1
java·开发语言
未若君雅裁16 分钟前
上传数据安全:对称加密、非对称加密、签名与重放防护
java·安全
叫我:松哥21 分钟前
基于Python的共享单车租赁数据分析与预测系统,技术栈flask+boostrap+随机森林+XGBoost
人工智能·python·深度学习·算法·随机森林·数据分析·flask
可乐ea24 分钟前
【Spring Boot + MyBatis|第7篇】JWT 登录认证与拦截器实现
java·spring boot·后端·mybatis·状态模式
梵得儿SHI30 分钟前
Vue 项目实战与性能优化全攻略:从代码、渲染到首屏,一站式解决卡顿慢加载
前端·vue.js·性能优化·vite·前端面试·前端优化·首屏优化
Li#31 分钟前
web端电商项目自动下单发货评价晒图需要用到的能力
python·自动化
ShyanZh33 分钟前
【skill】HTML PPT Skill:用 Claude Code 一句话生成专业演示文稿
前端·ai·html·powerpoint·skill
步步为营DotNet39 分钟前
借助 C# 14 特性强化 .NET 后端数据验证的深度实践
java·c#·.net
AI视觉网奇39 分钟前
three教学 3d资产拼接源代码
前端·css·css3