(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 4 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune
(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 8 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune
这个显存占用不超过10G
(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 11 --kt-threadpool-count 1 --kt-num-gpu-experts 16 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.88 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune
CPU利用率98%、内存使用20G
显存利用12.3G、利用率61%
2026-03-31 18:47:20 Prefill batch, #new-seq: 1, #new-token: 74, #cached-token: 0, full token usage: 0.01, mamba usage: 0.07, #running-req: 0, #queue-req: 1, input throughput (token/s): 0.00, cuda graph: False
2026-03-31 18:47:34 Prefill batch, #new-seq: 1, #new-token: 95, #cached-token: 0, full token usage: 0.02, mamba usage: 0.14, #running-req: 1, #queue-req: 0, input throughput (token/s): 5.28, cuda graph: False
2026-03-31 18:48:23 Decode batch, #running-req: 2, #full token: 249, full token usage: 0.03, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 0.20, #queue-req: 0
2026-03-31 18:49:06 Decode batch, #running-req: 2, #full token: 329, full token usage: 0.04, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 1.84, #queue-req: 0
2026-03-31 18:49:44 Decode batch, #running-req: 2, #full token: 409, full token usage: 0.05, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 2.15, #queue-req: 0
(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 24 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.85 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune
CPU利用率75%、内存使用20.2G
显存利用12.9G、利用率42%
2026-03-31 18:58:51 Load weight end. elapsed=255.03 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=3.40 GB, mem usage=11.35 GB.
2026-03-31 18:58:51 Using KV cache dtype: torch.bfloat16
2026-03-31 18:58:51 Mamba Cache is allocated. max_mamba_cache_size: 9, conv_state size: 0.01GB, ssm_state size: 0.59GB
2026-03-31 18:58:51 KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB
2026-03-31 18:58:51 Memory pool end. avail mem=2.66 GB
2026-03-31 18:58:53 Using hybrid linear attention backend for hybrid GDN models.
2026-03-31 18:58:53 CuTe DSL GDN decode enabled: False
2026-03-31 18:58:54 max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=3, context_len=262144, available_gpu_mem=2.63 GB
2026-03-31 18:59:00\] INFO: Started server process \[7941
2026-03-31 18:59:00 INFO: Waiting for application startup.
2026-03-31 18:59:00 Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
2026-03-31 18:59:01 INFO: Application startup complete.
2026-03-31 18:59:01 The server is fired up and ready to roll!
2026-03-31 18:59:01 INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
...
2026-03-31 19:01:36 Decode batch, #running-req: 1, #full token: 516, full token usage: 0.13, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.79, #queue-req: 1
2026-03-31 19:01:41 Decode batch, #running-req: 1, #full token: 556, full token usage: 0.14, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1
2026-03-31 19:01:46 Decode batch, #running-req: 1, #full token: 596, full token usage: 0.15, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.23, #queue-req: 1
2026-03-31 19:01:50 Decode batch, #running-req: 1, #full token: 636, full token usage: 0.16, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1
2026-03-31 19:01:55 Decode batch, #running-req: 1, #full token: 676, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.27, #queue-req: 1
2026-03-31 19:01:59 Decode batch, #running-req: 1, #full token: 716, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.92, #queue-req: 1
2026-03-31 19:02:03 Decode batch, #running-req: 1, #full token: 756, full token usage: 0.18, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.22, #queue-req: 1
2026-03-31 19:02:08 Decode batch, #running-req: 1, #full token: 796, full token usage: 0.19, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.81, #queue-req: 1
(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 32 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.96 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune
CPU利用率70%、内存使用20.6G
显存利用14.6G、利用率47%
2026-03-31 19:18:13 Load weight end. elapsed=280.22 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=1.50 GB, mem usage=13.24 GB.
2026-03-31 19:18:13 Using KV cache dtype: torch.bfloat16
2026-03-31 19:18:13 Mamba Cache is allocated. max_mamba_cache_size: 7, conv_state size: 0.01GB, ssm_state size: 0.47GB
2026-03-31 19:18:13 KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB
2026-03-31 19:18:14 Memory pool end. avail mem=1.00 GB
2026-03-31 19:18:15 Using hybrid linear attention backend for hybrid GDN models.
2026-03-31 19:18:15 CuTe DSL GDN decode enabled: False
2026-03-31 19:18:17 max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=2, context_len=262144, available_gpu_mem=1.01 GB
2026-03-31 19:18:22\] INFO: Started server process \[8142
2026-03-31 19:18:22 INFO: Waiting for application startup.
2026-03-31 19:18:22 Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
2026-03-31 19:18:24 INFO: Application startup complete.
2026-03-31 19:18:24 The server is fired up and ready to roll!
2026-03-31 19:18:24 INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
nvcc warning : incompatible redefinition for option 'std', the last value of this option was used
nvcc warning : incompatible redefinition for option 'optimize', the last value of this option was used
nvcc fatal : Unsupported gpu architecture 'compute_120'
ninja: build stopped: subcommand failed.
2026-03-31 19:21:34 Decode batch, #running-req: 1, #full token: 241, full token usage: 0.06, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.26, #queue-req: 0
2026-03-31 19:21:39 Decode batch, #running-req: 1, #full token: 281, full token usage: 0.07, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.17, #queue-req: 0
2026-03-31 19:21:43 Decode batch, #running-req: 1, #full token: 321, full token usage: 0.08, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.87, #queue-req: 0
2026-03-31 19:21:48 Decode batch, #running-req: 1, #full token: 361, full token usage: 0.09, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.17, #queue-req: 0
2026-03-31 19:21:53 Decode batch, #running-req: 1, #full token: 401, full token usage: 0.10, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.47, #queue-req: 0
2026-03-31 19:21:58 Decode batch, #running-req: 1, #full token: 441, full token usage: 0.11, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.10, #queue-req: 0