KT Qwen3.5-35B-A3B Notes

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 4 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 9 --kt-threadpool-count 1 --kt-num-gpu-experts 8 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.6 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

This configuration keeps VRAM usage under 10 GB.
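The weight-load lines later in this log report mem usage=11.35 GB with --kt-num-gpu-experts 24 and 13.24 GB with 32, which allows a back-of-envelope estimate of the VRAM cost per GPU-resident expert (a rough linear fit over this log's numbers only, not a general rule):

```python
# Rough per-expert VRAM cost, from the "Load weight end ... mem usage"
# figures captured later in this log.
usage_24 = 11.35  # GB reported with --kt-num-gpu-experts 24
usage_32 = 13.24  # GB reported with --kt-num-gpu-experts 32

per_expert = (usage_32 - usage_24) / (32 - 24)  # GB per GPU-resident expert
base = usage_24 - 24 * per_expert               # non-expert weights + overhead

print(f"~{per_expert:.2f} GB per GPU expert, ~{base:.2f} GB base")
print(f"predicted weight memory @ 8 experts: ~{base + 8 * per_expert:.2f} GB")
```

The predicted ~7.6 GB of weight memory at 8 GPU experts is consistent with the "under 10 GB" observation above once KV/Mamba caches and activations are added.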

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 11 --kt-threadpool-count 1 --kt-num-gpu-experts 16 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.88 --max-total-tokens 8192 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU utilization 98%, RAM usage 20 GB

VRAM usage 12.3 GB, GPU utilization 61%
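Once a launch like the above reports it is ready, the server can be smoke-tested over SGLang's OpenAI-compatible endpoint. A minimal sketch using only the standard library (the prompt is a placeholder, and the `model` field is nominal since SGLang serves the single launched model):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "default",  # nominal; the server hosts one model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, prompt: str) -> str:
    """POST a chat request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("http://127.0.0.1:30000", "Say hi in one word."))
```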

[2026-03-31 18:47:20] Prefill batch, #new-seq: 1, #new-token: 74, #cached-token: 0, full token usage: 0.01, mamba usage: 0.07, #running-req: 0, #queue-req: 1, input throughput (token/s): 0.00, cuda graph: False
[2026-03-31 18:47:34] Prefill batch, #new-seq: 1, #new-token: 95, #cached-token: 0, full token usage: 0.02, mamba usage: 0.14, #running-req: 1, #queue-req: 0, input throughput (token/s): 5.28, cuda graph: False
[2026-03-31 18:48:23] Decode batch, #running-req: 2, #full token: 249, full token usage: 0.03, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 0.20, #queue-req: 0
[2026-03-31 18:49:06] Decode batch, #running-req: 2, #full token: 329, full token usage: 0.04, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 1.84, #queue-req: 0
[2026-03-31 18:49:44] Decode batch, #running-req: 2, #full token: 409, full token usage: 0.05, mamba num: 4, mamba usage: 0.14, cuda graph: False, gen throughput (token/s): 2.15, #queue-req: 0

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 24 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.85 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU utilization 75%, RAM usage 20.2 GB

VRAM usage 12.9 GB, GPU utilization 42%

[2026-03-31 18:58:51] Load weight end. elapsed=255.03 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=3.40 GB, mem usage=11.35 GB.
[2026-03-31 18:58:51] Using KV cache dtype: torch.bfloat16
[2026-03-31 18:58:51] Mamba Cache is allocated. max_mamba_cache_size: 9, conv_state size: 0.01GB, ssm_state size: 0.59GB
[2026-03-31 18:58:51] KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB
[2026-03-31 18:58:51] Memory pool end. avail mem=2.66 GB
[2026-03-31 18:58:53] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-31 18:58:53] CuTe DSL GDN decode enabled: False
[2026-03-31 18:58:54] max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=3, context_len=262144, available_gpu_mem=2.63 GB
[2026-03-31 18:59:00] INFO: Started server process [7941]

[2026-03-31 18:59:00] INFO: Waiting for application startup.
[2026-03-31 18:59:00] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
[2026-03-31 18:59:01] INFO: Application startup complete.
[2026-03-31 18:59:01] The server is fired up and ready to roll!
[2026-03-31 18:59:01] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
...
[2026-03-31 19:01:36] Decode batch, #running-req: 1, #full token: 516, full token usage: 0.13, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.79, #queue-req: 1
[2026-03-31 19:01:41] Decode batch, #running-req: 1, #full token: 556, full token usage: 0.14, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1
[2026-03-31 19:01:46] Decode batch, #running-req: 1, #full token: 596, full token usage: 0.15, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.23, #queue-req: 1
[2026-03-31 19:01:50] Decode batch, #running-req: 1, #full token: 636, full token usage: 0.16, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.82, #queue-req: 1
[2026-03-31 19:01:55] Decode batch, #running-req: 1, #full token: 676, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.27, #queue-req: 1
[2026-03-31 19:01:59] Decode batch, #running-req: 1, #full token: 716, full token usage: 0.17, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.92, #queue-req: 1
[2026-03-31 19:02:03] Decode batch, #running-req: 1, #full token: 756, full token usage: 0.18, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 9.22, #queue-req: 1
[2026-03-31 19:02:08] Decode batch, #running-req: 1, #full token: 796, full token usage: 0.19, mamba num: 2, mamba usage: 0.22, cuda graph: False, gen throughput (token/s): 8.81, #queue-req: 1

(ktlab) root@DESKTOP-9TRG62N:~/ktransformers# /root/miniconda3/envs/ktlab/bin/python3.11 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B" --kt-weight-path "/mnt/d/Caches/LLM Models/Qwen3.5/Qwen3.5-35B-A3B-Q4_K_M" --kt-cpuinfer 6 --kt-threadpool-count 1 --kt-num-gpu-experts 32 --kt-method LLAMAFILE --attention-backend triton --trust-remote-code --mem-fraction-static 0.96 --max-total-tokens 4096 --chunked-prefill-size 1024 --enable-mixed-chunk --disable-shared-experts-fusion --disable-cuda-graph --skip-server-warmup --disable-flashinfer-autotune

CPU utilization 70%, RAM usage 20.6 GB

VRAM usage 14.6 GB, GPU utilization 47%

[2026-03-31 19:18:13] Load weight end. elapsed=280.22 s, type=Qwen3_5MoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=1.50 GB, mem usage=13.24 GB.
[2026-03-31 19:18:13] Using KV cache dtype: torch.bfloat16
[2026-03-31 19:18:13] Mamba Cache is allocated. max_mamba_cache_size: 7, conv_state size: 0.01GB, ssm_state size: 0.47GB
[2026-03-31 19:18:13] KV Cache is allocated. #tokens: 4096, K size: 0.04 GB, V size: 0.04 GB
[2026-03-31 19:18:14] Memory pool end. avail mem=1.00 GB
[2026-03-31 19:18:15] Using hybrid linear attention backend for hybrid GDN models.
[2026-03-31 19:18:15] CuTe DSL GDN decode enabled: False
[2026-03-31 19:18:17] max_total_num_tokens=4096, chunked_prefill_size=1024, max_prefill_tokens=16384, max_running_requests=2, context_len=262144, available_gpu_mem=1.01 GB
[2026-03-31 19:18:22] INFO: Started server process [8142]

[2026-03-31 19:18:22] INFO: Waiting for application startup.
[2026-03-31 19:18:22] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
[2026-03-31 19:18:24] INFO: Application startup complete.
[2026-03-31 19:18:24] The server is fired up and ready to roll!
[2026-03-31 19:18:24] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
nvcc warning : incompatible redefinition for option 'std', the last value of this option was used
nvcc warning : incompatible redefinition for option 'optimize', the last value of this option was used
nvcc fatal : Unsupported gpu architecture 'compute_120'
ninja: build stopped: subcommand failed.
[2026-03-31 19:21:34] Decode batch, #running-req: 1, #full token: 241, full token usage: 0.06, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.26, #queue-req: 0
[2026-03-31 19:21:39] Decode batch, #running-req: 1, #full token: 281, full token usage: 0.07, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 9.17, #queue-req: 0
[2026-03-31 19:21:43] Decode batch, #running-req: 1, #full token: 321, full token usage: 0.08, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.87, #queue-req: 0
[2026-03-31 19:21:48] Decode batch, #running-req: 1, #full token: 361, full token usage: 0.09, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.17, #queue-req: 0
[2026-03-31 19:21:53] Decode batch, #running-req: 1, #full token: 401, full token usage: 0.10, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.47, #queue-req: 0
[2026-03-31 19:21:58] Decode batch, #running-req: 1, #full token: 441, full token usage: 0.11, mamba num: 2, mamba usage: 0.29, cuda graph: False, gen throughput (token/s): 8.10, #queue-req: 0
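The decode throughput figures scattered through the logs above can be summarized with a short parser (a sketch matching the "Decode batch" line format captured in this session):

```python
import re
from statistics import mean

# Matches the "gen throughput (token/s): X" field in SGLang decode-batch lines.
THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([\d.]+)")

def gen_throughputs(log_text: str) -> list[float]:
    """Extract every decode-batch generation throughput from a log dump."""
    return [float(m) for m in THROUGHPUT_RE.findall(log_text)]

# Two sample lines in the same format as the logs above.
sample = """\
[2026-03-31 19:21:34] Decode batch, #running-req: 1, gen throughput (token/s): 9.26, #queue-req: 0
[2026-03-31 19:21:39] Decode batch, #running-req: 1, gen throughput (token/s): 9.17, #queue-req: 0
"""
values = gen_throughputs(sample)
print(f"n={len(values)}, mean={mean(values):.2f} tok/s")
```

Pasting a whole session into `sample` gives the per-run average directly, e.g. roughly 8.7 tok/s for the 24-expert run versus roughly 2 tok/s for the earlier 16-expert run with --kt-cpuinfer 11.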
