LLAMA-CLI 运行千问3.6（R9-7945HX+64G+RTX40608G）

liulilittle2026-05-06 13:06

Max Support:

Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf [37 Token/S]
Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf [31 Token/S]
Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf [16 Token/S]
Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf [13 Token/S]

Slow: 16Token/S

powershell 复制代码

llama-cli -m E:\models\Qwen3.6-35B-A3B\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf `
   --temp 0.2 --top-p 1.0 --top-k 1 --repeat-penalty 1.0 --presence-penalty 0.0 `
   --seed 42 --jinja -cnv -ngl 100 --n-cpu-moe 32 -t 16 `
   --ctx-size 262144 -n 81920 `
   --chat-template-kwargs '{\"enable_thinking\": true}'

Slow-2: 17Token/S

bash 复制代码

llama-cli -m E:\models\Qwen3.6-35B-A3B\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf `
     --temp 0.2 --top-p 1.0 --top-k 1 --repeat-penalty 1.0 --presence-penalty 0.0 `
    --seed 42 --jinja -cnv -ngl 100 --n-cpu-moe 32 -t 16 `
    --chat-template-kwargs '{\"enable_thinking\": true}' -c 262144

Fast: 37Token/S

powershell 复制代码

llama-cli -m E:\models\Qwen3.6-35B-A3B\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf `
   --temp 0.2 --top-p 1.0 --top-k 1 --repeat-penalty 1.0 --presence-penalty 0.0 `
   --seed 42 --jinja -cnv -ngl 100 --n-cpu-moe 32 -t 16 `
   --chat-template-kwargs '{\"enable_thinking\": true}'

API:

powershell 复制代码

llama-server -m E:\models\Qwen3.6-35B-A3B\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf `
  --host 0.0.0.0 --port 8080 `
  --temp 0.2 --top-p 1.0 --top-k 1 --repeat-penalty 1.0 --presence-penalty 0.0 `
  --seed 42 --jinja -ngl 100 --n-cpu-moe 32 -t 16 `
  --ctx-size 262144 -n 81920 `
  --chat-template-kwargs '{\"enable_thinking\": true}'

查看模型信息：http://127.0.0.1:8080/v1/models
查看模型状态：http://127.0.0.1:8080/v1/status