windows 直接安装llama.cpp的方法

winget install llama.cpp

如果下载那步卡住，可以复制显示出来的链接用讯雷下载后，解压即可使用，需要手动配置Path环境变量指向该目录。

讯雷下载时没有资源下载的话，先转到云盘再从云盘里下载下来。

winget 安装的是vulkan版本，如果想安装cuda版本，可到此下载。

vulkan：支持英伟达gpu、intel igpu（核显）、CPU

cuda：支持英伟达gpu、CPU

资料显示，都使用英伟达gpu时，cuda比vulkan速度上快约 30-40%，实测快10-15%。

使用cuda版本还有个好处，当ngl 设置为99时，当显存不足时优先使用显存再用共享内存补齐，也能跑模型。用vulkan版本时，当显存不足会直接加载失败。

因为hdf.sh无法正常链接https://huggingface.co/settings/tokens 注册用户和获取token，

使用由阿里巴巴通义实验室，联合CCF开源发展技术委员会的社区下载。

llama-server -m D:\llama.cpp\models\Qwen3.5-4B-Q4_0.gguf -a Qwen3.5-4B-Q4_0 -b 512 -ngl 99 -rea auto --mlock --port 11444 -c 65535

llama-server --models-dir D:\llama.cpp\models -b 512 -ngl 99 -rea auto --mlock --port 11444 -c 65535

llama-server --models-dir D:\llama.cpp\models -b 512 -ngl 99 -rea auto --mlock --port 11444 --models-max 1

--mlock：锁死内存，防止使用虚拟内存导致的全机卡顿（最重要！）。

-a：设置模型别名。

-b 512：增大批处理，显著减少"首字等待时间"（从 7 秒降到 2 秒左右的关键），这个值是每次模型前向推理时最多可以同时处理的 token 数量，越大每次处理的越多。在Openclaw中使用时建议设置大一点，如8192。

-ngl 99：0全使用cpu，99全使用gpu。当显存够时用99。

-rea, --reasoning $on\|off\|auto$ ，在对聊天中使用 reasoning/thinking ('on', 'off', or 'auto', 默认: 'auto' (detect from template))。

--models-max：路由模式下内存驻留的最大模型数。

-ctk q4_0， KV 缓存中 K 的数据类型(allowed: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)，显存小选q4_0。

-ctv q4_0， KV 缓存中 V 的数据类型(allowed: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)，显存小选q4_0。

-ctkd q4_0，草稿模型 KV 缓存中 K 的数据类型(allowed: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)，显存小选q4_0。

-ctvd q4_0，草稿模型 KV 缓存中 V 的数据类型(allowed: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1, default: f16)，显存小选q4_0。