LLM Xinference 安装使用（支持CPU、Metal、CUDA推理和分布式部署）

1. 详细步骤

1.1 安装

复制代码

# CUDA/CPU
pip install "xinference[transformers]"
pip install "xinference[vllm]"
pip install "xinference[sglang]"

# Metal(MPS)
pip install "xinference[mlx]"
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

注：可能是 nvcc 版本等个人环境配置原因，llama-cpp-python 在 CUDA 上无法使用（C/C++ 环境上是正常的），Metal 的 llama-cpp-python 正常。如需安装 flashinfer 等依赖见官方安装文档：https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html

1.2 启动

1.2.1 直接启动

简洁命令

复制代码

xinference-local --host 0.0.0.0 --port 9997

多参数命令

设置模型缓存路径和模型来源（Hugging Face/Modelscope）

复制代码

# CUDA/CPU
XINFERENCE_HOME=/path/.xinference XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

# Metal(MPS)
XINFERENCE_HOME=/path/.xinference XINFERENCE_MODEL_SRC=modelscope PYTORCH_ENABLE_MPS_FALLBACK=1 xinference-local --host 0.0.0.0 --port 9997

1.2.2 集群部署

通过 ifconfig 查看当前服务器IP

1.2.2.1 主服务器启动 Supervisor

复制代码

# 格式
xinference-supervisor -H 当前服务器IP(主服务器IP) --port 9997

# 示例
xinference-supervisor -H 192.168.31.100 --port 9997

1.2.2.2 其他服务器启动 Worker

复制代码

# 格式
xinference-worker -e "http://${主服务器IP}:9997" -H 当前服务器IP(子服务器IP)

# 示例
xinference-worker -e "http://192.168.31.100:9997" -H 192.168.31.101

注：按需添加XINFERENCE_HOME、XINFERENCE_MODEL_SRC、PYTORCH_ENABLE_MPS_FALLBACK等环境变量（启动时参数）

1.3 使用

访问 http://主服务器IP:9997/docs 查看接口文档，访问 http://主服务器IP:9997 正常使用

2. 参考资料

2.1 Xinference

2.1.1 部署文档

本地运行 Xinference

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#run-xinference-locally

集群中部署 Xinference

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#deploy-xinference-in-a-cluster

2.1.2 安装文档

3. 资源

3.1 Xinference

3.1.1 GitHub

官方页面

https://github.com/xorbitsai/inference

https://github.com/xorbitsai/inference/blob/main/README_zh_CN.md

3.1.2 安装文档

SGLang 引擎

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#sglang-backend

其他平台（在昇腾 NPU 上安装）

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#other-platforms

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation_npu.html#installation-npu

LLM Xinference 安装使用（支持CPU、Metal、CUDA推理和分布式部署）

1. 详细步骤

1.1 安装

1.2 启动

1.2.1 直接启动

简洁命令

多参数命令

1.2.2 集群部署

1.2.2.1 主服务器启动 Supervisor

1.2.2.2 其他服务器启动 Worker

1.3 使用

2. 参考资料

2.1 Xinference

2.1.1 部署文档

本地运行 Xinference

集群中部署 Xinference

2.1.2 安装文档

官方页面

Transformers 引擎

vLLM 引擎

Llama.cpp 引擎

MLX 引擎

3. 资源

3.1 Xinference

3.1.1 GitHub

官方页面

3.1.2 安装文档

SGLang 引擎

其他平台（在昇腾 NPU 上安装）