Overview
The defining characteristic of large models is that they are large: on a single machine with a single GPU, inference is slow, and the weights may not even fit in VRAM. After all, not everyone has an H100/A100; having a 3090 is already lucky.
Quite a few frameworks now support distributed deployment of large models, using parallelism to speed up inference, not only single-node multi-GPU but also multi-node multi-GPU.
I don't have much hands-on experience with them myself, so this is a quick list for my own reference. Corrections are welcome in the comments.
Framework | Vendor | Repository |
---|---|---|
FasterTransformer | NVIDIA | NVIDIA/FasterTransformer |
TGI | Hugging Face | huggingface/text-generation-inference |
vLLM | UC Berkeley LMSYS | vllm-project/vllm |
DeepSpeed | Microsoft | github.com/microsoft/DeepSpeed |
LMDeploy | OpenMMLab | InternLM/lmdeploy |
TurboTransformers | Tencent | Tencent/TurboTransformers |
FasterTransformer / TensorRT-LLM
FasterTransformer is NVIDIA's inference solution for large models, but it will likely see little further maintenance: NVIDIA has announced a newer framework, TensorRT-LLM, which is currently in early access (application required) and will presumably be fully open-sourced later.
Models supported by FasterTransformer:
Models | Framework | FP16 | INT8 (after Turing) | Sparsity (after Ampere) | Tensor parallel | Pipeline parallel | FP8 (after Hopper) |
---|---|---|---|---|---|---|---|
BERT | TensorFlow | Yes | Yes | - | - | - | - |
BERT | PyTorch | Yes | Yes | Yes | Yes | Yes | - |
BERT | Triton backend | Yes | - | - | Yes | Yes | - |
BERT | C++ | Yes | Yes | - | - | - | Yes |
XLNet | C++ | Yes | - | - | - | - | - |
Encoder | TensorFlow | Yes | Yes | - | - | - | - |
Encoder | PyTorch | Yes | Yes | Yes | - | - | - |
Decoder | TensorFlow | Yes | - | - | - | - | - |
Decoder | PyTorch | Yes | - | - | - | - | - |
Decoding | TensorFlow | Yes | - | - | - | - | - |
Decoding | PyTorch | Yes | - | - | - | - | - |
GPT | TensorFlow | Yes | - | - | - | - | - |
GPT/OPT | PyTorch | Yes | - | - | Yes | Yes | Yes |
GPT/OPT | Triton backend | Yes | - | - | Yes | Yes | - |
GPT-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
BLOOM | PyTorch | Yes | - | - | Yes | Yes | - |
BLOOM | Triton backend | Yes | - | - | Yes | Yes | - |
GPT-J | Triton backend | Yes | - | - | Yes | Yes | - |
Longformer | PyTorch | Yes | - | - | - | - | - |
T5/UL2 | PyTorch | Yes | - | - | Yes | Yes | - |
T5 | TensorFlow 2 | Yes | - | - | - | - | - |
T5/UL2 | Triton backend | Yes | - | - | Yes | Yes | - |
T5 | TensorRT | Yes | - | - | Yes | Yes | - |
T5-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
Swin Transformer | PyTorch | Yes | Yes | - | - | - | - |
Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
ViT | PyTorch | Yes | Yes | - | - | - | - |
ViT | TensorRT | Yes | Yes | - | - | - | - |
GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
WeNet | C++ | Yes | - | - | - | - | - |
DeBERTa | TensorFlow 2 | Yes | - | - | On-going | On-going | - |
DeBERTa | PyTorch | Yes | - | - | On-going | On-going | - |
References:
"H100 inference throughput up 8x! NVIDIA announces open-sourcing of TensorRT-LLM, with 10+ models supported"
"NVIDIA releases TensorRT-LLM with up to 8x higher performance; when will it officially ship, and what do you expect from it?"
TGI (huggingface/text-generation-inference)
TGI is Hugging Face's official inference framework. According to the benchmark data in "A note: comparing mainstream inference frameworks on Llama 2", TGI outperforms vLLM when running LLaMA-13B. A minimal sketch of querying a TGI server follows.
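To give a feel for how a deployed TGI server is used, here is a minimal sketch of calling its REST `/generate` endpoint from Python. It assumes a server is already running on `localhost:8080` (e.g. launched from the official Docker image); the host, port, prompt, and generation parameters are all placeholders.

```python
# Minimal sketch: querying a running TGI server via its /generate endpoint.
import requests

resp = requests.post(
    "http://localhost:8080/generate",  # assumption: TGI is listening here
    json={
        "inputs": "What is distributed inference?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```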
Optimized architectures
- BLOOM
- FLAN-T5
- Galactica
- GPT-Neox
- Llama
- OPT
- SantaCoder
- Starcoder
- Falcon 7B
- Falcon 40B
- MPT
- Llama V2
- Code Llama
Other architectures are supported on a best-effort basis using:

```python
AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")
```

or

```python
AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
```
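Spelled out as a runnable snippet, that fallback path looks roughly like the sketch below (the model ID `gpt2` is a placeholder; `device_map="auto"` requires the `accelerate` package to shard weights across available devices):

```python
# Minimal sketch: best-effort loading of a causal LM with the transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; substitute any causal-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the weights over available devices (needs accelerate)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Distributed inference is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```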
References:
huggingface/text-generation-inference
vLLM
vLLM is a high-throughput LLM inference framework open-sourced by the LMSYS group at UC Berkeley; it greatly improves the throughput and memory efficiency of LLM serving in real-time scenarios. A minimal usage sketch follows the model list below.
vLLM seamlessly supports many Huggingface models, including the following architectures:
- Aquila (`BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan (`baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Chat`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
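Here is a minimal offline-inference sketch using vLLM's Python API (the model ID, prompts, and sampling settings are placeholders; choose a model that fits your GPUs):

```python
# Minimal sketch: offline batched generation with vLLM.
from vllm import LLM, SamplingParams

# Placeholder model; pass tensor_parallel_size=N to shard across N GPUs.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is", "Large-model inference is slow because"]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```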
References:
"24x faster than HuggingFace! Berkeley's LLM inference system goes open source, crushing SOTA and halving GPU count"
DeepSpeed
DeepSpeed is Microsoft's toolkit for distributed training of large models, built mainly around the ZeRO family of parallelism algorithms.
The framework handles both training and inference; a colleague of mine used it to run inference on Baichuan-13B and it worked as expected. A minimal inference sketch follows.
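For illustration, here is a minimal tensor-parallel inference sketch built on `deepspeed.init_inference`, run under the DeepSpeed launcher (e.g. `deepspeed --num_gpus 2 infer.py`); the model ID and the `mp_size` value are placeholders:

```python
# Minimal sketch: wrapping a Hugging Face model with DeepSpeed-Inference.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# mp_size=2 shards the model across 2 GPUs (tensor parallelism);
# replace_with_kernel_inject swaps in DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("DeepSpeed makes inference", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```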
DeepSpeed has been used to train many different large-scale models, below is a list of several examples that we are aware of (if you'd like to include your model please submit a PR):
- Megatron-Turing NLG (530B)
- Jurassic-1 (178B)
- BLOOM (176B)
- GLM (130B)
- xTrimoPGLM (100B)
- YaLM (100B)
- GPT-NeoX (20B)
- AlexaTM (20B)
- Turing NLG (17B)
- METRO-LM (5.4B)
参考资料:
github.com/microsoft/DeepSpeed
LMDeploy
LMDeploy, developed jointly by the MMDeploy and MMRazor teams, is a complete toolkit for compressing, deploying, and serving LLMs.
Supported models (a usage sketch follows the table):
Note
W4A16 inference requires an NVIDIA GPU with the Ampere architecture or newer.
Model | Model parallelism | FP16 | KV INT8 | W4A16 | W8A8 |
---|---|---|---|---|---|
Llama | Yes | Yes | Yes | Yes | No |
Llama2 | Yes | Yes | Yes | Yes | No |
InternLM | Yes | Yes | Yes | Yes | No |
QWen-7B | Yes | Yes | Yes | No | No |
Baichuan-7B | Yes | Yes | Yes | Yes | No |
Baichuan2-7B | Yes | Yes | No | No | No |
Code Llama | Yes | Yes | No | No | No |
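As a usage sketch, LMDeploy's high-level `pipeline` API (available in recent releases) can serve one of the models above. The model ID below is a placeholder, and the exact API surface should be verified against the version you install:

```python
# Minimal sketch: generating text with LMDeploy's high-level pipeline API.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm-chat-7b")  # placeholder model ID
responses = pipe(["What does W4A16 quantization mean?"])
print(responses[0].text)
```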
References:
InternLM/lmdeploy
TurboTransformers
TurboTransformers is Tencent's open-source framework for accelerating model inference.
The range of supported model types is still fairly limited. A minimal sketch of its conversion API follows.
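To show the general shape of the API, here is a minimal BERT-acceleration sketch loosely based on the project's README; treat the exact names (e.g. `turbo_transformers.BertModel.from_torch`) as assumptions to verify against the repo:

```python
# Minimal sketch: converting a PyTorch BERT into a TurboTransformers model.
import torch
import transformers
import turbo_transformers  # built/installed from Tencent/TurboTransformers

# Load a standard Hugging Face BERT and freeze it for inference
torch_model = transformers.BertModel.from_pretrained("bert-base-uncased")
torch_model.eval()

# Wrap the weights with TurboTransformers' accelerated implementation
tt_model = turbo_transformers.BertModel.from_torch(torch_model)

input_ids = torch.tensor([[101, 7592, 1010, 7632, 102]])  # toy token IDs
outputs = tt_model(input_ids)  # mirrors the torch model's forward outputs
print(outputs)
```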
References:
"Tencent open-sources TurboTransformers, with inference acceleration surpassing mainstream engines such as TensorRT"