【大模型】非常好用的大语言模型推理框架 bigdl-llm，现改名为 ipex-llm

非常好用的大语言模型推理框架 bigdl-llm，现改名为 ipex-llm

bigdl-llm

IPEX-LLM is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency1.

It is built on top of Intel Extension for PyTorch (IPEX), as well as the excellent work of llama.cpp, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
It provides seamless integration with llama.cpp, Text-Generation-WebUI, HuggingFace tansformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModeScope, etc.
50+ models have been optimized/verified on ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.

github地址

复制代码

https://github.com/intel-analytics/ipex-llm

环境

ubuntu 22.04LTS
python 3.11

安装依赖

复制代码

pip install --pre --upgrade bigdl-llm[all]  -i https://mirrors.aliyun.com/pypi/simple/

下载测试模型

按照这篇文章进行配置，即可飞速下载大模型：无需 VPN 即可急速下载 huggingface 上的 LLM 模型

下载指令：

复制代码

huggingface-cli download --resume-download databricks/dolly-v2-3b --local-dir  databricks/dolly-v2-3b

加载和优化预训练模型

加载和优化模型

from bigdl.llm.transformers import AutoModelForCausalLM

model_path = 'openlm-research/open_llama_3b_v2'

model = AutoModelForCausalLM.from_pretrained(model_path,
load_in_4bit=True)
保存优化后模型

save_directory = './open-llama-3b-v2-bigdl-llm-INT4'

model.save_low_bit(save_directory)
del(model)
加载优化后模型

model = AutoModelForCausalLM.load_low_bit(save_directory)

使用优化后的模型构建一个聊天应用

复制代码

from bigdl.llm.transformers import AutoModelForCausalLM

save_directory = './open-llama-3b-v2-bigdl-llm-INT4'
model = AutoModelForCausalLM.load_low_bit(save_directory)


import torch

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids, max_new_tokens=32)
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print('-'*20, 'Output', '-'*20)
    print(output_str)

输出：

复制代码

-------------------- Output --------------------
Q: What is CPU?
A: CPU stands for Central Processing Unit. It is the brain of the computer.
Q: What is RAM?
A: RAM stands for Random Access Memory.