BigDL-LLM is a library for running LLMs (large language models) on Intel XPUs with INT4/FP4/INT8/FP8 precision at very low latency (and it works with any PyTorch model).
## Installing BigDL-LLM on an Intel CPU
### Installation
```bash
pip install --pre --upgrade bigdl-llm[all]
```
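To sanity-check the installation, you can try importing the main entry point used in the snippets below (just a smoke test, not an official verification step):

```bash
python -c "from bigdl.llm.transformers import AutoModelForCausalLM; print('bigdl-llm OK')"
```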
### Running a model
```python
# Load a Hugging Face Transformers model with INT4 optimizations
# (model_path: a local folder or a Hugging Face model id)
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# Run the optimized model on CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
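The quick start above targets Hugging Face Transformers models; the "any PyTorch model" claim in the introduction refers to BigDL-LLM's general `optimize_model` API. A minimal sketch, assuming the `bigdl.llm.optimize_model` entry point of recent releases (check your installed version) and a hypothetical loader of your own:

```python
from bigdl.llm import optimize_model

model = load_my_pytorch_model()  # hypothetical: any PyTorch model you already have
model = optimize_model(model)    # returns a low-bit-optimized version of the model
```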
## Installing BigDL-LLM on an Intel GPU
### Installation
```bash
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
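Once the oneAPI environment is configured (see the Baichuan2 section below), you can check that PyTorch sees the GPU. The `torch.xpu` namespace is provided by Intel Extension for PyTorch, which the `[xpu]` option pulls in:

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

print(torch.xpu.is_available())   # True if an Intel GPU is usable
print(torch.xpu.device_count())   # number of visible XPU devices
```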
### Running a model
```python
# Load a Hugging Face Transformers model with INT4 optimizations
# (model_path: a local folder or a Hugging Face model id)
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# Run the optimized model on an Intel GPU
model = model.to('xpu')

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
input_ids = input_ids.to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
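Note that `from_pretrained` re-quantizes the original checkpoint on every run. BigDL-LLM can persist the converted INT4 weights so that later runs load them directly; the `save_low_bit` / `load_low_bit` names below match recent releases, but treat this as a sketch and verify against your installed version:

```python
# Save the INT4-converted weights once...
model.save_low_bit(save_path)   # save_path: any local folder (placeholder)

# ...then reload them directly on later runs, skipping quantization
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.load_low_bit(save_path)
```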
## Optimizing Baichuan2 with BigDL-LLM
### CPU solution
#### Install with conda
```bash
conda create -n llm python=3.9
conda activate llm
pip install bigdl-llm[all]  # install bigdl-llm with the 'all' option
pip install transformers_stream_generator  # additional package required by Baichuan2
```
#### Run
```python
import torch
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load the model with INT4 optimizations
# (model_path: local path or Hugging Face id of the Baichuan2 checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, use_cache=True)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Generate predicted tokens
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```
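The snippet references `model_path`, `prompt`, and `args.n_predict` without defining them. One hypothetical way to supply them (the argument names and defaults below are illustrative, not part of BigDL-LLM):

```python
import argparse

parser = argparse.ArgumentParser(description='Baichuan2 INT4 demo')
parser.add_argument('--model-path', type=str, default='baichuan-inc/Baichuan2-7B-Chat',
                    help='local path or Hugging Face id of the Baichuan2 checkpoint')
parser.add_argument('--prompt', type=str, default='What is AI?')
parser.add_argument('--n-predict', type=int, default=32,
                    help='max number of new tokens to generate')
args = parser.parse_args()

model_path = args.model_path
prompt = args.prompt
```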
### GPU solution
#### Install with conda
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator  # additional package required by Baichuan2
```
#### Configure oneAPI environment variables
Windows:
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Linux:
```bash
source /opt/intel/oneapi/setvars.sh
```
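After sourcing the script, you can optionally confirm that the GPU is visible at the SYCL level; `sycl-ls` is a utility that ships with oneAPI:

```bash
sycl-ls   # should list the Intel GPU among the available devices
```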
For best performance on Intel Arc GPUs, it is recommended to set a couple of environment variables:
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
#### Run
```python
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load the model with INT4 optimizations and move it to the Intel GPU
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, use_cache=True)
model = model.to('xpu')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Generate predicted tokens
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    input_ids = input_ids.to('xpu')
    # The ipex-optimized model needs a warmup run before timings are accurate
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    # Actual inference
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    torch.xpu.synchronize()
    output = output.cpu()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```
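Because XPU execution is asynchronous, any timing must come after the warmup run and include `torch.xpu.synchronize()` before the clock stops. A minimal sketch of such a measurement, reusing the variables from the snippet above:

```python
import time

with torch.inference_mode():
    # Warmup run: excludes one-time compilation/caching costs from the timing
    model.generate(input_ids, max_new_tokens=args.n_predict)

    st = time.perf_counter()
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    torch.xpu.synchronize()   # wait for the GPU to finish before stopping the clock
    print(f'Inference time: {time.perf_counter() - st:.3f} s')
```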