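Limiting the number of GPUs used

This part is simple and routine: just set the CUDA_VISIBLE_DEVICES environment variable.

Setting the environment variable in bash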

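export CUDA_VISIBLE_DEVICES=1,2    # GPU indices start at 0, so this exposes the second and third cards

Setting the environment variable in a Python script (do this before CUDA is initialized, ideally before importing torch)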

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
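
One detail that is easy to miss: once CUDA_VISIBLE_DEVICES is set, the process renumbers the visible cards from 0, so with the setting above `cuda:0` refers to physical GPU 1. The sketch below only parses the variable to show that mapping. The helper name `logical_to_physical` is made up for illustration, it does not query the driver, and it assumes plain integer indices (the variable also accepts GPU UUIDs, which are not handled here).

```python
import os


def logical_to_physical(visible=None):
    """Map the logical CUDA index seen by the process to the physical GPU index.

    Hypothetical helper: it only parses the CUDA_VISIBLE_DEVICES string.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [int(tok) for tok in visible.split(",") if tok.strip()]
    return dict(enumerate(ids))


os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
mapping = logical_to_physical()
print(mapping)  # {0: 1, 1: 2}
```

All GPU indices used later in this post (for example the second argument of set_per_process_memory_fraction) are these logical indices.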

Limiting GPU memory on a single card

Capping memory on a single card is also straightforward: use torch.cuda.set_per_process_memory_fraction, as follows:

import torch

# Memory cap for this process on the GPU, in GB (here 1 GB = 1024**3 bytes)
limits = 5

# Total memory of GPU 0
total_memory = torch.cuda.get_device_properties(0).total_memory    # bytes
total_mem = total_memory / 1024 / 1024 / 1024    # GB

# Apply the cap as a fraction of the card's total memory
ratio = limits / total_mem
torch.cuda.set_per_process_memory_fraction(ratio, 0)

Note that the second argument of set_per_process_memory_fraction is the GPU index.
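
The fraction passed to set_per_process_memory_fraction has to lie between 0 and 1, so it is worth guarding the division in case the requested cap exceeds what the card has. Below is a minimal sketch of that arithmetic in plain Python, without torch. The helper name `memory_fraction` and the 24 GB card size are assumptions for illustration only.

```python
GIB = 1024 ** 3  # bytes per GB as used in this post (1024**3)


def memory_fraction(limit_gb, total_bytes):
    """Return the fraction of total_bytes that corresponds to limit_gb, capped at 1.0."""
    if limit_gb <= 0:
        raise ValueError("limit_gb must be positive")
    total_gb = total_bytes / GIB
    return min(limit_gb / total_gb, 1.0)


total = 24 * GIB  # assumed card size, for illustration
ratio = memory_fraction(5, total)
print(f"{ratio:.4f}")  # 0.2083
```

On a real machine, `total` would come from torch.cuda.get_device_properties(i).total_memory, and `ratio` would then be passed to set_per_process_memory_fraction.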

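Limiting GPU memory across multiple cards

Capping memory across several cards raises one problem that the single-card case does not have: when each card is allowed only a small amount of memory, the model's parameters have to be spread across different cards. This is especially common with large models.

How the parameters are distributed across cards can be controlled through the device_map argument of the from_pretrained method on the model classes in transformers.

A typical device_map looks like this (taken from Handling big models for inference):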

device_map = {
    "transformer.wte": "cpu",
    "transformer.wpe": 0,
    "transformer.drop": "cpu",
    "transformer.h.0": "disk"
}

In other words, a device_map assigns different parts of the model to different devices.

The example above places them on the CPU, GPU 0, and disk.
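
Since a device_map is just a dictionary from module name to device, it is easy to inspect. The sketch below groups the entries of the example above by device, which is a quick way to see what ended up where. The helper name `group_by_device` is my own and is not part of transformers or accelerate.

```python
from collections import defaultdict


def group_by_device(device_map):
    """Group module names by the device they are assigned to."""
    groups = defaultdict(list)
    for module_name, device in device_map.items():
        groups[device].append(module_name)
    return dict(groups)


device_map = {
    "transformer.wte": "cpu",
    "transformer.wpe": 0,
    "transformer.drop": "cpu",
    "transformer.h.0": "disk",
}

placement = group_by_device(device_map)
for device, modules in placement.items():
    print(device, modules)
```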

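When a device_map spreads parameters across devices, keep in mind that some modules must not be split across devices, for example blocks with a residual connection, where the block's input is added to its output and both tensors therefore have to live on the same device.

So how do we find these unsplittable modules?

They are listed in the model object's _no_split_modules attribute (an attribute holding module class names, not a method).

And how do we get a model object without loading the weights? (See Huggingface Transformers+Accelerate多卡推理实践（指定GPU和最大显存）.)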

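import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# model_dir is the path or name of the model
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16, trust_remote_code=True)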

With the model object in hand, we can compute the actual device_map:

from accelerate import infer_auto_device_map

map_list = {0: "5GB", 1: "10GB"}    # memory cap for each GPU index
no_split_modules = model._no_split_modules
device_map = infer_auto_device_map(model, max_memory=map_list, no_split_module_classes=no_split_modules)
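
A note on the units in max_memory: as far as I can tell, Accelerate reads "GB" as 10**9 bytes and "GiB" as 2**30 bytes, while the fraction computed earlier divides by 1024**3. If the two caps should agree exactly, "GiB" is probably the safer suffix. The sketch below builds the max_memory dictionary from a list of limits and shows the difference between the two units. Both helper names (`build_max_memory`, `size_to_bytes`) are hypothetical, and the parser is a simplified stand-in for illustration, not Accelerate's own code.

```python
# Simplified unit table; "GiB" must be checked before "GB"-style suffixes are reached.
UNITS = {"GiB": 2 ** 30, "MiB": 2 ** 20, "GB": 10 ** 9, "MB": 10 ** 6}


def size_to_bytes(size):
    """Convert a string such as '5GB' or '5GiB' to a number of bytes."""
    for suffix, factor in UNITS.items():
        if size.endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized size string: {size!r}")


def build_max_memory(limits, unit="GiB"):
    """Build a max_memory dict {gpu_index: '<n><unit>'} from per-GPU limits."""
    return {i: f"{lim}{unit}" for i, lim in enumerate(limits)}


map_list = build_max_memory([5, 10])
print(map_list)  # {0: '5GiB', 1: '10GiB'}
print(size_to_bytes("5GB"), size_to_bytes("5GiB"))  # 5000000000 5368709120
```

The resulting dictionary can be passed as max_memory to infer_auto_device_map in place of the hand-written one.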

A sample device_map produced by infer_auto_device_map looks like this:

{
  "transformer.embedding": 0,
  "transformer.rotary_pos_emb": 0,
  "transformer.encoder.layers.0": 0,
  "transformer.encoder.layers.1": 0,
  ...
  "transformer.encoder.layers.8": 0,
  "transformer.encoder.layers.9": 1,
  "transformer.encoder.layers.10": 1,
  ...
  "transformer.encoder.layers.27": 1,
  "transformer.encoder.final_layernorm": 1,
  "transformer.output_layer": 1
}

If the no_split_module_classes argument of infer_auto_device_map is not specified, the device_map looks like this instead:

{
  "transformer.embedding": 0,
  "transformer.rotary_pos_emb": 0,
  "transformer.encoder.layers.0": 0,
  "transformer.encoder.layers.1": 0,
  ...
  "transformer.encoder.layers.8": 0,
  "transformer.encoder.layers.9.input_layernorm": 0,
  "transformer.encoder.layers.9.self_attention": 0,
  "transformer.encoder.layers.9.post_attention_layernorm": 0,
  "transformer.encoder.layers.9.mlp": 1,
  "transformer.encoder.layers.10": 1,
  ...
  "transformer.encoder.layers.27": 1,
  "transformer.encoder.final_layernorm": 1,
  "transformer.output_layer": 1
}

As you can see, encoder layer 9 (transformer.encoder.layers.9) is now split across two cards. The forward pass then fails with the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
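
This kind of split can be caught before running the model by checking the device_map itself. Here is a minimal sketch: it groups entries by encoder layer and reports any layer whose submodules sit on more than one device. The helper name `find_split_layers` and the regular expression are my own and assume the `transformer.encoder.layers.N` naming seen in the sample above; other models name their layers differently.

```python
import re
from collections import defaultdict

# Assumed layer naming, taken from the sample device_map in this post.
LAYER_RE = re.compile(r"^(transformer\.encoder\.layers\.\d+)(?:\.|$)")


def find_split_layers(device_map, layer_re=LAYER_RE):
    """Return {layer_name: [devices]} for layers placed on more than one device."""
    devices = defaultdict(set)
    for name, device in device_map.items():
        match = layer_re.match(name)
        if match:
            devices[match.group(1)].add(device)
    return {
        layer: sorted(devs, key=str)
        for layer, devs in devices.items()
        if len(devs) > 1
    }


device_map = {
    "transformer.encoder.layers.8": 0,
    "transformer.encoder.layers.9.input_layernorm": 0,
    "transformer.encoder.layers.9.self_attention": 0,
    "transformer.encoder.layers.9.post_attention_layernorm": 0,
    "transformer.encoder.layers.9.mlp": 1,
    "transformer.encoder.layers.10": 1,
}

split = find_split_layers(device_map)
print(split)  # {'transformer.encoder.layers.9': [0, 1]}
```

An empty result means no layer matching the pattern was split.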

Putting everything above together gives the following code:

import os

######### Limit which GPUs are used (set before CUDA is initialized)
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig


######### Limit GPU memory usage
limits = [8, 20]    # memory cap for each visible GPU, in GB
map_list = {}

for i, lim in enumerate(limits):
    # Total memory of visible GPU i
    total_memory = torch.cuda.get_device_properties(i).total_memory    # bytes
    total_mem = total_memory / 1024 / 1024 / 1024    # GB

    # Apply the cap as a fraction of the card's total memory
    ratio = lim / total_mem
    map_list[i] = f"{lim}GB"
    torch.cuda.set_per_process_memory_fraction(ratio, i)
    print(f"set cuda:{i} usage: {lim:2d}GB({ratio*100:7.2f}%)", flush=True)
torch.cuda.empty_cache()

######### Build the device_map
model_dir = "model_name_or_path_to_your_model"    # placeholder: replace with your own model path or name
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16, trust_remote_code=True)
no_split_modules = model._no_split_modules
print(f"no_split_modules: {no_split_modules}", flush=True)
device_map = infer_auto_device_map(model, max_memory=map_list, no_split_module_classes=no_split_modules)

######### Load the model according to the device_map
model = load_checkpoint_and_dispatch(model, checkpoint=model_dir, device_map=device_map)

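Postscript

The approach above works for open-source LLMs such as Baichuan, Llama2, and QWen. With ChatGLM, however, the model loads without problems but inference fails with the following error: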

Traceback (most recent call last):
  File "/data/llms/demo/chatglm2-6b-32k_cli_demo.py", line 103, in <module>
    main()
  File "/data/llms/demo/chatglm2-6b-32k_cli_demo.py", line 90, in main
    for response, history, past_key_values in model.stream_chat(tokenizer, query, history=history,
  File "/home/server/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b-32k/modeling_chatglm.py", line 1072, in stream_chat
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
  File "/home/server/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b-32k/modeling_chatglm.py", line 1157, in stream_generate
    outputs = self(
  File "/home/server/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/server/anaconda3/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b-32k/modeling_chatglm.py", line 946, in forward
    transformer_outputs = self.transformer(
  File "/home/server/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b-32k/modeling_chatglm.py", line 836, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/home/server/anaconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm2-6b-32k/modeling_chatglm.py", line 655, in forward
    presents = torch.cat((presents, kv_cache), dim=0)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

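This again looks like a problem caused by tensors sitting on different devices. I will come back to it when I have time. One reference that may help with it:

[BUG/Help] load_model_on_gpus从哪个utils引入的？ #757 (roughly: "Which utils module is load_model_on_gpus imported from?")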
