Adapting Large Models to Domestic Hardware, Part 6: Quickly Validating ChatGLM3-6B/Baichuan2-7B Inference on Ascend 910B

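ChatGPT's breakout success ushered in the era of large AI models, and AI compute has grown increasingly scarce as a result. At the same time, the China-US trade war and US sanctions on AI chips for China have made adapting to domestic AI compute imperative. I have previously shared some notes on domestic AI chips and AI frameworks.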

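Huawei currently offers two solutions for large model training and inference on Ascend 910B: one based on the MindSpore framework (MindFormers, MindSpore Lite, etc.) and one based on the PyTorch framework (ModelZoo-PyTorch, AscendSpeed, etc.). This article uses the MindSpore-based solution to run inference with ChatGLM3-6B and Baichuan2-7B on Ascend 910B. The code is available on GitHub: llm-action.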

Introduction to MindSpore/MindFormers

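MindSpore is Huawei's open-source, new-generation AI computing framework; I will not repeat the official terminology here. MindFormers is positioned as an end-to-end large model toolkit covering training -> fine-tuning -> deployment (similar to PaddleNLP for PaddlePaddle).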

Introduction to MindSpore Lite

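To deploy a fine-tuned large model with better performance, you can use MindSpore Lite (mindspore_lite), the inference engine built by MindSpore. It provides an out-of-the-box inference deployment solution for large model services.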

Lite inference takes roughly two steps: convert the weights and export MindIR -> run Lite inference.

Environment Setup

OS version/architecture: Ubuntu 22.04.3 LTS/aarch64
NPU: 8 x 910B3 64G (A800 9000 A2)
Python: 3.9
NPU driver: 23.0.rc3
NPU firmware: 6.4.0.4.220
CANN toolkit: 7.0.RC1
MindSpore: 2.2.0
MindFormers: dev (commit id: a822b47c)
MindSpore Lite: 2.2.0

+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc3                 Version: 23.0.rc3                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 93.5        36                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          21564/ 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 84.8        35                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          4155 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 84.8        36                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          4155 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 94.2        36                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          4155 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 89.4        41                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          4154 / 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 90.3        41                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          4154 / 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 103.6       41                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          4154 / 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 84.8        40                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          4154 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 981040        | python                   | 17309                   |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+

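I will not go through the NPU driver and firmware, the CANN software, or the GCC upgrade in detail; see my earlier articles. To avoid interfering with the original environment, create a new virtual environment and then install the dependencies: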

Install MindSpore:

pip install sympy absl-py
# The te and hccl wheels are under /usr/local/Ascend/ascend-toolkit/latest/lib64
pip install te-0.4.0-py3-none-any.whl
pip install hccl-0.1.0-py3-none-any.whl

pip install mindspore-2.2.0-cp39-cp39-linux_aarch64.whl

Install torch and transformers for model format conversion:

pip install torch==2.0.0 transformers==4.30.2

pip install numpy sentencepiece ftfy regex tqdm \
pyyaml rouge_chinese nltk jieba datasets gradio==3.23.0 pandas openpyxl et-xmlfile mdtex2html
pip install tokenizers==0.13.3

Install mindformers and mindspore-lite for model inference:

pip install mindpet==1.0.2 opencv-python-headless
pip install tokenizers==0.15.0

cd mindformers
git checkout a822b47c
python setup.py install

pip install pyarrow==12.0.1 tokenizers==0.15.0
pip install mindspore_lite-2.2.0-cp39-cp39-linux_aarch64.whl

Note: with the current versions, model conversion needs tokenizers==0.13.3, while model inference needs tokenizers==0.15.0.
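
Because the two stages need different tokenizers versions, it can help to check the installed version before running either one. The following is a minimal sketch, not part of the original scripts: the stage names "convert" and "infer" and the helper name are made up for illustration, and only the two version numbers come from the note above.

```python
from importlib import metadata

# Version numbers from the note above; the stage labels are hypothetical.
REQUIRED_TOKENIZERS = {"convert": "0.13.3", "infer": "0.15.0"}


def tokenizers_ok(stage, installed=None):
    """Return True if the tokenizers version matches what the stage expects.

    If `installed` is None, the version is read from the current environment.
    """
    if installed is None:
        try:
            installed = metadata.version("tokenizers")
        except metadata.PackageNotFoundError:
            return False  # tokenizers is not installed at all
    return installed == REQUIRED_TOKENIZERS[stage]


if __name__ == "__main__":
    for stage in REQUIRED_TOKENIZERS:
        print(stage, tokenizers_ok(stage))
```

Calling tokenizers_ok("convert") at the top of the conversion script and tokenizers_ok("infer") at the top of the inference script would surface a mismatch early instead of partway through a run.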

chatglm3-6b

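ChatGLM3 is a new generation of pre-trained dialogue models jointly released by Zhipu AI and the KEG Lab at Tsinghua University. ChatGLM3-6B is the open-source model in the ChatGLM3 series. While keeping many strengths of the previous two generations, such as fluent dialogue and a low deployment barrier, ChatGLM3-6B adds a more powerful base model, more complete feature support, and a more comprehensive open-source series. GitHub: ChatGLM3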

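Model Weight Conversion

Convert the model weights from HF format to MindSpore format. Alternatively, you can download the officially converted MindSpore weights directly.

Merging the model weights (model_weigth_merge.py):

Merge the multiple HF-format weight files into a single weight file.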

from transformers import AutoTokenizer, AutoModel
import torch

path = "/root/workspace/model/chatglm3-6b/"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(path, trust_remote_code=True)

with open("pt_model_arch.txt", "w") as fp:
    print(model, file=fp, flush=True)
with open("pt_ckpt.txt", "w") as fp:
    for name, param in model.named_parameters():
        fp.write(f"{name} {param.shape} {param.dtype}\n")
torch.save(model.state_dict(), path+"glm3_6b.pth")

Converting the model weights (model_weight_convert.py):

Convert the HF model weights to the MindSpore weight format.

import mindspore as ms
import torch as pt
from tqdm import tqdm


path = "/root/workspace/model/chatglm3-6b/"
path_ms = "/root/workspace/model/chatglm3-6b_ms/"


pt_ckpt_path = path+"glm3_6b.pth"
pt_param = pt.load(pt_ckpt_path)

type_map = {"torch.float16": "ms.float16",
            "torch.float32": "ms.float32"}
ms_param = []
with open("check_pt_ckpt.txt", "w") as fp:
    for k, v in tqdm(pt_param.items()):
        if "word_embeddings.weight" in k:
            k = k.replace("word_embeddings.weight", "embedding_table")
        fp.write(f"{k} {v.shape} {v.dtype}\n")
        ms_param.append({"name": k, "data": ms.Tensor(v.numpy())})

ms.save_checkpoint(ms_param, path_ms+"glm3_6b.ckpt")
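
The only renaming the conversion script performs is on the embedding table; every other parameter keeps its HF name. As a sketch for checking that mapping without loading any weights, the rule can be isolated in a pure function. The helper names below are hypothetical, and the sample parameter names in the usage note are illustrative rather than read from a real checkpoint.

```python
def to_ms_name(hf_name):
    """Map an HF parameter name to the name written to the MindSpore checkpoint.

    Mirrors the conversion script: only "word_embeddings.weight" is renamed.
    """
    return hf_name.replace("word_embeddings.weight", "embedding_table")


def renamed_params(hf_names):
    """Return {old_name: new_name} for the parameters whose name changes."""
    return {n: to_ms_name(n) for n in hf_names if to_ms_name(n) != n}
```

Running renamed_params over the names dumped to pt_ckpt.txt gives a quick list of what the checkpoint will call each renamed tensor, which is useful when a load later reports missing or unexpected keys.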

MindSpore Inference

MindSpore inference needs the tokenizer model, the weights, and the configuration file:

/root/workspace/model/chatglm3-6b_ms
├── glm3_6b.ckpt
├── run_glm3_6b.yaml
├── tokenizer.model
└── tokenizer_config.json

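Inference script (model_infer.py):

The model can be instantiated in either of the following two ways; pick one:

Instantiate directly from the default configuration
Instantiate after customizing the configuration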

import mindspore as ms
from mindformers import AutoConfig, AutoModel, AutoTokenizer

# Use graph mode and select the NPU device id
ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend", device_id=0)

tokenizer = AutoTokenizer.from_pretrained('/root/workspace/model/chatglm3-6b_ms')

"""
# The model can be instantiated in either of two ways; pick one
# 1. Instantiate directly from the default configuration
model = AutoModel.from_pretrained('/root/workspace/model/chatglm3-6b_ms')

"""
# 2. Instantiate after customizing the configuration
config = AutoConfig.from_pretrained('/root/workspace/model/chatglm3-6b_ms/run_glm3_6b.yaml')
config.use_past = True                  # override the default: incremental inference speeds up generation
config.seq_length = 2048                # adjust other model settings as needed
config.checkpoint_name_or_path = "/root/workspace/model/chatglm3-6b_ms/glm3_6b.ckpt"

model = AutoModel.from_config(config)   # instantiate the model from the customized config


role="user"

inputs_list=["你好", "请介绍一下华为"]

for input_item in inputs_list:
    history=[]
    inputs = tokenizer.build_chat_input(input_item, history=history, role=role)
    inputs = inputs['input_ids']
    # The first model.generate() call includes graph compilation time, so its timing is
    # not representative; call it repeatedly to measure inference performance
    outputs = model.generate(inputs, do_sample=False, top_k=1, max_length=2048)
    for i, output in enumerate(outputs):
        output = output[len(inputs[i]):]
        response = tokenizer.decode(output)
        print(response)
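
Since the first generate() call includes graph compilation, a timing measurement should discard it. Below is a minimal sketch of such a measurement; time_generate is a hypothetical helper written for this article, and it takes any callable so it does not depend on a particular MindFormers API.

```python
import time


def time_generate(generate_fn, inputs, warmup=1, repeats=3):
    """Return the mean wall-clock seconds of `repeats` timed calls.

    The first `warmup` calls are executed but not timed, because they
    include one-off costs such as graph compilation.
    """
    for _ in range(warmup):
        generate_fn(inputs)
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(inputs)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```

With the script above, one possible call is time_generate(lambda ids: model.generate(ids, do_sample=False, top_k=1, max_length=2048), inputs).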

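MindSpore Lite Inference

MindSpore Lite inference takes roughly two steps: convert the weights and export them in MindIR format -> run inference with MindSpore Lite.

1. Exporting MindIR:

Modify the model configuration file (configs/glm3/export_glm3_6b.yaml):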

# export
infer:
    prefill_model_path: "glm3_export/glm3_6b_bs4_seq2048_20231227/prefill.mindir" # where the prefill MindIR is saved
    increment_model_path: "glm3_export/glm3_6b_bs4_seq2048_20231227/inc.mindir"   # where the incremental MindIR is saved
    infer_seq_length: 2048 # must match model.model_config.seq_length
    model_type: mindir

# ==== model config ====
model:
  model_config:
    type: ChatGLM2Config
    batch_size: 4   # only for incremental infer
    num_layers: 28
    padded_vocab_size: 65024
    hidden_size: 4096
    ffn_hidden_size: 13696
    kv_channels: 128
    num_attention_heads: 32
    seq_length: 2048
    hidden_dropout: 0.0
    attention_dropout: 0.0
    layernorm_epsilon: 1e-5
    rmsnorm: True
    apply_residual_connection_post_layernorm: False
    post_layer_norm: True
    add_bias_linear: False
    add_qkv_bias: True
    bias_dropout_fusion: True
    multi_query_attention: True
    multi_query_group_num: 2
    apply_query_key_layer_scaling: True
    attention_softmax_in_fp32: True
    fp32_residual_connection: False
    quantization_bit: 0
    pre_seq_len: None
    prefix_projection: False
    param_init_type: "float16"
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    use_past: True
    use_flash_attention: False # when use FlashAttention, seq_length should be multiple of 16
    eos_token_id: 2
    pad_token_id: 0
    repetition_penalty: 1.0
    max_decode_length: 256
    checkpoint_name_or_path: "/root/workspace/model/chatglm3-6b_ms/glm3_6b.ckpt"
    top_k: 1
    top_p: 1
    do_sample: True
  arch:
    type: ChatGLM2ForConditionalGeneration

trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'glm3_6b'

Run the command:

python mindformers/tools/export.py --config_path configs/glm3/export_glm3_6b.yaml

2. MindSpore Lite inference

Add an inference configuration file (chatglm3-lite.ini):

[ascend_context]
provider=ge

[ge_session_options]
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype

Run the command:

python run_infer_main.py \
--device_id 0 \
--model_name glm3_6b \
--tokenizer_path /root/workspace/model/chatglm3-6b_ms/tokenizer.model \
--prefill_model_path glm3_export/glm3_6b_bs4_seq2048_20231227/prefill_graph.mindir \
--increment_model_path glm3_export/glm3_6b_bs4_seq2048_20231227/inc_graph.mindir \
--config_path chatglm3-lite.ini \
--is_sample_acceleration False \
--seq_length 2048 \
--add_special_tokens True

baichuan2-7b

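Baichuan2 is an open-source, commercially usable large-scale pre-trained language model developed by Baichuan Intelligence. It is based on the Transformer architecture, supports both Chinese and English, and has a context window of 4096 tokens. Baichuan2-7B and Baichuan2-13B are currently supported, with 7 billion and 13 billion parameters respectively. GitHub: Baichuan2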

Model Weight Conversion

Convert the model weights from HF format to MindSpore format.

cd /root/mindformers
python ./research/baichuan/convert_weight.py --torch_ckpt_dir /root/workspace/model/Baichuan2-7B-Chat --mindspore_ckpt_path transform.ckpt

MindSpore Inference

Inference with generate:

Inference script (run_baichuan2_generate.py):

from mindspore import context
from mindformers.pipeline import pipeline
from mindformers.models import LlamaConfig
from mindformers import MindFormerConfig

from baichuan2_7b import Baichuan7BV2ForCausalLM
from baichuan2_13b import Baichuan13BV2ForCausalLM
from baichuan2_tokenizer import Baichuan2Tokenizer

model_dict = {
    "baichuan2_7b": Baichuan7BV2ForCausalLM,
    "baichuan2_13b": Baichuan13BV2ForCausalLM,
}

inputs = ["<reserved_106>你是谁？<reserved_107>",
          "<reserved_106>《静夜思》作者是？<reserved_107>",
          "<reserved_106>白日依山尽，下一句是？<reserved_107>"]
batch_size = len(inputs)

# init model
baichuan2_config_path = "/root/mindformers/research/baichuan2/run_baichuan2_7b.yaml"
baichuan2_config = MindFormerConfig(baichuan2_config_path)

baichuan2_config.model.model_config.batch_size = batch_size
baichuan2_model_config = LlamaConfig(**baichuan2_config.model.model_config)
model_name = baichuan2_config.trainer.model_name
baichuan2_network = model_dict[model_name](
    config=baichuan2_model_config
)

# init tokenizer
tokenizer = Baichuan2Tokenizer(
    vocab_file=baichuan2_config.processor.tokenizer.vocab_file
)

# predict using generate
inputs_ids = tokenizer(inputs, max_length=64, padding="max_length")["input_ids"]
outputs = baichuan2_network.generate(inputs_ids,
                                     do_sample=False,
                                     top_k=1,
                                     top_p=1.0,
                                     repetition_penalty=1.05,
                                     temperature=1.0,
                                     max_length=64)
for output in outputs:
    print(tokenizer.decode(output))

Configuration file (run_baichuan2_7b.yaml):

...

model:
  model_config:
    type: LlamaConfig
    batch_size: 1 # add for increase predict
    seq_length: 512
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 125696
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 0
    ignore_token_id: -100
    user_token_id: 195
    assistant_token_id: 196
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float32"
    rotary_dtype: "float32"
    param_init_type: "float16"
    use_past: False
    compute_in_2d: False
    use_flash_attention: False
    offset: 0
    checkpoint_name_or_path: "/root/workspace/model/Baichuan2-7B-Chat/transform.ckpt"
    repetition_penalty: 1.05
    temperature: 1.0
    max_decode_length: 512
    top_k: 5
    top_p: 0.85
    do_sample: True
    max_new_tokens: 64
  arch:
    type: Baichuan7BV2ForCausalLM

...

processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '<unk>'
    type: Baichuan2Tokenizer
    vocab_file: '/root/workspace/model/baichuan2-7b/tokenizer.model'
  type: LlamaProcessor
  
...

Run the command:

cd /root/mindformers
python ./research/baichuan2/run_baichuan2_generate.py

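MindSpore Lite Inference

MindSpore Lite inference takes roughly two steps: convert the weights and export them in MindIR format -> run inference with MindSpore Lite.

1. Exporting MindIR:

Weight and configuration files (/root/workspace/model/baichuan2-7b):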

.
├── baichuan2_7b_910b_export_mindir.yaml
└── transform.ckpt

Modify the model configuration file (baichuan2_7b_910b_export_mindir.yaml):

infer:
  prefill_model_path: "baichuan2_7b_export/baichuan2_7b_prefill.mindir"
  increment_model_path: "baichuan2_7b_export/baichuan2_7b_inc.mindir"
  infer_seq_length: 512
  model_type: mindir

# model config
model:
  model_config:
    type: LlamaConfig
    batch_size: 1 # add for increase predict
    seq_length: 512
    hidden_size: 4096
    num_layers: 32
    num_heads: 32
    vocab_size: 125696
    multiple_of: 256
    rms_norm_eps: 1.0e-6
    bos_token_id: 1
    eos_token_id: 2
    pad_token_id: 0
    ignore_token_id: -100
    user_token_id: 195
    assistant_token_id: 196
    compute_dtype: "float16"
    layernorm_compute_type: "float32"
    softmax_compute_type: "float32"
    rotary_dtype: "float32"
    param_init_type: "float16"
    use_past: True
    is_sample_acceleration: False
    compute_in_2d: True
    use_flash_attention: False
    offset: 0
    checkpoint_name_or_path: "/root/workspace/model/baichuan2-7b/transform.ckpt"
    repetition_penalty: 1.05
    temperature: 1.0
    max_decode_length: 512
    top_k: 5
    top_p: 0.85
    do_sample: True
    max_new_tokens: 64
  arch:
    type: Baichuan7BV2ForCausalLM

# trainer config
trainer:
  type: CausalLanguageModelingTrainer
  model_name: 'baichuan2_7b'

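Parameter description:

infer:

prefill_model_path: path of the full (prefill) graph
increment_model_path: path of the incremental graph
infer_seq_length: inference sequence length
model_type: inference model type

model.model_config:

batch_size: set to 1 for single-batch inference, or to the batch size for multi-batch inference
is_sample_acceleration: post-processing acceleration switch; not yet supported by baichuan2, so set it to False

Note: batch_size must match the batch_size used by export.py; otherwise single-card inference with the exported MindIR graph may run out of memory. Post-processing acceleration is controlled by the is_sample_acceleration switch, which must match the setting in the inference script.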
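
These consistency requirements are easy to check mechanically before exporting. The sketch below assumes the export YAML has already been parsed into nested dicts (for example with yaml.safe_load); check_export_config is a hypothetical helper, and treating the inference-time seq_length as something that must also match is an assumption based on the commands in this article.

```python
def check_export_config(cfg, infer_batch_size, infer_seq_length):
    """Return a list of mismatches between an export config and inference settings.

    cfg: the export YAML parsed into nested dicts.
    infer_batch_size / infer_seq_length: values used when running inference.
    An empty list means no mismatch was found.
    """
    model_cfg = cfg["model"]["model_config"]
    problems = []
    if cfg["infer"]["infer_seq_length"] != model_cfg["seq_length"]:
        problems.append("infer.infer_seq_length != model_config.seq_length")
    if model_cfg["batch_size"] != infer_batch_size:
        problems.append("model_config.batch_size != inference batch size")
    if model_cfg["seq_length"] != infer_seq_length:
        problems.append("model_config.seq_length != inference seq_length")
    return problems
```

For the Baichuan2 configuration above, check_export_config(cfg, 1, 512) should return an empty list.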

Run the command:

python mindformers/tools/export.py --model_dir /root/workspace/model/baichuan2-7b

2. Model inference

Add an inference configuration file (baichuan-ge.cfg):

[ascend_context]
provider=ge

[ge_session_options]
ge.externalWeight=1
ge.exec.atomicCleanPolicy=1
ge.event=notify
ge.exec.staticMemoryPolicy=2
ge.exec.formatMode=1
ge.exec.precision_mode=must_keep_origin_dtype

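Parameter description:

provider=ge: use the GE interface
ge.externalWeight=1: save the weights of the network's Const/Constant nodes in separate files
ge.exec.atomicCleanPolicy=1: do not clean up the memory used by atomic operators in the network all at once
ge.exec.staticMemoryPolicy=2: run the network with dynamically expanding memory
ge.exec.precision_mode=must_keep_origin_dtype: select the operator precision mode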
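
If the configuration file is generated from a script rather than written by hand, note that Python's configparser lowercases option names by default, which would turn ge.externalWeight into ge.externalweight. The sketch below keeps the original case; build_ge_config is a hypothetical helper, and that the GE option names are case-sensitive is an assumption, not something the original documentation states here.

```python
import configparser
import io


def build_ge_config(session_options):
    """Render an ini-style GE config as text, preserving option-name case."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # default would lowercase keys such as ge.externalWeight
    parser["ascend_context"] = {"provider": "ge"}
    parser["ge_session_options"] = session_options
    buf = io.StringIO()
    # Write "key=value" without spaces, matching the file shown above.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()
```

The returned text can then be written to baichuan-ge.cfg with an ordinary file write.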

Run the command:

python run_infer_main.py \
--device_id 0 \
--model_name baichuan2_7b \
--seq_length 512 \
--tokenizer_path /root/workspace/model/baichuan2-7b/tokenizer.model \
--prefill_model_path baichuan2_7b_export/baichuan2_7b_prefill_graph.mindir \
--increment_model_path baichuan2_7b_export/baichuan2_7b_inc_graph.mindir \
--config_path baichuan-ge.cfg \
--do_sample False \
--top_k 1 \
--top_p 1.0 \
--repetition_penalty 1.0 \
--temperature 1.0 \
--max_length 512 \
--is_sample_acceleration False \
--add_special_tokens False \
--stream False

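Parameter description:

device_id: physical device ID
model_name: model name
seq_length: inference sequence length
tokenizer_path: path to the model tokenizer
prefill_model_path: path of the full (prefill) graph
increment_model_path: path of the incremental graph
config_path: path to the GE configuration file
do_sample: whether to sample from the candidate token ids
top_k: keep the top_k token ids as candidates
top_p: keep as candidates the token ids whose cumulative probability is below top_p
repetition_penalty: penalty factor applied to generated tokens; a value of 1 disables it
temperature: temperature coefficient, used to adjust the probability of the next token
max_length: maximum length of the generated sequence
is_sample_acceleration: post-processing acceleration switch; not yet supported by baichuan2, so set it to False
add_special_tokens: whether to add special tokens when tokenizing the input
stream: whether to return results in streaming mode
prompt: prompt content added to the input; optional for Baichuan2, which then uses the default prompt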
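
To make the interaction of temperature, top_k and top_p concrete, here is a generic pure-Python sketch of how the candidate set can be built from raw logits. It illustrates the common technique only; the function name is hypothetical and the actual implementation in run_infer_main.py or MindFormers may differ in detail (for example in how top_k and top_p are combined).

```python
import math


def candidate_probs(logits, top_k=0, top_p=1.0, temperature=1.0):
    """Return {token_id: probability} for the tokens kept as sampling candidates."""
    # Temperature rescales the logits before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    # Sort tokens by probability, highest first.
    ranked = sorted(((p / total, i) for i, p in enumerate(exps)), reverse=True)
    # top_k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}
```

With top_k=1 only the most likely token survives, so sampling from the result is equivalent to greedy decoding, which matches the do_sample False, top_k 1 settings in the command above.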

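Model Inference Performance Testing

Before deploying a model to production, we also need to test its inference performance. For current text-generation large models, inference consists of two main phases:

Prefill phase: the tokens in the input prompt are processed in parallel.
Decoding phase: the text is generated one token at a time, autoregressively. Each generated token is appended to the input and fed back into the model to produce the next token. Generation stops when the LLM outputs a special stop token or a user-defined condition is met (for example, a maximum number of tokens has been generated).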
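
The two phases can be sketched as a toy greedy loop. This is an illustration only: next_token_fn stands in for the model and is a hypothetical callable, there is no KV cache, and max_new_tokens is assumed to be at least 1.

```python
def generate(next_token_fn, prompt_ids, eos_id, max_new_tokens):
    """Greedy autoregressive generation with a stand-in model.

    next_token_fn(tokens) returns the next token id for the given sequence.
    """
    tokens = list(prompt_ids)
    # Prefill: the whole prompt is processed in one call, yielding the first token.
    new_token = next_token_fn(tokens)
    generated = [new_token]
    # Decoding: append each new token and feed the sequence back in,
    # until the stop token appears or the token budget is used up.
    while new_token != eos_id and len(generated) < max_new_tokens:
        tokens.append(new_token)
        new_token = next_token_fn(tokens)
        generated.append(new_token)
    return generated
```

In a real engine the prefill call and the per-token decode calls correspond to the separate prefill and incremental graphs exported earlier.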

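A token can be a word or a subword; the exact rules for splitting text into tokens vary from model to model. For example, one can compare how LLaMA models and OpenAI models tokenize the same text. Although LLM inference providers often describe performance with token-based metrics (for example, tokens processed per second), these numbers are not always comparable across model types because tokenization rules differ. For instance, the Anyscale team found that LLaMA 2 tokenization is 19% longer than ChatGPT tokenization (although the overall cost is still much lower). Researchers at HuggingFace likewise found that, for the same amount of text, LLaMA 2 needs about 20% more tokens to train on than GPT-4.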

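So how should the inference speed of an LLM serving system be measured accurately? Common metrics are:

Time To First Token (TTFT): how long after the user submits a query the model produces the first output token. Low response latency matters in real-time interaction but less so in offline workloads. This metric is driven by the time needed to process the prompt and generate the first output token.
Time Per Output Token (TPOT): the time needed to generate one output token for each user querying the system. This metric corresponds to how each user perceives the "speed" of the model. For example, a TPOT of 100 ms/token means 10 tokens per second per user, or about 450 words per minute, which is faster than a typical person can read.
End-to-end latency: the total time the model takes to generate the full response for a user. It can be computed from the two metrics above: latency = (TTFT) + (TPOT) * (number of tokens to be generated).
Throughput: the number of output tokens per second the inference server can generate across all users and requests.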
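
As a minimal sketch, these metrics can be computed from the timestamps at which output tokens arrive. The function names are hypothetical. TPOT is averaged here over the tokens after the first one, so the end-to-end latency equals TTFT + TPOT * (n - 1), which is close to the formula above for long outputs.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, TPOT and end-to-end latency for one request.

    request_start: timestamp (seconds) when the request was sent.
    token_times: timestamps (seconds) at which each output token arrived.
    """
    n = len(token_times)
    ttft = token_times[0] - request_start
    # Average gap between consecutive output tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    e2e = token_times[-1] - request_start
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}


def throughput(total_output_tokens, wall_seconds):
    """Output tokens per second across all requests in the measured window."""
    return total_output_tokens / wall_seconds
```

In streaming mode, token_times can be collected by recording time.perf_counter() each time a new token is received.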

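Note: there is a trade-off between throughput and time per output token. If we process 16 user queries concurrently, throughput is higher than when running them sequentially, but each user waits longer for their output tokens.

We can therefore use the metrics above to test inference performance. See the code on GitHub (llm-action) for details.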

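Summary

This article briefly described how to run inference with ChatGLM3-6B and Baichuan2-7B on Ascend 910B. I have to say the official documentation is now much more detailed than it was a few months ago.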


References

chatglm3: gitee.com/mindspore/m...
baichuan2: gitee.com/mindspore/m...