centos-LLM-Bioinformatics-BioGPT-Usage 1

References:

GitHub - microsoft/BioGPT

BioGPT: generative pre-trained transformer for biomedical text generation and mining | Briefings in Bioinformatics | Oxford Academic

https://academic.oup.com/bib/article/23/6/bbac409/6713511

Environment:

CentOS 7, anaconda3, CUDA 11.6

Installation:

centos-LLM-Bioinformatics-BioGPT installation (CSDN blog)

https://blog.csdn.net/pxy7896/article/details/146982288

Official test cases
- Using Hugging Face
  - Text generation
Troubleshooting
- ModuleNotFoundError: No module named 'torch.distributed.tensor'
- A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.1

BioGPT is a GPT-architecture language model pre-trained on biomedical literature. It is suited to generating biomedical text and to related NLP tasks.

Official test cases

Using Hugging Face

Text generation

Notes:

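BioGptForCausalLM is a GPT-style causal language model, suited to:
- Text generation: e.g. question answering, summarization
- Biomedical text continuation: e.g. drafting diagnostic reports
BioGptForCausalLM uses a Transformer decoder-only architecture. Parameter counts: about 1.5B for BioGPT-Large and 347M for the base BioGPT.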

import os
# Use a mirror to speed up downloads from mainland China
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

from transformers import BioGptTokenizer, BioGptForCausalLM
from transformers import pipeline, set_seed
# Load the tokenizer (1. converts text into token IDs the model understands; 2. handles special tokens)
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
# Load the model (the pretrained weights are downloaded on first load)
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
'''
text = "Replace me by any text you'd like."
# 'pt' returns PyTorch tensors; 'tf' returns TensorFlow tensors; 'np' returns NumPy arrays
# Transformers supports PyTorch, TensorFlow and JAX, so the tensor format must be stated explicitly
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # forward pass; output.logits holds the next-token scores
'''
# Build a text-generation pipeline, which handles tokenization, the model call and output decoding
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
# Fix the random seed so that the generated results are reproducible
set_seed(42)
# Ask for 5 different sequences (the more sequences requested, the more memory is used)
# max_length=20 caps each sequence at 20 tokens, prompt included; do_sample=True enables random sampling
# The next token is drawn from the model's probability distribution (temperature defaults to 1.0),
# so results differ between runs unless the seed is fixed
# Compared with do_sample=False (greedy search), the output is more diverse
# To control the randomness, set temperature; lower values are more conservative
outputs = generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)
# Print the results
for i, output in enumerate(outputs):
    print(f"Result {i+1}: {output['generated_text']}")
'''
Result 1: COVID-19 is a disease that spreads worldwide and is currently found in a growing proportion of the population
Result 2: COVID-19 is one of the largest viral epidemics in the world.
Result 3: COVID-19 is a common condition affecting an estimated 1.1 million people in the United States alone.
Result 4: COVID-19 is a pandemic, the incidence has been increased in a manner similar to that in other
Result 5: COVID-19 is transmitted via droplets, air-borne, or airborne transmission.
'''
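
To make the role of temperature in random sampling concrete, here is a minimal pure-Python sketch. It does not call BioGPT: the vocabulary `toy_vocab`, the scores `toy_logits`, and the helper names are hypothetical, and the code only illustrates the general idea of temperature-scaled softmax sampling, not the exact code path inside Transformers.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index at random from the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical next-token scores for a 4-word toy vocabulary
toy_vocab = ["disease", "virus", "pandemic", "banana"]
toy_logits = [2.0, 1.5, 1.0, -1.0]

# Show how the probabilities flatten as the temperature rises
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

# A seeded generator makes the draws repeatable, in the spirit of set_seed(42)
rng = random.Random(42)
print([toy_vocab[sample_token(toy_logits, 0.7, rng)] for _ in range(5)])
```

At a low temperature most of the probability mass sits on the top-scoring token, so sampling behaves almost like greedy search. At a high temperature unlikely tokens such as "banana" are drawn more often.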

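Beam search:

Beam search is a search algorithm for sequence generation (text generation, machine translation, and so on). It costs more compute than greedy search, but it usually finds sequences with higher overall probability. The core idea: at each step, keep the top-K (K = num_beams) most likely candidate sequences instead of only the single best one (the greedy strategy).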
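
The difference from greedy search can be sketched with a toy example. The `NEXT` table below is a hypothetical stand-in for a language model (it maps the last token to next-token probabilities), and `beam_search` is an illustrative helper, not the Transformers implementation.

```python
import math

# Hypothetical toy "model": next-token probabilities given the last token
NEXT = {
    "<s>": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.3, "drives": 0.5, "turns": 0.2},
}

def beam_search(start, steps, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Rank every expansion and keep only the best num_beams
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

# num_beams=1 is exactly greedy search
for k in (1, 2):
    tokens, score = beam_search("<s>", steps=2, num_beams=k)[0]
    print(f"num_beams={k}: {' '.join(tokens[1:])} (p={math.exp(score):.2f})")
```

With one beam the search commits to "nice" (0.5) and ends at "nice woman" with probability 0.5 × 0.4 = 0.20. With two beams it also keeps "dog" alive and finds "dog has" with probability 0.4 × 0.9 = 0.36, a better sequence that greedy search never sees.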

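Parameter choices:

- num_beams: larger values usually give better results, at a higher compute cost. 5 is a common trade-off
- early_stopping: when True, the search stops as soon as num_beams finished candidates have been found
- min_length and max_length: bound the number of tokens in the output (prompt included), to keep generation from ending too early or running too long
- length_penalty: an exponent applied to sequence length when scoring beams. Larger values favor longer text, smaller values favor shorter text (default 1.0)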
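
The effect of length_penalty can be sketched as follows, assuming beams are ranked by total log-probability divided by `length ** length_penalty` (my understanding of how Transformers scores finished beams). The helper `beam_score` and the two hypotheses are hypothetical, chosen only for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalised score: total log-probability divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished hypotheses: (total log-probability, token count)
short_hyp = (-4.0, 5)
long_hyp = (-9.0, 15)

for lp in (0.0, 1.0, 2.0):
    s = beam_score(*short_hyp, length_penalty=lp)
    l = beam_score(*long_hyp, length_penalty=lp)
    winner = "long" if l > s else "short"
    print(f"length_penalty={lp}: short={s:.3f} long={l:.3f} -> {winner}")
```

Because log-probabilities are negative, dividing by a larger power of the length shrinks the penalty that a long sequence accumulates. With no normalisation (0.0) the short hypothesis wins, and from 1.0 upward the long one does.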

import torch
from transformers import BioGptTokenizer, BioGptForCausalLM, set_seed

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

sentence = "COVID-19 is"
inputs = tokenizer(sentence, return_tensors="pt")

set_seed(42)

with torch.no_grad():
    # generate() returns a tensor holding the generated token IDs
    beam_output = model.generate(**inputs, min_length=100, max_length=1024, num_beams=5, early_stopping=True)
# Decode the best beam back into text
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
'''
COVID-19 is a global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of coronavirus disease 2019 (COVID-19), which has spread to more than 200 countries and territories, including the United States (US), Canada, Australia, New Zealand, the United Kingdom (UK), and the United States of America (USA), as of March 11, 2020, with more than 800,000 confirmed cases and more than 800,000 deaths.
'''

Troubleshooting

ModuleNotFoundError: No module named 'torch.distributed.tensor'

Full error:

>>> from transformers import pipeline, set_seed
Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1967, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 49, in <module>
    from .audio_classification import AudioClassificationPipeline
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/pipelines/audio_classification.py", line 21, in <module>
    from .base import Pipeline, build_pipeline_init_args
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/pipelines/base.py", line 69, in <module>
    from ..modeling_utils import PreTrainedModel
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/modeling_utils.py", line 41, in <module>
    import torch.distributed.tensor
ModuleNotFoundError: No module named 'torch.distributed.tensor'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1955, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1969, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
No module named 'torch.distributed.tensor'

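The torch.distributed.tensor module only exists in recent PyTorch releases (to my knowledge it became a public module in PyTorch 2.5; earlier 2.x versions exposed it as the private torch.distributed._tensor), so it is absent from my PyTorch 1.12.0. The traceback shows the installed transformers importing it unconditionally in modeling_utils.py, which means this transformers release is too new for this PyTorch. Rather than upgrade PyTorch, I downgraded transformers to 4.28.1.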

pip install transformers==4.28.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

Verify:

>>> import torch
>>> print(torch.__version__)
1.12.0
>>> print(torch.distributed.is_available()) 
True
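
A more direct check is to ask whether the module from the traceback can be located at all. The sketch below uses only the standard library; `has_module` is a hypothetical helper name, and note that `importlib.util.find_spec` imports the parent packages (here torch and torch.distributed) while looking for the leaf module.

```python
import importlib.util

def has_module(name):
    """Return True if the dotted module name can be located (parent packages are imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example torch itself) is missing
        return False

if has_module("torch.distributed.tensor"):
    print("torch.distributed.tensor found")
else:
    print("torch.distributed.tensor missing: use an older transformers or a newer PyTorch")
```

On the PyTorch 1.12.0 environment described here the second branch is the expected one, which matches the ModuleNotFoundError above.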

A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.1

Full error:

>>> model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2560, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/transformers/modeling_utils.py", line 442, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/torch/_utils.py", line 138, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/torch/_utils.py", line 133, in _rebuild_tensor
    t = torch.tensor([], dtype=storage.dtype, device=storage._untyped().device)
/home/xxx/anaconda3/envs/biogpt/lib/python3.10/site-packages/torch/_utils.py:133: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at  /opt/conda/conda-bld/pytorch_1656352645774/work/torch/csrc/utils/tensor_numpy.cpp:68.)
  t = torch.tensor([], dtype=storage.dtype, device=storage._untyped().device)

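Cause

- The line "Failed to initialize NumPy: _ARRAY_API not found" in the log shows that this PyTorch build was compiled against the NumPy 1.x C API, while the environment has NumPy 2.0.1 installed
- Libraries with compiled extensions, such as PyTorch 1.12.0 here, were built against NumPy 1.x when they were released
- NumPy 2.0 changed the ABI (application binary interface), so extension modules compiled against 1.x cannot run on it without a rebuild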

Solution

Downgrade NumPy to a 1.x release.

pip install "numpy>=1.21,<2" -i https://pypi.tuna.tsinghua.edu.cn/simple
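
To confirm which NumPy series ended up installed without importing NumPy itself, a small standard-library check can help. This is a sketch: `major_version` and `numpy_is_1x` are hypothetical helper names, and the version parsing assumes a plain dotted version string such as "1.26.4".

```python
from importlib import metadata

def major_version(version):
    """Extract the leading integer of a version string such as '2.0.1'."""
    return int(version.split(".")[0])

def numpy_is_1x(version):
    """True when the given NumPy version string belongs to the 1.x series."""
    return major_version(version) < 2

try:
    installed = metadata.version("numpy")  # reads package metadata, no import of numpy
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("numpy is not installed in this environment")
elif numpy_is_1x(installed):
    print(f"numpy {installed}: matches extensions built against NumPy 1.x")
else:
    print(f"numpy {installed}: extensions built against NumPy 1.x may fail, consider 'numpy<2'")
```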
