The error
While setting up the Qwen model I kept hitting the following error and could not get past it. Strangest of all, the same code ran fine on some 4090 machines, while on other 4090s a freshly configured environment failed immediately. The full error:
```
pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:110: assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/home/yida/PyCharmProject/My_Paper/CoT-RLFuse/Model/Architecture/clip_model.py", line 676, in <module>
    y, _, extracted_batch_output = net(vis, ir)
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yida/PyCharmProject/My_Paper/CoT-RLFuse/Model/Architecture/clip_model.py", line 627, in forward
    extracted_output = batch_generate_and_extract_features(
  File "/home/yida/PyCharmProject/My_Paper/CoT-RLFuse/Model/Architecture/clip_model.py", line 171, in batch_generate_and_extract_features
    outputs = model.generate(
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/transformers/generation/utils.py", line 3349, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
The fixes suggested in the official issue did not solve my problem either.
Official issue: https://github.com/QwenLM/Qwen2.5-VL/issues/1033
Fix 1

```python
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
```
Fix 2

```bash
pip install "transformers<4.50.0"
```
Fix 3
Downgrade or upgrade PyTorch to version 2.6.0.
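For reference, pinning PyTorch to 2.6.0 might look like this (a sketch: the `cu124` wheel tag is an assumption — pick the CUDA tag that matches your driver):

```shell
# check the currently installed version and its CUDA build first
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# pin PyTorch 2.6.0 from the official wheel index (cu124 shown as an example)
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
```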
Finding the cause
No amount of version switching got me out of this hole; I finally traced the problem to a corrupted model download. Because model weights on HuggingFace cannot be downloaded reliably from mainland China, most people fetch them through a HuggingFace mirror.
The problem lies with the mirror site (`export HF_ENDPOINT=https://hf-mirror.com`). Downloads through the mirror sometimes succeed and sometimes fail, occasionally hanging at 99%. My guess is that the site's checksum verification is buggy or the connection is unstable, so the downloaded weights end up corrupted and trigger this error:
probability tensor contains either `inf`, `nan` or element < 0
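A quick local sanity check for a suspect download is to hash each weight shard and compare the digest against the SHA-256 listed on the model's file page. A minimal sketch (the cache directory in the commented loop is hypothetical — substitute your own path):

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream-hash a file so multi-GB .safetensors shards never load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical model directory; compare each printed digest with the
# checksum shown on the model's file listing page.
# for shard in sorted(Path("/path/to/model").glob("*.safetensors")):
#     print(shard.name, sha256_of(shard))
```

A mismatched digest means the shard is corrupt and must be re-downloaded.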
The fix
Re-download the model weights. Since the hf-mirror.com mirror is unreliable, switch to downloading through ModelScope instead.
Taking Qwen2.5-VL-3B-Instruct as the example:
https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct
```bash
# install the ModelScope library
pip install modelscope

# download the model
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct

# or download to a specific directory ./dir
# modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct README.md --local_dir ./dir

# by default the weights are downloaded to:
# ~/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-3B-Instruct
```

Then point the model-loading code at the new path:
~/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-3B-Instruct
Reloading the model from the fresh download solved my problem; it now runs without errors. Remember to update the path in your code!
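The last step can be sketched like this (the cache root is ModelScope's default from the commands above; the actual `from_pretrained` call is left commented because it needs the full transformers/torch stack):

```python
from pathlib import Path

def modelscope_cache_path(repo_id: str) -> Path:
    """Default directory where `modelscope download` places a model's weights."""
    return Path.home() / ".cache" / "modelscope" / "hub" / "models" / repo_id

local_dir = modelscope_cache_path("Qwen/Qwen2.5-VL-3B-Instruct")
print(local_dir)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(str(local_dir), ...)
```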