probability tensor contains either `inf`, `nan` or element < 0

The error

After setting up the Qwen model, this error kept appearing and nothing I tried would fix it. The strangest part: on some 4090 machines the exact same code runs fine, while on other 4090s the freshly configured environment fails immediately. The full error is:

pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:110: assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/home/yida/PyCharmProject/My_Paper/CoT-RLFuse/Model/Architecture/clip_model.py", line 676, in <module>
    y, _, extracted_batch_output = net(vis, ir)
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yida/PyCharmProject/My_Paper/CoT-RLFuse/Model/Architecture/clip_model.py", line 627, in forward
    extracted_output = batch_generate_and_extract_features(
  File "/home/yida/PyCharmProject/My_Paper/CoT-RLFuse/Model/Architecture/clip_model.py", line 171, in batch_generate_and_extract_features
    outputs = model.generate(
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
  File "/home/yida/miniconda3/envs/Omni/lib/python3.10/site-packages/transformers/generation/utils.py", line 3349, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
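Because CUDA kernels execute asynchronously, a device-side assert like this can sometimes surface at a later, unrelated call. While debugging, a common trick is to force synchronous kernel launches so the Python traceback points at the exact failing operation. A minimal sketch (the file name debug_entry.py is just a placeholder for your own entry script):

# debug_entry.py -- hypothetical entry point; the only requirement is that the
# environment variable is set before the CUDA context is created.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # make every kernel launch synchronous

import torch  # imported after setting the variable, on purpose

# ... run the usual code here; when the device-side assert fires, the
# RuntimeError is raised at the real failing op rather than at a later
# synchronization point.

The same effect can be had by exporting CUDA_LAUNCH_BLOCKING=1 in the shell before launching the script; it slows everything down and is only meant for debugging.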


The fixes suggested in the official issue (none of which solved my problem):

Official issue: https://github.com/QwenLM/Qwen2.5-VL/issues/1033

Suggested fix 1: load the model in bfloat16 with flash_attention_2

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,              # bf16 instead of fp16
    device_map="auto",
    attn_implementation="flash_attention_2"  # requires the flash-attn package
)

Suggested fix 2: pin transformers below 4.50.0

pip install "transformers<4.50.0"

Suggested fix 3: change the PyTorch version

Depending on your current version, downgrade or upgrade PyTorch to 2.6.0.
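After several rounds of downgrading and upgrading it is easy to lose track of what is actually installed in the active environment; a quick sanity check might look like this (a minimal sketch, nothing here is Qwen-specific):

# Print the versions the current environment actually resolves, to confirm
# that a downgrade/upgrade of torch or transformers really took effect.
import torch
import transformers

print("torch:", torch.__version__)                # e.g. 2.6.0
print("CUDA (build):", torch.version.cuda)        # CUDA version this torch build targets
print("transformers:", transformers.__version__)  # e.g. below 4.50.0 if fix 2 was applied
print("bf16 supported:", torch.cuda.is_bf16_supported())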


Finding the real problem

No amount of version switching got me out of trouble; I eventually traced the problem to the downloaded model weights themselves being corrupted. Because model weights on HuggingFace cannot be downloaded reliably from within mainland China, most people fetch them through a HuggingFace mirror.

The problem lies with that mirror site (export HF_ENDPOINT=https://hf-mirror.com). Downloads through the mirror sometimes succeed and sometimes fail, occasionally hanging at 99% progress. My guess is that the mirror's checksum verification of the weight files is buggy, or the connection is simply unstable, so the downloaded weights end up corrupted and trigger exactly this error:

probability tensor contains either `inf`, `nan` or element < 0
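One way to confirm a corrupted download before blaming the code is to hash the downloaded weight shards and compare the digests against the SHA256 values the hub lists for each file (Hugging Face shows them under "Git LFS Details" on each file page). A minimal sketch, where the directory path is only an example and should point at wherever your weights actually landed:

# Compute the SHA-256 of every .safetensors shard under a local model directory
# so it can be compared against the checksums published on the model hub.
import hashlib
from pathlib import Path

model_dir = Path.home() / ".cache/huggingface/hub"  # example location; adjust to your setup

for shard in sorted(model_dir.rglob("*.safetensors")):
    digest = hashlib.sha256()
    with open(shard, "rb") as fp:
        for chunk in iter(lambda: fp.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    print(shard.name, digest.hexdigest())

A mismatching or truncated shard confirms that the download, not the environment, is the problem.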


The fix

Re-download the model weights. Since the hf-mirror.com mirror is unreliable, switch to downloading them with ModelScope instead.

Taking the model Qwen2.5-VL-3B-Instruct as an example:
https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct

# Install the ModelScope library
pip install modelscope

# Download the model
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct

# Or download only a specific file (here README.md) to a chosen directory ./dir
# modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct README.md --local_dir ./dir

# By default the weights end up in
#   ~/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-3B-Instruct

# Point the model-loading path in your code at that directory:
#   ~/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-3B-Instruct

Reloading the model from the freshly downloaded weights solved my problem; everything now runs without errors. Remember to update the loading path in your code to the new directory, as in the sketch below.
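As an illustration, loading from the ModelScope cache directory could look roughly like this (a sketch only; the bfloat16 / device_map options just mirror the earlier snippet and should be adapted to your setup):

import os
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Directory populated by `modelscope download`; expand ~ explicitly so the
# path works regardless of how the script is launched.
model_path = os.path.expanduser(
    "~/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-3B-Instruct"
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)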

