[Error] MDT error: CUDA error: device-side assert triggered

The following error occurred while running MDT:

Traceback (most recent call last):
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 100, in <module>
    main()
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 46, in main
    TrainLoop(
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 178, in run_loop
    self.run_step(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 192, in run_step
    self.forward_backward(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 231, in forward_backward
    losses = compute_losses()
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 444, in forward
    y = self.y_embedder(y, self.training) # (N, D)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 191, in forward
    embeddings = self.embedding_table(labels)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
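The traceback ends in the label embedding lookup, so a plausible cause is a class label that falls outside the embedding table. The sketch below shows how such labels could arise from file names. It is a sketch under assumptions, not MDT's actual loader code: it assumes the class name is taken from the part of the file name before the first underscore, and `derive_labels`, `out_of_range`, and the `num_classes` value are hypothetical names chosen for illustration.

```python
import os


def derive_labels(filenames):
    """Assign an integer class id per file.

    Assumption: the class name is the part of the base name
    before the first underscore (not confirmed against MDT).
    """
    class_names = [os.path.basename(f).split("_")[0] for f in filenames]
    class_to_id = {name: i for i, name in enumerate(sorted(set(class_names)))}
    return [class_to_id[name] for name in class_names]


def out_of_range(labels, num_classes):
    """Return the labels that an embedding table with num_classes rows cannot index."""
    return [y for y in labels if not 0 <= y < num_classes]


# Files without a class prefix: every file becomes its own "class".
unprefixed = ["cat/0001.jpg", "cat/0002.jpg", "dog/0003.jpg"]
bad_labels = derive_labels(unprefixed)
print(bad_labels)                   # [0, 1, 2]
print(out_of_range(bad_labels, 2))  # [2]

# Files named "foldername_imagename": two classes, labels stay in range.
prefixed = ["cat/cat_0001.jpg", "cat/cat_0002.jpg", "dog/dog_0003.jpg"]
good_labels = derive_labels(prefixed)
print(good_labels)                   # [0, 0, 1]
print(out_of_range(good_labels, 2))  # []
```

With many unprefixed images the number of distinct "classes" grows with the number of files, which would quickly exceed the class count the model was configured for.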

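The fix is to rename every file in each folder to the form "foldername_imagename". The traceback ends in the label embedding lookup (self.embedding_table(labels)), and a device-side assert there usually means a label index fell outside the embedding table. Presumably the data loader takes the class name from the file name prefix, so files without that prefix produce labels the model does not expect.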
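A minimal sketch of that renaming step, assuming a dataset root that contains one subfolder per class with the image files directly inside it. The function names and the `dry_run` flag are hypothetical and not part of MDT.

```python
import os


def prefixed_name(folder_name, file_name):
    """Return file_name in the form "foldername_imagename" (unchanged if already prefixed)."""
    prefix = folder_name + "_"
    if file_name.startswith(prefix):
        return file_name
    return prefix + file_name


def rename_with_folder_prefix(root, dry_run=True):
    """Rename root/<folder>/<image> to root/<folder>/<folder>_<image>.

    Returns the list of (source, destination) pairs. With dry_run=True
    nothing is renamed, so the plan can be inspected first.
    """
    planned = []
    for folder_name in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder_name)
        if not os.path.isdir(folder_path):
            continue
        # sorted(...) takes a snapshot, so renamed files are not visited again.
        for file_name in sorted(os.listdir(folder_path)):
            src = os.path.join(folder_path, file_name)
            if not os.path.isfile(src):
                continue
            dst = os.path.join(folder_path, prefixed_name(folder_name, file_name))
            if src != dst:
                planned.append((src, dst))
                if not dry_run:
                    os.rename(src, dst)
    return planned
```

Calling `rename_with_folder_prefix("/path/to/dataset")` first and reviewing the returned pairs before rerunning with `dry_run=False` is the safer order. If the loader splits on the first underscore, folder names that themselves contain underscores may still be mislabeled and would need a different prefix.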

CUDA Assert Error · Issue #2 · sail-sg/MDT · GitHub
