[Error] MDT error: CUDA error: device-side assert triggered

The following error appeared while running MDT:

Traceback (most recent call last):
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 100, in <module>
    main()
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 46, in main
    TrainLoop(
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 178, in run_loop
    self.run_step(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 192, in run_step
    self.forward_backward(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 231, in forward_backward
    losses = compute_losses()
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 444, in forward
    y = self.y_embedder(y, self.training) # (N, D)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 191, in forward
    embeddings = self.embedding_table(labels)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
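
The last frames of the traceback are in the embedding lookup `self.embedding_table(labels)`. A common cause of a device-side assert at that point is a label index outside the embedding table's valid range. As a rough sketch of how to confirm this before the batch reaches the GPU, the helper below lists the offending labels. The name `find_out_of_range` and the value 1000 for `num_classes` are assumptions for illustration and are not part of MDT.

```python
def find_out_of_range(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes)."""
    return [
        (i, int(v))
        for i, v in enumerate(labels)
        if not 0 <= int(v) < num_classes
    ]


# Hypothetical batch of class labels; with a torch tensor, pass labels.tolist().
batch_labels = [0, 3, 1000, -1]
bad = find_out_of_range(batch_labels, num_classes=1000)
print(bad)  # [(2, 1000), (3, -1)]
```

An empty result means every label is a valid index, so the cause would have to be looked for elsewhere.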

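The fix is to rename every file in every folder to the form "foldername_imagename". The traceback ends in the label embedding lookup, which suggests the class labels were wrong; presumably they are derived from the file names, so each name has to carry its folder (class) name as a prefix.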
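
The renaming can be scripted. The following is a minimal sketch, assuming the dataset is laid out as one subfolder per class directly under a root directory. The function name `prefix_with_folder_name` and the `dry_run` flag are made up for this example. It defaults to a dry run so the planned renames can be reviewed first.

```python
from pathlib import Path


def prefix_with_folder_name(root, dry_run=True):
    """Rename root/<folder>/<image> to root/<folder>/<folder>_<image>.

    Returns the list of (old_path, new_path) pairs. Files are only
    renamed on disk when dry_run is False.
    """
    root = Path(root)
    planned = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        prefix = folder.name + "_"
        # Materialize the listing before renaming anything in the folder.
        for image in sorted(p for p in folder.iterdir() if p.is_file()):
            if image.name.startswith(prefix):
                continue  # already in "foldername_imagename" form
            target = image.with_name(prefix + image.name)
            planned.append((image, target))
            if not dry_run:
                image.rename(target)
    return planned
```

Calling `prefix_with_folder_name("/path/to/dataset")` only reports what would change; pass `dry_run=False` to apply it. If the loader takes the class from the text before the first underscore (an assumption here, not confirmed by the post), the folder names themselves should not contain underscores.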

CUDA Assert Error · Issue #2 · sail-sg/MDT · GitHub
