[Error] MDT error: CUDA error: device-side assert triggered

The following error occurred while running MDT:

Traceback (most recent call last):
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 100, in <module>
    main()
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 46, in main
    TrainLoop(
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 178, in run_loop
    self.run_step(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 192, in run_step
    self.forward_backward(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 231, in forward_backward
    losses = compute_losses()
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 444, in forward
    y = self.y_embedder(y, self.training) # (N, D)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 191, in forward
    embeddings = self.embedding_table(labels)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
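
The last frames of the traceback are an embedding lookup on the class labels. A device-side assert at that point usually means that some label index falls outside the valid range of the embedding table. The following is a minimal sketch of a range check that could be run on the raw labels before they reach the GPU. The function name and the `num_embeddings` parameter are hypothetical and are not part of MDT.

```python
def find_out_of_range_labels(labels, num_embeddings):
    """Return (position, label) pairs that cannot index a table with num_embeddings rows.

    labels: an iterable of plain Python ints (hypothetical input, e.g. the
            class ids produced by the data loader).
    num_embeddings: number of rows in the embedding table.
    """
    return [
        (position, label)
        for position, label in enumerate(labels)
        if not 0 <= label < num_embeddings
    ]
```

For example, `find_out_of_range_labels([0, 5, 1000, -1], 1000)` returns `[(2, 1000), (3, -1)]`, because only indices 0 through 999 are valid for a table with 1000 rows.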

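The fix is to rename every file in each folder to the form "foldername_imagename". The data loader appears to take the part of the file name before the first underscore as the class label, so files without that prefix can produce label indices that the embedding table cannot hold.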
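
The renaming can be scripted. The sketch below assumes a layout of one subfolder per class directly under a dataset root. The function name `rename_with_folder_prefix` and the `dry_run` flag are illustrative and do not come from the MDT repository.

```python
from pathlib import Path


def rename_with_folder_prefix(root, dry_run=True):
    """Rename root/<folder>/<image> to root/<folder>/<folder>_<image>.

    Returns the list of (old_path, new_path) pairs. With dry_run=True
    nothing on disk is changed, so the plan can be inspected first.
    """
    planned = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        prefix = folder.name + "_"
        for image in sorted(p for p in folder.iterdir() if p.is_file()):
            if image.name.startswith(prefix):
                continue  # already in "foldername_imagename" form
            target = image.with_name(prefix + image.name)
            if target.exists():
                # Refuse to overwrite an existing file.
                raise FileExistsError(f"{target} already exists")
            planned.append((image, target))
            if not dry_run:
                image.rename(target)
    return planned
```

Calling the function with `dry_run=True` first shows which files would be renamed. Calling it again with `dry_run=False` applies the changes. If a folder name itself contains an underscore, a loader that splits on the first underscore would see only the first part as the label, so such folders may need to be renamed as well.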

CUDA Assert Error · Issue #2 · sail-sg/MDT · GitHub
