[Exception] MDT error: CUDA error: device-side assert triggered

The following error appeared while running MDT:

Traceback (most recent call last):
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 100, in <module>
    main()
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 46, in main
    TrainLoop(
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 178, in run_loop
    self.run_step(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 192, in run_step
    self.forward_backward(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 231, in forward_backward
    losses = compute_losses()
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 444, in forward
    y = self.y_embedder(y, self.training)  # (N, D)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 191, in forward
    embeddings = self.embedding_table(labels)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
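
The last frames of the traceback are an embedding lookup on the class labels, and a common cause of a device-side assert at that point is an index that falls outside the embedding table. The following is a minimal sketch for checking a batch of labels before it reaches the model. The function name `find_out_of_range_labels` and the parameter `num_rows` are hypothetical and are not part of MDT; `num_rows` stands for however many rows the embedding table actually has.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_labels(labels: Iterable[int], num_rows: int) -> List[Tuple[int, int]]:
    """Return (position, label) pairs that a table with `num_rows` rows cannot look up.

    Hypothetical helper for illustration; valid indices are 0 .. num_rows - 1.
    """
    return [
        (pos, int(label))
        for pos, label in enumerate(labels)
        if not 0 <= int(label) < num_rows
    ]


if __name__ == "__main__":
    # Made-up batch: 1000 and -1 are invalid for a table with 1000 rows.
    print(find_out_of_range_labels([3, 999, 1000, -1], num_rows=1000))
```

For a tensor of labels, passing `labels.tolist()` should work with this sketch. An empty result means every label is within range, in which case the cause lies elsewhere.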

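The fix is to rename every file in each folder to the form "foldername_imagename". The traceback ends in the label embedding lookup (self.embedding_table(labels)), which is consistent with class labels falling outside the embedding table. That suggests the class label is taken from the part of the filename before the underscore, so files without the folder-name prefix end up with invalid labels.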
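
The renaming can be scripted. This is a minimal sketch, assuming a layout of one sub-folder per class directly under a dataset root; the function name `prefix_files_with_folder_name` and the `data_root` argument are hypothetical. It skips files that already carry the prefix, so it is safe to run twice. It is worth trying on a copy of the data first.

```python
from pathlib import Path


def prefix_files_with_folder_name(data_root: Path) -> int:
    """Rename data_root/<folder>/<image> to data_root/<folder>/<folder>_<image>.

    Returns the number of files renamed. Files directly under data_root
    and nested sub-folders are left untouched.
    """
    renamed = 0
    # Materialize the listings first so renaming does not disturb iteration.
    for folder in sorted(p for p in data_root.iterdir() if p.is_dir()):
        prefix = folder.name + "_"
        for image in sorted(p for p in folder.iterdir() if p.is_file()):
            if image.name.startswith(prefix):
                continue  # already in "foldername_imagename" form
            image.rename(image.with_name(prefix + image.name))
            renamed += 1
    return renamed
```

Usage would look like `prefix_files_with_folder_name(Path("/path/to/dataset"))`, where the path is a placeholder. If the label really is read from the text before the first underscore, folder names that themselves contain an underscore would need a different naming scheme.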

CUDA Assert Error · Issue #2 · sail-sg/MDT · GitHub
