[Error] MDT: CUDA error: device-side assert triggered

The following error occurred while running MDT:

Traceback (most recent call last):
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 100, in <module>
    main()
  File "/scratch/users/czakka/MDT/scripts/image_train.py", line 46, in main
    TrainLoop(
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 178, in run_loop
    self.run_step(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 192, in run_step
    self.forward_backward(batch, cond)
  File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 231, in forward_backward
    losses = compute_losses()
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 128, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 444, in forward
    y = self.y_embedder(y, self.training) # (N, D)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 191, in forward
    embeddings = self.embedding_table(labels)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
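
The last frames of the traceback are in the label embedding lookup (`self.embedding_table(labels)`), and a device-side assert at that point commonly means that some index is outside the table's valid range. As a rough sketch of how that could be checked before the lookup, the helper below lists the labels that would not be valid indices. The function name `out_of_range_labels` and the value `num_classes=1000` are assumptions made for illustration and are not taken from the MDT code.

```python
from typing import Iterable, List


def out_of_range_labels(labels: Iterable[int], num_classes: int) -> List[int]:
    """Return the sorted distinct labels that are not in [0, num_classes)."""
    return sorted({int(y) for y in labels if not 0 <= int(y) < num_classes})


# Hypothetical batch of class labels; 1000 classes is an assumed setting.
bad = out_of_range_labels([0, 5, 999, 1000, -1], num_classes=1000)
print(bad)  # [-1, 1000]
```

In a training script the labels would normally be a tensor, so they could be passed in as `labels.tolist()`. An empty result means the labels are not the cause, and the assert would have to come from somewhere else.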

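The fix is to rename every file in each folder to the form "foldername_imagename", that is, the folder name, an underscore, and the original image name. The traceback ends in the label embedding lookup (`self.embedding_table(labels)`), which suggests that the class labels derived from the original filenames were out of range.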
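
The renaming can be scripted. The following is a minimal sketch, assuming a dataset laid out as one sub-folder per class directly under a root directory. The function name `rename_with_folder_prefix` and the `dry_run` parameter are hypothetical and are not part of MDT.

```python
from pathlib import Path
from typing import List, Tuple


def rename_with_folder_prefix(root: Path, dry_run: bool = True) -> List[Tuple[Path, Path]]:
    """Rename root/<folder>/<image> to root/<folder>/<folder>_<image>.

    Returns the (old, new) pairs. With dry_run=True nothing is renamed,
    so the plan can be inspected first.
    """
    planned = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        prefix = folder.name + "_"
        for image in sorted(p for p in folder.iterdir() if p.is_file()):
            if image.name.startswith(prefix):
                continue  # already in "foldername_imagename" form
            target = image.with_name(prefix + image.name)
            if target.exists():
                # Refuse to overwrite an existing file.
                raise FileExistsError(target)
            planned.append((image, target))
            if not dry_run:
                image.rename(target)
    return planned
```

A cautious way to use it is to call `rename_with_folder_prefix(Path("/path/to/dataset"))` first (the path here is a placeholder), review the returned pairs, and then call it again with `dry_run=False`. Files that already carry the folder prefix are skipped, so running it twice is harmless.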

CUDA Assert Error · Issue #2 · sail-sg/MDT · GitHub
