I ran into the following error when running MDT training:
Traceback (most recent call last):
File "/scratch/users/czakka/MDT/scripts/image_train.py", line 100, in <module>
main()
File "/scratch/users/czakka/MDT/scripts/image_train.py", line 46, in main
TrainLoop(
File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 178, in run_loop
self.run_step(batch, cond)
File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 192, in run_step
self.forward_backward(batch, cond)
File "/scratch/users/czakka/MDT/masked_diffusion/train_util.py", line 231, in forward_backward
losses = compute_losses()
File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 97, in training_losses
return super().training_losses(self._wrap_model(model), *args, **kwargs)
File "/scratch/users/czakka/MDT/masked_diffusion/gaussian_diffusion.py", line 747, in training_losses
model_output = model(x_t, t, **model_kwargs)
File "/scratch/users/czakka/MDT/masked_diffusion/respace.py", line 128, in __call__
return self.model(x, new_ts, **kwargs)
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 444, in forward
y = self.y_embedder(y, self.training) # (N, D)
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/users/czakka/MDT/masked_diffusion/models.py", line 191, in forward
embeddings = self.embedding_table(labels)
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/home/users/czakka/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
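Solution: every file in each folder has to be renamed to the form "foldername_imagename"; after that, training runs. The failing call is the label embedding lookup (self.embedding_table(labels)), and a device-side assert at that point usually means a label index lies outside the embedding table. Presumably the data loader takes the class name from the part of the filename before the first underscore, so without the folder-name prefix the class labels come out wrong.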
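The renaming can be scripted. Below is a minimal sketch, assuming a layout of data_root/<folder>/<image> with one folder per class; the function name prefix_with_folder_name and the data_root argument are made up for illustration and are not part of MDT. It defaults to a dry run that only reports what would be renamed.

```python
from pathlib import Path


def prefix_with_folder_name(data_root, dry_run=True):
    """Rename data_root/<folder>/<image> to data_root/<folder>/<folder>_<image>.

    Hypothetical helper, not part of MDT. Returns a list of
    (old_path, new_path) pairs. With dry_run=True nothing is renamed.
    """
    renamed = []
    for folder in sorted(Path(data_root).iterdir()):
        if not folder.is_dir():
            continue
        prefix = folder.name + "_"
        # sorted() materializes the listing first, so renaming inside
        # the loop does not disturb the iteration.
        for image in sorted(folder.iterdir()):
            # Skip subdirectories and files that already carry the prefix,
            # which makes the function safe to run more than once.
            if not image.is_file() or image.name.startswith(prefix):
                continue
            target = image.with_name(prefix + image.name)
            if target.exists():
                raise FileExistsError(f"refusing to overwrite {target}")
            if not dry_run:
                image.rename(target)
            renamed.append((image, target))
    return renamed
```

Call it first as prefix_with_folder_name("/path/to/dataset") to inspect the planned renames, then again with dry_run=False to apply them. If the loader really does split on the first underscore, folder names that themselves contain an underscore would still be cut short, so it may be safer to use folder names without underscores.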