torch.distributed.launch, torchrun, and torch.distributed.run are not compatible with nohup

Symptom:

After torch distributed training is started with nohup, the training run fails as soon as the SSH connection to the server is dropped:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971878 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971879 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3971841 got signal: 1

The command that was run:

nohup ./my_train.sh > log.log 2>&1 &

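The likely cause is that torch.distributed.launch, torchrun, and torch.distributed.run are not compatible with nohup. nohup works by making the command ignore SIGHUP (signal 1). However, when the SSH connection drops and the terminal window is closed, torch.distributed handles the signal itself (the traceback ends in _terminate_process_handler, which raises SignalException for signal 1), so nohup has no effect and the workers are shut down.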
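The mechanism can be illustrated with a minimal Python sketch. It assumes, consistent with `_terminate_process_handler` in the traceback above, that the launcher registers its own SIGHUP handler, which replaces the "ignore" disposition that nohup sets up before starting the command. The function names below are hypothetical stand-ins, not PyTorch APIs, and the sketch runs on Unix-like systems only.

```python
import os
import signal

# Signals that actually reached a Python-level handler.
received = []


def simulate_nohup():
    # nohup starts the command with SIGHUP set to "ignore".
    signal.signal(signal.SIGHUP, signal.SIG_IGN)


def simulate_launcher_handler():
    # Hypothetical stand-in for a launcher that installs its own handler.
    # Installing a handler replaces the inherited "ignore" disposition.
    def handler(signum, frame):
        received.append(signum)

    signal.signal(signal.SIGHUP, handler)


def send_sighup_to_self():
    # Stand-in for the hangup sent when the terminal session goes away.
    os.kill(os.getpid(), signal.SIGHUP)


# Step 1: with only nohup's disposition in place, SIGHUP is ignored.
simulate_nohup()
send_sighup_to_self()
ignored_count = len(received)

# Step 2: once the process installs its own handler, SIGHUP is delivered again.
simulate_launcher_handler()
send_sighup_to_self()
handled_count = len(received)

print("handled while ignored:", ignored_count)
print("handled after custom handler:", handled_count)
```

In this sketch the first hangup is dropped and the second one reaches the handler, which mirrors the behavior in the log: the protection set up by nohup only lasts until the process changes its own SIGHUP disposition.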

ref: https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/6
