W1206 18:53:57.353000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18835 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18836 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18837 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18839 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18840 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18841 closing signal SIGTERM
W1206 18:53:57.355000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18842 closing signal SIGTERM
E1206 18:53:57.387000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 18838) of binary: /home/chacha/anaconda3/envs/drive/bin/python
Traceback (most recent call last):
  File "/home/chacha/anaconda3/envs/drive/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/drive/DriveDreamer-main/DriveDreamer-main/dreamer-train/dreamer_train/distributed/run_task.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-06_18:53:57
host : DESKTOP-EJP3C2O.localdomain
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 18838)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/home/drive/DriveDreamer-main/DriveDreamer-main/./dreamer-train/projects/launch.py", line 35, in <module>
    main()
  File "/home/drive/DriveDreamer-main/DriveDreamer-main/./dreamer-train/projects/launch.py", line 32, in main
    launch_from_config(config_path, ','.join(opts.runners))
  File "/home/drive/DriveDreamer-main/DriveDreamer-main/dreamer-train/dreamer_train/distributed/launch.py", line 175, in launch_from_config
    launcher.launch('{} --config {} --runners {}'.format(file_path, config_path, runners))
  File "/home/drive/DriveDreamer-main/DriveDreamer-main/dreamer-train/dreamer_train/distributed/launch.py", line 159, in launch
    os.remove(self.hostfile_path)
FileNotFoundError: [Errno 2] No such file or directory: '_tmp/2024-12-06-185351_hostfile'
The code is taken from GitHub: https://github.com/JeffWang987/DriveDreamer
When run, this code is supposed to create a configuration file and a hostfile with timestamps in their names, but the hostfile is not created correctly.
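
For reference, the last traceback fails in `os.remove(self.hostfile_path)` on the relative path `_tmp/2024-12-06-185351_hostfile`. Below is a minimal sketch of how a timestamped hostfile could be written and cleaned up so that a missing file does not raise during cleanup. It is not the DriveDreamer launcher code: `make_hostfile`, `remove_quietly`, and the `localhost slots=8` line are hypothetical names and contents chosen for illustration, and the sketch assumes the problem is a missing directory or an already-deleted file.

```python
import os
import tempfile
from contextlib import suppress
from datetime import datetime


def make_hostfile(tmp_dir, hosts):
    """Create tmp_dir if needed and write a timestamped hostfile into it.

    Hypothetical helper, not part of DriveDreamer.
    """
    # Resolve to an absolute path so later cleanup does not depend on
    # the current working directory.
    tmp_dir = os.path.abspath(tmp_dir)
    os.makedirs(tmp_dir, exist_ok=True)
    # Same naming pattern as in the log: 2024-12-06-185351_hostfile
    stamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
    path = os.path.join(tmp_dir, f"{stamp}_hostfile")
    with open(path, "w") as f:
        f.write("\n".join(hosts) + "\n")
    return path


def remove_quietly(path):
    """Delete path; do nothing if the file is already gone."""
    with suppress(FileNotFoundError):
        os.remove(path)


if __name__ == "__main__":
    # The hostfile content here is an assumed example line.
    hostfile = make_hostfile(os.path.join(tempfile.mkdtemp(), "_tmp"),
                             ["localhost slots=8"])
    try:
        with open(hostfile) as f:
            print(f.read(), end="")
    finally:
        remove_quietly(hostfile)
        remove_quietly(hostfile)  # second call is a no-op, not an error
```

Note that a tolerant cleanup only hides the `FileNotFoundError`. The log also reports that rank 3 of `run_task.py` exited with code 1, and that failure would still need its own traceback to diagnose.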