W1206 18:53:57.353000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18835 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18836 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18837 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18839 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18840 closing signal SIGTERM
W1206 18:53:57.354000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18841 closing signal SIGTERM
W1206 18:53:57.355000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 18842 closing signal SIGTERM
E1206 18:53:57.387000 18802 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 18838) of binary: /home/chacha/anaconda3/envs/drive/bin/python
Traceback (most recent call last):
File "/home/chacha/anaconda3/envs/drive/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/chacha/anaconda3/envs/drive/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/drive/DriveDreamer-main/DriveDreamer-main/dreamer-train/dreamer_train/distributed/run_task.py FAILED
Root Cause (first observed failure):
[0]:
time: 2024-12-06_18:53:57
host: DESKTOP-EJP3C2O.localdomain
rank: 3 (local_rank: 3)
exitcode: 1 (pid: 18838)
error_file:
traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/home/drive/DriveDreamer-main/DriveDreamer-main/./dreamer-train/projects/launch.py", line 35, in <module>
main()
File "/home/drive/DriveDreamer-main/DriveDreamer-main/./dreamer-train/projects/launch.py", line 32, in main
launch_from_config(config_path, ','.join(opts.runners))
File "/home/drive/DriveDreamer-main/DriveDreamer-main/dreamer-train/dreamer_train/distributed/launch.py", line 175, in launch_from_config
launcher.launch('{} --config {} --runners {}'.format(file_path, config_path, runners))
File "/home/drive/DriveDreamer-main/DriveDreamer-main/dreamer-train/dreamer_train/distributed/launch.py", line 159, in launch
os.remove(self.hostfile_path)
FileNotFoundError: [Errno 2] No such file or directory: '_tmp/2024-12-06-185351_hostfile'
When I run the code, it creates a config file and a hostfile, both with timestamped names, but the hostfile is not created correctly: the launcher later fails in os.remove(self.hostfile_path) because the file does not exist.
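For illustration only, here is a minimal sketch of how a timestamped hostfile could be written and cleaned up so that a missing file does not raise FileNotFoundError. It is not the DriveDreamer launcher code: the helper names write_hostfile and remove_hostfile, the example host entry, and the use of an absolute path are all assumptions. The timestamp format is inferred from the file name in the traceback above. This only makes the cleanup tolerant; it does not explain why the hostfile was missing in the first place.

```python
import contextlib
import os
import tempfile
from datetime import datetime


def write_hostfile(tmp_dir, hosts):
    """Hypothetical helper: write one host per line to a timestamped hostfile."""
    # Create the directory first; open() fails if '_tmp' does not exist yet.
    os.makedirs(tmp_dir, exist_ok=True)
    stamp = datetime.now().strftime('%Y-%m-%d-%H%M%S')
    # Use an absolute path so a later change of working directory
    # cannot make the cleanup step look in the wrong place.
    hostfile_path = os.path.abspath(os.path.join(tmp_dir, stamp + '_hostfile'))
    with open(hostfile_path, 'w') as f:
        f.write('\n'.join(hosts) + '\n')
    return hostfile_path


def remove_hostfile(hostfile_path):
    """Hypothetical helper: delete the hostfile, ignoring an already-missing file."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(hostfile_path)


if __name__ == '__main__':
    base = tempfile.mkdtemp()
    path = write_hostfile(os.path.join(base, '_tmp'), ['localhost'])
    print('hostfile exists:', os.path.isfile(path))
    remove_hostfile(path)
    remove_hostfile(path)  # second call is a no-op instead of an error
```

If the relative path '_tmp/...' is resolved against a different working directory at cleanup time than at creation time, or if the file is removed by another process first, the original os.remove call would fail exactly as shown in the traceback. Both are guesses that would need to be checked against the actual launch.py.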
More details here: https://stackoverflow.com/questions/792 ... -correctly