CUDA job suddenly fails on an H100, and then CUDA is reported as unavailable until reboot


Post by Anonymous »

My multi-GPU H100 job on Rocky Linux suddenly crashes after roughly a day of training with the error below, and until the node is rebooted the GPUs stay unusable: torch.cuda.is_available() keeps returning False, and simple CUDA programs fail as well. I have tried different CUDA and driver versions, but that did not help.
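For reference, even a check as small as the following (a rough sketch, not the exact test program I ran) reports CUDA as unavailable after the crash, so the problem is not specific to the training code:

import torch

# After the failure this prints False until the node is rebooted.
print(torch.__version__, torch.version.cuda)
print("cuda available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Trivial kernel launch; on a healthy node this prints a GPU tensor.
    print(torch.ones(4, device="cuda") * 2)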
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14f5799e8897 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14f579998b25 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14f579ac0718 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14f57acbd8e6 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x14f57acc19e8 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x14f57acc705c in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14f57acc7dcc in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b75 (0x14f5c6777b75 in /mnt/beegfs/home/skatar6/anaconda3/envs/tmp4/bin/../lib/libstdc++.so.6)
frame #8: + 0x89c02 (0x14f5c75e7c02 in /lib64/libc.so.6)
frame #9: + 0x10ec40 (0x14f5c766cc40 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14f5799e8897 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14f579998b25 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14f579ac0718 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14f57acbd8e6 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x14f57acc19e8 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x14f57acc705c in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14f57acc7dcc in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b75 (0x14f5c6777b75 in /mnt/beegfs/home/skatar6/anaconda3/envs/tmp4/bin/../lib/libstdc++.so.6)
frame #8: + 0x89c02 (0x14f5c75e7c02 in /lib64/libc.so.6)
frame #9: + 0x10ec40 (0x14f5c766cc40 in /lib64/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14f5799e8897 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x14f57a94b119 in /home/skatar6/.local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3b75 (0x14f5c6777b75 in /mnt/beegfs/home/skatar6/anaconda3/envs/tmp4/bin/../lib/libstdc++.so.6)
frame #3: + 0x89c02 (0x14f5c75e7c02 in /lib64/libc.so.6)
frame #4: + 0x10ec40 (0x14f5c766cc40 in /lib64/libc.so.6)
W1101 18:26:30.049387 22421810101312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 75430 closing signal SIGTERM
W1101 18:26:30.054049 22421810101312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 75431 closing signal SIGTERM
W1101 18:26:30.079263 22421810101312 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 75433 closing signal SIGTERM
Traceback (most recent call last):
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs0000000a80005197000000ac'
E1101 18:26:31.068613 22421810101312 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 2 (pid: 75432) of binary: /home/skatar6/anaconda3/envs/tmp4/bin/python
Traceback (most recent call last):
  File "/home/skatar6/anaconda3/envs/tmp4/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/skatar6/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/skatar6/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/skatar6/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
pretrain_iter.py FAILED
------------------------------------------------------
Failures:
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-11-01_18:26:30
  host       : nodex.cluster
  rank       : 2 (local_rank: 2)
  exitcode   : -6 (pid: 75432)
  error_file :
  traceback  : Signal 6 (SIGABRT) received by PID 75432
======================================================
Traceback (most recent call last):
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/multiprocessing/util.py", line 133, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/shutil.py", line 734, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/home/skatar6/anaconda3/envs/tmp4/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
OSError: [Errno 16] Device or resource busy: '.nfs0000000b00001b84000000ad'
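As the error text itself suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the failing call is reported at its real call site. A minimal sketch of setting it from the training script (exporting the variable in the shell that runs accelerate launch works just as well):

import os

# Must be set before the CUDA context is created; doing it before importing
# torch is the safest place.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch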


More details here: https://stackoverflow.com/questions/791 ... til-reboot