DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
ubuntu@ubuntu:~$ nvidia-smi
Thu Apr 11 11:39:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   33C    P8              33W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:2B:00.0 Off |                  N/A |
|  0%   37C    P8              34W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  | 00000000:41:00.0  On |                  N/A |
|  0%   33C    P8              30W / 350W |    155MiB / 24576MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  | 00000000:61:00.0 Off |                  N/A |
|  0%   32C    P8              32W / 350W |     12MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
Package Version
------------------------- --------------
accelerate 0.29.2
bitsandbytes 0.43.0
deepspeed 0.14.0
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
peft 0.10.0
safetensors 0.4.2
tokenizers 0.15.2
torch 2.2.2
transformers 4.39.3
trl 0.8.1
CUDA_VISIBLE_DEVICES=0 python3 train.py
CUDA_VISIBLE_DEVICES=1 python3 train.py
CUDA_VISIBLE_DEVICES=2 python3 train.py
CUDA_VISIBLE_DEVICES=3 python3 train.py
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train.py
I get assertion warnings from the CUDA loss kernel, and the run ends with a CUDA error.
warnings.warn(
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
.
.
.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
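For reference, the assertion `t >= 0 && t < n_classes` in the log above means that some target label lies outside the range of class indices the loss expects. A minimal sketch of a CPU-side check follows. It assumes the labels are available as plain integers and that the loss uses PyTorch's default `ignore_index` of -100; the function name and the sample values are hypothetical, not taken from `train.py`.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_labels(
    labels: Iterable[int],
    n_classes: int,
    ignore_index: int = -100,  # PyTorch's default for nll_loss / cross_entropy
) -> List[Tuple[int, int]]:
    """Return (position, label) pairs that would fail `t >= 0 && t < n_classes`."""
    bad = []
    for pos, t in enumerate(labels):
        if t == ignore_index:
            # Ignored targets are skipped by the loss and never checked.
            continue
        if not (0 <= t < n_classes):
            bad.append((pos, t))
    return bad


# Hypothetical batch: the model produces 32000 logits, one label equals 32000.
sample_labels = [1, 5, -100, 31999, 32000]
print(find_out_of_range_labels(sample_labels, n_classes=32000))
```

Running a check like this on a batch before it reaches the GPU turns the asynchronous device-side assert into an ordinary Python result that points at the offending label.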
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
ubuntu@ubuntu:~$ nvidia-smi
Thu Apr 11 11:52:58 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         Off| 00000000:01:00.0 Off |                  N/A |
|  0%   43C    P8               40W / 390W|    251MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         Off| 00000000:06:00.0 Off |                  N/A |
|  0%   35C    P8               44W / 390W|     10MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
Package Version
------------------------- ------------
accelerate 0.24.1
bitsandbytes 0.41.1
deepspeed 0.11.1
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu12 12.1.105
peft 0.5.0
safetensors 0.4.0
sentencepiece 0.1.99
tokenizers 0.14.1
torch 2.1.0
transformers 4.34.1
trl 0.7.11
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node=4 train.py
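For context, the two launch styles in this post differ in process layout: plain `python3 train.py` starts a single process that sees every GPU listed in `CUDA_VISIBLE_DEVICES`, while `torchrun --nproc-per-node=4` starts four processes and sets `LOCAL_RANK` and `WORLD_SIZE` in each of them. The sketch below only inspects those environment variables. The helper name `describe_launch` is hypothetical, and other launchers may set `LOCAL_RANK` as well, so treat the result as a hint rather than a definitive answer.

```python
import os
from typing import Mapping, Optional


def describe_launch(env: Optional[Mapping[str, str]] = None) -> dict:
    """Summarise how the current process appears to have been launched."""
    env = os.environ if env is None else env
    # Counts only devices listed explicitly; 0 means the variable is unset or empty.
    visible = [d for d in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if d.strip()]
    if "LOCAL_RANK" in env:
        # torchrun exports LOCAL_RANK and WORLD_SIZE to every worker process.
        return {
            "mode": "one process per GPU",
            "local_rank": int(env["LOCAL_RANK"]),
            "world_size": int(env.get("WORLD_SIZE", "1")),
            "visible_gpus": len(visible),
        }
    return {
        "mode": "single process",
        "local_rank": 0,
        "world_size": 1,
        "visible_gpus": len(visible),
    }


if __name__ == "__main__":
    print(describe_launch())
```

Printing this at the top of a training script shows which of the two layouts a given command actually produced.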
I tried installing the same environment on the first server, but got the same warnings and errors.
I also tried building the environment directly on the host and got the same warnings and errors.
Thank you very much!
More details here: https://stackoverflow.com/questions/783 ... rnings-and