TensorFlow and CUDA: GPU usage and CUDA compatibility


Posted by Anonymous

I am trying to train a TensorFlow model on a GPU, but I have been struggling to get it to run.
Essentially, I am using a high-performance computing cluster and submitting jobs through Slurm, and the jobs fail with errors.
My environment is as follows:

# packages in environment at /data/leuven/351/vsc35132/miniconda3/envs/SpecCheck:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 2.2.1 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6
ca-certificates 2025.2.25 h06a4308_0
certifi 2025.1.31 pypi_0 pypi
charset-normalizer 3.4.1 pypi_0 pypi
expat 2.6.4 h6a678d5_0
flatbuffers 25.2.10 pypi_0 pypi
gast 0.6.0 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.71.0 pypi_0 pypi
h5py 3.13.0 pypi_0 pypi
idna 3.10 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
keras 3.9.1 pypi_0 pypi
ld_impl_linux-64 2.40 h12ee557_0
libclang 18.1.1 pypi_0 pypi
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
markdown 3.7 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 3.0.2 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
ml-dtypes 0.5.1 pypi_0 pypi
namex 0.0.8 pypi_0 pypi
ncurses 6.4 h6a678d5_0
numpy 2.1.3 pypi_0 pypi
nvidia-cublas-cu12 12.5.3.2 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.5.82 pypi_0 pypi
nvidia-cuda-nvcc-cu12 12.5.82 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.5.82 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.5.82 pypi_0 pypi
nvidia-cudnn-cu12 9.3.0.75 pypi_0 pypi
nvidia-cufft-cu12 11.2.3.61 pypi_0 pypi
nvidia-curand-cu12 10.3.6.82 pypi_0 pypi
nvidia-cusolver-cu12 11.6.3.83 pypi_0 pypi
nvidia-cusparse-cu12 12.5.1.3 pypi_0 pypi
nvidia-nccl-cu12 2.23.4 pypi_0 pypi
nvidia-nvjitlink-cu12 12.5.82 pypi_0 pypi
openssl 3.0.16 h5eee18b_0
opt-einsum 3.4.0 pypi_0 pypi
optree 0.14.1 pypi_0 pypi
packaging 24.2 pypi_0 pypi
pandas 2.2.3 pypi_0 pypi
pip 25.0 py312h06a4308_0
protobuf 5.29.4 pypi_0 pypi
pyarrow 19.0.1 pypi_0 pypi
pygments 2.19.1 pypi_0 pypi
python 3.12.9 h5148396_0
python-dateutil 2.9.0.post0 pypi_0 pypi
pytz 2025.2 pypi_0 pypi
readline 8.2 h5eee18b_0
requests 2.32.3 pypi_0 pypi
rich 14.0.0 pypi_0 pypi
scikit-learn 1.6.1 pypi_0 pypi
scipy 1.15.2 pypi_0 pypi
setuptools 75.8.0 py312h06a4308_0
six 1.17.0 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0
tensorboard 2.19.0 pypi_0 pypi
tensorboard-data-server 0.7.2 pypi_0 pypi
tensorflow 2.19.0 pypi_0 pypi
termcolor 3.0.0 pypi_0 pypi
threadpoolctl 3.6.0 pypi_0 pypi
tk 8.6.14 h39e8969_0
typing-extensions 4.13.0 pypi_0 pypi
tzdata 2025.2 pypi_0 pypi
urllib3 2.3.0 pypi_0 pypi
werkzeug 3.1.3 pypi_0 pypi
wheel 0.45.1 py312h06a4308_0
wrapt 1.17.2 pypi_0 pypi
xz 5.6.4 h5eee18b_1
zlib 1.2.13 h5eee18b_1
And the output of my Slurm job is as follows:

========================================================================
/var/spool/slurmd/job64225291/slurm_script: line 14: nvcc: command not found
Tue Apr  1 12:45:15 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 80G...  On   | 00000000:26:00.0 Off |                    0 |
| N/A   28C    P0    73W / 700W |      0MiB / 81559MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2025-04-01 12:45:24.661857: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-04-01 12:45:29.017117: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1743504330.375728  709539 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743504330.649505  709539 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1743504332.740332  709539 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1743504332.740392  709539 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1743504332.740395  709539 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1743504332.740396  709539 computation_placer.cc:177] computation placer already registered.  Please check linkage and avoid linking the same target more than once.
2025-04-01 12:45:32.806499: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0000 00:00:1743504429.190211  709539 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78430 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:26:00.0, compute capability: 9.0
2025-04-01 12:47:33.004061: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_UNSUPPORTED_PTX_VERSION'

2025-04-01 12:47:33.004102: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'

2025-04-01 12:47:33.004111: W tensorflow/core/framework/op_kernel.cc:1844] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
2025-04-01 12:47:33.004128: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
Memory growth set for GPUs
Traceback (most recent call last):
File "/vsc-hard-mounts/leuven-data/351/vsc35132/SpecCheck/SlurmFiles/../Scripts/SpecCheckVSC.py", line 944, in <module>
main()
File "/vsc-hard-mounts/leuven-data/351/vsc35132/SpecCheck/SlurmFiles/../Scripts/SpecCheckVSC.py", line 927, in main
fit_model(positive_file_path=args.Positive_samples,
File "/vsc-hard-mounts/leuven-data/351/vsc35132/SpecCheck/SlurmFiles/../Scripts/SpecCheckVSC.py", line 652, in fit_model
model = create_transformer_model(sequence_length, embedding_dim, num_heads, ff_dim, num_transformer_blocks,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vsc-hard-mounts/leuven-data/351/vsc35132/SpecCheck/SlurmFiles/../Scripts/SpecCheckVSC.py", line 285, in create_transformer_model
combined_embeddings = layers.Conv1D(filters=embedding_dim,kernel_size=5,padding='same',activation='relu')(combined_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/leuven/351/vsc35132/miniconda3/envs/h100_tf/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/data/leuven/351/vsc35132/miniconda3/envs/h100_tf/lib/python3.11/site-packages/keras/src/backend/tensorflow/core.py", line 139, in convert_to_tensor
return tf.cast(x, dtype)
^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: {{function_node __wrapped__Cast_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Cast] name:

However, I installed TensorFlow as described at:
https://www.tensorflow.org/install/pip
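
One thing that stands out in the output above: the nvidia-smi header reports "CUDA Version: 12.0", which is the newest CUDA version the installed driver supports, while the environment lists nvidia-cuda-runtime-cu12 12.5.82. CUDA_ERROR_UNSUPPORTED_PTX_VERSION is the error the CUDA driver returns when it is handed PTX produced by a toolkit newer than the driver supports, so this mismatch is a plausible cause, though not a confirmed one. Below is a minimal sketch of how the two versions could be compared. The helper names (driver_cuda_version, runtime_cuda_version, runtime_newer_than_driver) are hypothetical and exist only for this illustration.

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Return (major, minor) from the 'CUDA Version' field of nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def runtime_cuda_version(package_version):
    """Return (major, minor) from a pip package version string such as '12.5.82'."""
    major, minor = package_version.split(".")[:2]
    return int(major), int(minor)


def runtime_newer_than_driver(nvidia_smi_text, package_version):
    """True if the pip-installed CUDA runtime is newer than what the driver reports."""
    driver = driver_cuda_version(nvidia_smi_text)
    if driver is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return runtime_cuda_version(package_version) > driver


# Values copied from the job output and the package list in this post.
header = "| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |"
print(runtime_newer_than_driver(header, "12.5.82"))  # True
```

Inside a real job, the text could come from subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout and the package version from importlib.metadata.version("nvidia-cuda-runtime-cu12"). If the check prints True, kernels that must be JIT-compiled from PTX by the driver may fail even though the GPU itself is detected, which would match the log above.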

More details here: https://stackoverflow.com/questions/795 ... patability