    // final block
    MPI_Info_free(info1);
    MPI_Free_mem(tab1);
    MPI_Finalize();
    return 0;
}
[/code]
I got different results depending on the OpenMPI version used:
[code]$ mpirun --version
mpirun (Open MPI) 5.0.8
$ mpiCC --version
g++ (GCC) 15.2.1 20251211 (Red Hat 15.2.1-5)
$ mpiCC test99.cpp
$ mpirun -n 3 a.out 210
[grad:78087:0:78087] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f707ae61978)
[grad:78085:0:78085] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f83ecc61978)
[grad:78086:0:78086] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f7382461978)
==== backtrace (tid: 78087) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f707a1b2df4]
 1 /lib64/libucs.so.0(+0x17aed) [0x7f707a1b4aed]
 2 /lib64/libucs.so.0(+0x17cbd) [0x7f707a1b4cbd]
 3 /lib64/libc.so.6(+0x1a070) [0x7f707aa28070]
 4 /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Info_create+0x26) [0x7f707b07d5c6]
 5 a.out() [0x4005c0]
 6 /lib64/libc.so.6(+0x3575) [0x7f707aa11575]
 7 /lib64/libc.so.6(__libc_start_main+0x88) [0x7f707aa11628]
 8 a.out() [0x400445]
=================================
==== backtrace (tid: 78085) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f83eccb3df4]
 1 /lib64/libucs.so.0(+0x17aed) [0x7f83eccb5aed]
 2 /lib64/libucs.so.0(+0x17cbd) [0x7f83eccb5cbd]
 3 /lib64/libc.so.6(+0x1a070) [0x7f83ec828070]
 4 /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Info_create+0x26) [0x7f83ece7d5c6]
 5 a.out() [0x4005c0]
 6 /lib64/libc.so.6(+0x3575) [0x7f83ec811575]
 7 /lib64/libc.so.6(__libc_start_main+0x88) [0x7f83ec811628]
 8 a.out() [0x400445]
=================================
==== backtrace (tid: 78086) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f7381d8bdf4]
 1 /lib64/libucs.so.0(+0x17aed) [0x7f7381d8daed]
 2 /lib64/libucs.so.0(+0x17cbd) [0x7f7381d8dcbd]
 3 /lib64/libc.so.6(+0x1a070) [0x7f7382028070]
 4 /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Info_create+0x26) [0x7f738267d5c6]
 5 a.out() [0x4005c0]
 6 /lib64/libc.so.6(+0x3575) [0x7f7382011575]
 7 /lib64/libc.so.6(__libc_start_main+0x88) [0x7f7382011628]
 8 a.out() [0x400445]
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 78085 on node grad exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[/code]
and for an older version of OpenMPI:
[code]$ mpirun --version
mpirun (Open MPI) 4.1.1
$ mpiCC --version
g++ (GCC) 11.5.0 20240719 (Red Hat 11.5.0-11)
$ mpiCC test99.cpp
$ mpirun -n 3 a.out 200
[vmi2927342:31228:0:31228] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 31228) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f0bb0b80714]
 1 /lib64/libucs.so.0(+0x2a2ac) [0x7f0bb0b822ac]
 2 /lib64/libucs.so.0(+0x2a46a) [0x7f0bb0b8246a]
 3 /lib64/libc.so.6(__cxa_finalize+0x60) [0x7f0bba241870]
 4 /usr/lib64/openmpi/lib/openmpi/mca_pml_ucx.so(+0x3987) [0x7f0bb808b987]
=================================
[vmi2927342:31228] *** Process received signal ***
[vmi2927342:31228] Signal: Segmentation fault (11)
[vmi2927342:31228] Signal code: (-6)
[vmi2927342:31228] Failing at address: 0x3e8000079fc
[vmi2927342:31228] [ 0] /lib64/libc.so.6(+0x3fc30)[0x7f0bba23fc30]
[vmi2927342:31228] [ 1] /lib64/libc.so.6(__cxa_finalize+0x60)[0x7f0bba241870]
[vmi2927342:31228] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ucx.so(+0x3987)[0x7f0bb808b987]
[vmi2927342:31228] *** End of error message ***
[vmi2927342:31227:0:31227] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 31227) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f89601ef714]
 1 /lib64/libucs.so.0(+0x2a2ac) [0x7f89601f12ac]
 2 /lib64/libucs.so.0(+0x2a46a) [0x7f89601f146a]
 3 /lib64/libc.so.6(__cxa_finalize+0x60) [0x7f8964441870]
 4 /usr/lib64/openmpi/lib/openmpi/mca_pml_ucx.so(+0x3987) [0x7f896216b987]
=================================
[vmi2927342:31227] *** Process received signal ***
[vmi2927342:31227] Signal: Segmentation fault (11)
[vmi2927342:31227] Signal code: (-6)
[vmi2927342:31227] Failing at address: 0x3e8000079fb
[vmi2927342:31227] [ 0] /lib64/libc.so.6(+0x3fc30)[0x7f896443fc30]
[vmi2927342:31227] [ 1] /lib64/libc.so.6(__cxa_finalize+0x60)[0x7f8964441870]
[vmi2927342:31227] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ucx.so(+0x3987)[0x7f896216b987]
[vmi2927342:31227] *** End of error message ***
[vmi2927342:31226:0:31226] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 31226) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7fdb3c04c714]
 1 /lib64/libucs.so.0(+0x2a2ac) [0x7fdb3c04e2ac]
 2 /lib64/libucs.so.0(+0x2a46a) [0x7fdb3c04e46a]
 3 /lib64/libc.so.6(__cxa_finalize+0x60) [0x7fdb44c41870]
 4 /usr/lib64/openmpi/lib/openmpi/mca_pml_ucx.so(+0x3987) [0x7fdb3f69c987]
=================================
[vmi2927342:31226] *** Process received signal ***
[vmi2927342:31226] Signal: Segmentation fault (11)
[vmi2927342:31226] Signal code: (-6)
[vmi2927342:31226] Failing at address: 0x3e8000079fa
[vmi2927342:31226] [ 0] /lib64/libc.so.6(+0x3fc30)[0x7fdb44c3fc30]
[vmi2927342:31226] [ 1] /lib64/libc.so.6(__cxa_finalize+0x60)[0x7fdb44c41870]
[vmi2927342:31226] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ucx.so(+0x3987)[0x7fdb3f69c987]
[vmi2927342:31226] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node vmi2927342 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[/code]
[list]
[*]OpenMPI 5 crashes as soon as it enters MPI_Info_create. This does not happen in version 4.
[*]OpenMPI 4 runs everything up to the final MPI_Finalize, where every rank segfaults at address (nil).
[/list]
I don't see anything wrong in the source code that could cause these problems.