Under our Slurm nodes, hybrid engine resets to 1 MPI process and 1 Pthread, and fails when expected > 1 processes #33

Open
KADichev opened this issue Oct 28, 2024 · 1 comment

KADichev commented Oct 28, 2024

If we book a node, e.g. via

srun -p Cascade --ntasks 1 --cpus-per-task 32  -t 08:00:00 --pty /bin/bash

Note that this is NOT the typical way to request MPI resources, but I prefer it because it gives us many cores, which makes compilation fast.

E.g., using the branch
https://github.com/Algebraic-Programming/LPF/tree/functional_tests_use_gtest

most hybrid engine jobs then fail any test that checks that we run with > 1 task. For example:

ctest -R hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc --verbose
UpdateCTestConfiguration  from :/home/kdichev/LPF/build-x86/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/kdichev/LPF/build-x86/DartConfiguration.tcl
Test project /home/kdichev/LPF/build-x86
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 583
    Start 583: hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc

583: Test command: /home/kdichev/LPF/build-x86/test_launcher.py "-e" "hybrid" "-L" "/home/kdichev/LPF/build-x86/lpfrun_build" "-p" "2" "-P" "5" "-t" "0.0" "-R" "0" "/home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug" "--gtest_filter=API.func_lpf_exec_multiple_call_single_arg_dual_proc" "--gtest_also_run_disabled_tests" "--gtest_output=xml:/home/kdichev/LPF/build-x86/junit/hybrid_func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug"
583: Working Directory: /home/kdichev/LPF/build-x86/tests/functional
583: Test timeout computed to be: 10000000
583: Running main() from /scratch/kdichev/.spack/stage/spack-stage-googletest-1.14.0-afvplm5m2qrmzvpapg7hx7dbfqff332z/spack-src/googletest/src/gtest_main.cc
583: Note: Google Test filter = API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: [==========] Running 1 test from 1 test suite.
583: [----------] Global test environment set-up.
583: [----------] 1 test from API
583: [ RUN      ] API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: /home/kdichev/LPF/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc.cpp:31: Failure
583: Expected equality of these values:
583:   nprocs
583:     Which is: 1
583:   2
583: 
583: /home/kdichev/LPF/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc.cpp:56: Failure
583: Expected equality of these values:
583:   nprocs
583:     Which is: 1
583:   2
583: 
583: [  FAILED  ] API.func_lpf_exec_multiple_call_single_arg_dual_proc (138 ms)
583: [----------] 1 test from API (138 ms total)
583: 
583: [----------] Global test environment tear-down
583: [==========] 1 test from 1 test suite ran. (138 ms total)
583: [  PASSED  ] 0 tests.
583: [  FAILED  ] 1 test, listed below:
583: [  FAILED  ] API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: 
583:  1 FAILED TEST
583: --------------------------------------------------------------------------
583: Primary job  terminated normally, but 1 process returned
583: a non-zero exit code. Per user-direction, the job has been aborted.
583: --------------------------------------------------------------------------
583: --------------------------------------------------------------------------
583: mpirun detected that one or more processes exited with non-zero status, thus causing
583: the job to be terminated. The first process to do so was:
583: 
583:   Process name: [[15695,1],0]
583:   Exit code:    1
583: --------------------------------------------------------------------------
583: Run command: 
583: ['/home/kdichev/LPF/build-x86/lpfrun_build', '-engine', 'hybrid', '-n', '2', '/home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug', '--gtest_filter=API.func_lpf_exec_multiple_call_single_arg_dual_proc', '--gtest_also_run_disabled_tests', '--gtest_output=xml:/home/kdichev/LPF/build-x86/junit/hybrid_func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug']
583: Test returned code = 1
583: Test /home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug--gtest_filter=API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: returned	1
583: expected return code was: 0
1/1 Test #583: hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc ...***Failed    3.04 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   3.89 sec

The following tests FAILED:
	583 - hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc (Failed)
Errors while running CTest
Output from these tests are in: /home/kdichev/LPF/build-x86/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
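Since the node was booked with `--ntasks 1`, it can help to check what Slurm actually exported before running the tests. The following is a minimal sketch, not part of LPF: the helper names `slurm_allocation` and `enough_tasks` are hypothetical, and whether `lpfrun` derives its process count from these variables is an assumption. `SLURM_NTASKS`, `SLURM_CPUS_PER_TASK` and `SLURM_JOB_NUM_NODES` are standard Slurm environment variables.

```python
import os


def slurm_allocation(env=None):
    """Return the task, CPU and node counts Slurm exported (None if unset)."""
    env = os.environ if env is None else env

    def as_int(name):
        value = env.get(name)
        return int(value) if value and value.isdigit() else None

    return {
        "ntasks": as_int("SLURM_NTASKS"),
        "cpus_per_task": as_int("SLURM_CPUS_PER_TASK"),
        "nodes": as_int("SLURM_JOB_NUM_NODES"),
    }


def enough_tasks(requested, env=None):
    """True if Slurm set no task count, or the count covers `requested`."""
    ntasks = slurm_allocation(env)["ntasks"]
    return ntasks is None or ntasks >= requested
```

Inside the allocation above, `enough_tasks(2)` would be expected to return `False`, because the job was requested with one task and 32 CPUs per task rather than two tasks.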


KADichev commented

It seems to me the lpfrun launcher is broken. Its output reports a single task and a single thread, even though the test requests 2 processes:

HYBRID SLURM:   true
HYBRID TASKS:   1
HYBRID NODES:   1
HYBRID DEFAULT PROCESSES PER NODE: node
HYBRID PROCESS MAPPING: One process per compute node
HYBRID PINNING: exact pinning enabled
ws01 process 1 of 1: THREADS 1; PIN STRATEGY none; SPINLOCK FAST
ws01 process 1 of 1: CPUMASK  
ws01 process 1 of 1: EXECUTES /home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug
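To compare this diagnostic output against the requested process count automatically, the `HYBRID <KEY>: <value>` lines can be parsed. This is a rough sketch that assumes the line format shown above is stable; the function names `parse_hybrid_diagnostics` and `launched_fewer_than` are hypothetical and not part of LPF.

```python
import re


def parse_hybrid_diagnostics(text):
    """Collect 'HYBRID <KEY>: <value>' lines into a dict keyed by <KEY>."""
    info = {}
    for line in text.splitlines():
        match = re.match(r"^HYBRID\s+(.+?):\s*(.*)$", line.strip())
        if match:
            info[match.group(1)] = match.group(2).strip()
    return info


def launched_fewer_than(text, requested):
    """True if the reported 'HYBRID TASKS' value is below `requested`."""
    tasks = parse_hybrid_diagnostics(text).get("TASKS")
    return tasks is not None and tasks.isdigit() and int(tasks) < requested
```

Applied to the output above with `requested=2` (the `-n 2` passed to `lpfrun_build`), `launched_fewer_than` returns `True`, which matches the `nprocs` value of 1 that the test observed.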

KADichev changed the title from "Under our Slurm nodes, hybrid engine does not use more than 1 process and 1 thread, and fails" to "Under our Slurm nodes, hybrid engine resets to 1 MPI process and 1 Pthread, and fails when expected > 1 processes" on Oct 28, 2024