
POT3D CPU version fails #15

Closed
khayamgondal opened this issue Feb 21, 2025 · 2 comments

Comments

@khayamgondal

khayamgondal commented Feb 21, 2025

Hello, I am trying to run POT3D as mpiexec --bind-to core -np 72 --allow-run-as-root ./pot3d_cpu, but the CPU version of the code fails with the following error:

WARNING: Open MPI tried to bind a process but failed.  This is a
warning only; your job will continue, though performance may
be degraded.

  Local host:        d2d9328e364d
  Application name:  ./pot3d_cpu
  Error message:     failed to bind memory
  Location:          ../../../../../orte/mca/rtc/hwloc/rtc_hwloc.c:447

--------------------------------------------------------------------------
[1740172275.508283] [d2d9328e364d:80   :0]        mm_iface.c:821  UCX  ERROR mm_iface failed to allocate receive FIFO
[d2d9328e364d:00080] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:309  Error: Failed to create UCP worker
[d2d9328e364d:00080] [[27290,1],15] selected pml ob1, but peer [[27290,1],0] on d2d9328e364d selected pml ucx
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[d2d9328e364d:00080] *** An error occurred in MPI_Init_thread
[d2d9328e364d:00080] *** reported by process [1788477441,15]
[d2d9328e364d:00080] *** on a NULL communicator
[d2d9328e364d:00080] *** Unknown error
[d2d9328e364d:00080] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[d2d9328e364d:00080] ***    and potentially your MPI job)
[d2d9328e364d:00041] 71 more processes have sent help message help-orte-odls-default.txt / memory not bound
[d2d9328e364d:00041] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
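
For reference, the diagnostics suggested in the help text above can be run roughly as follows (a sketch assuming a standard Open MPI install; the exact flags may differ for your build):

  # list the BTL/MTL/PML components available in this Open MPI build
  ompi_info | grep -i -E 'btl|mtl|pml'

  # force all ranks onto the same PML (ob1) instead of UCX, and skip memory binding
  mpiexec --mca pml ob1 --bind-to none -np 72 ./pot3d_cpu

  # show which communication plugins were considered and/or discarded
  mpiexec --mca btl_base_verbose 100 -np 72 ./pot3d_cpu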

Content of the pot3d.dat file:

 &topology
  nr=353
  nt=952
  np=2377
 /
 &inputvars
  ifprec=2
  option='ss'
  r1=2.5
  rfrac=0.0,1.0
  drratio=2.5
  nfrmesh=0
  tfrac=0.00
  dtratio=1.0
  nftmesh=0
  pfrac=0.00
  dpratio=1.0
  nfpmesh=0
  phishift=0.
  br0file='br_input_medium.h5'
  phifile=''
  brfile=''
  btfile=''
  bpfile=''
  br_photo_file=''
  ncghist=100
  ncgmax=1000000
  epscg=1.e-3
  idebug=0
 /


@sumseq
Collaborator

sumseq commented Feb 21, 2025

Hi,

This appears to be an issue with your MPI library combined with your system, not with POT3D itself.

You are running with 72 ranks - do you have 72 cores?

Also, why are you running as root?

What if you run it with: mpiexec -np 1 ./pot3d_cpu?
Does that work?

If so, then you can try with "np" set to the number of CPU cores on your machine.
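
For example (a rough sketch; note that nproc reports logical CPUs, which may be more than the number of physical cores):

  # sanity check with a single rank first
  mpiexec -np 1 ./pot3d_cpu

  # then scale up to the core count of the machine
  mpiexec -np $(nproc) ./pot3d_cpu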

I do not recommend running the code in hybrid mode (MPI ranks + CPU threads), as pure MPI is typically faster than hybrid mode.

-- Ron

@khayamgondal
Author

Thanks Ron, yes, for this particular system I have 72 cores. I think the grid size was causing the crash; after reducing the grid size it is working fine now.
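
(For context, a rough back-of-the-envelope estimate assuming 8-byte reals: the original nr x nt x np = 353 x 952 x 2377 grid is about 8.0e8 points, i.e. roughly 6.4 GB per full-grid array, and the solver keeps several such arrays in memory plus per-rank MPI buffers, which fits the "failed to bind memory" and UCX FIFO allocation errors above.)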
