Poor parallel scaling efficiency due to MPI_gather_all #4
Comments
Your writeup seems to focus on the number of walkers, but that is irrelevant unless it's less than the number of MPI tasks. More walkers just means more iterations, not more work per iteration, and it's the latter that's parallelized. The main effect on parallel efficiency is that the length of the walk each MPI task does at each iteration is [...]. I couldn't find in your writeup what your number_of_model_calls_expected was, so I can't tell if your parallel scaling is expected or not. Other points:
Thanks for the fast reply, again. Ah OK, I'd assumed that the parallelism was over walkers, as the number of walkers needs to be an integer multiple of the number of cores. But thinking about the nested sampling algorithm, it makes sense that walkers are not parallelised, as only one is considered at a time. Why am I seeing improved scaling as the number of walkers is increased in the write-up I posted? A gather to root followed by a broadcast of the energies would probably improve things. There are places in the code where an allgather is used on one variable and then on the next; this will usually be slower than merging the data and doing a single allgather call. I'll check my input file when I'm back at a computer.
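To make the point about merging gathers concrete, here is a minimal sketch based on a simple latency/bandwidth (alpha-beta) cost model of a ring allgather. The model itself, the function names `ring_allgather_time` and `separate_vs_merged`, and the default latency and bandwidth figures are all assumptions chosen for illustration. They are not measurements from Archer and nothing here is taken from the pymatnest code.

```python
def ring_allgather_time(n_tasks, bytes_per_task, latency_s=1e-6, bandwidth_Bps=1e9):
    """Estimate the time of one ring allgather under an alpha-beta model.

    Assumption: the allgather takes (n_tasks - 1) steps, and each step
    costs one message latency plus the transfer time of one task's
    contribution. The default latency and bandwidth are made-up values.
    """
    if n_tasks < 2:
        return 0.0
    steps = n_tasks - 1
    return steps * (latency_s + bytes_per_task / bandwidth_Bps)


def separate_vs_merged(n_tasks, bytes_a, bytes_b, **model_params):
    """Compare two back-to-back allgathers with one allgather of packed data."""
    separate = (ring_allgather_time(n_tasks, bytes_a, **model_params)
                + ring_allgather_time(n_tasks, bytes_b, **model_params))
    merged = ring_allgather_time(n_tasks, bytes_a + bytes_b, **model_params)
    return separate, merged


if __name__ == "__main__":
    # Two 8-byte values per task (e.g. two doubles), at a few task counts.
    for n_tasks in (24, 48, 96):
        separate, merged = separate_vs_merged(n_tasks, 8, 8)
        print(f"{n_tasks:3d} tasks: separate {separate * 1e6:.2f} us, "
              f"merged {merged * 1e6:.2f} us")
```

Under this model the merged call moves the same number of bytes but pays the per-step latency only once, so the saving grows with the number of tasks. Real MPI libraries pick different allgather algorithms depending on message size and task count, so this only indicates the direction of the effect, not its size.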
I'm only aware of a single allgather that happens every iteration, and that's the maximum energy calculation. There are sendrecvs that are used to send cloned configurations around, but I think that's unavoidable given the current architecture (the alternative would be lots of many-to-one between nodes and some sort of root process, which isn't likely to be efficient either). There are also allgathers associated with infrequent things like saving snapshots or trajectories, but unless you prove otherwise I'm going to claim that they don't happen often enough to matter.
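For readers following along, the two communication patterns under discussion for the maximum energy calculation can be sketched in a single process, with a plain list standing in for the MPI tasks. This is not the pymatnest implementation: the function names are hypothetical, no MPI calls are made, and returning the rank that owns the maximum is an assumption about what the calculation needs.

```python
def simulated_allgather(local_values):
    """Single-process stand-in for an allgather: every rank gets the full list."""
    gathered = list(local_values)
    return [list(gathered) for _ in local_values]


def max_energy_via_allgather(local_max_energies):
    """Each rank receives all local maxima and finds the global one itself."""
    results = []
    for view in simulated_allgather(local_max_energies):
        global_max = max(view)
        results.append((global_max, view.index(global_max)))
    return results


def max_energy_via_gather_bcast(local_max_energies):
    """A root collects the local maxima, reduces them, and broadcasts the result."""
    at_root = list(local_max_energies)              # gather: only the root holds this
    global_max = max(at_root)
    answer = (global_max, at_root.index(global_max))
    return [answer for _ in local_max_energies]     # bcast: every rank gets the answer


if __name__ == "__main__":
    energies = [-3.2, -1.5, -2.7, -1.9]  # hypothetical per-rank maximum energies
    print(max_energy_via_allgather(energies))
    print(max_energy_via_gather_bcast(energies))
```

Both patterns leave every rank with the same answer. The difference is only in how much data each rank receives and how many communication steps are needed, which is what would have to be measured on the actual machine.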
Per e-mail conversation with Gabor I'm posting about this issue here. Basically the parallel scaling of pymatnest is relatively poor, and performance drops off at a relatively low number of CPU cores.
Taking Archer as an example (Archer is a Cray machine very similar to Titan in the US) with 1152 walkers I see a drop-off in parallel scaling after just 12 cores (1/2 a node), while with 11520 walkers I can only scale up to 48 cores. From looking at the code it seems very likely the problem lies with over-use of the MPI_gather_all routine, as this causes a lot of congestion between nodes (on Archer each node is 24 cores). Gabor informed me that he had trouble going beyond 96 cores (4 nodes).
I've posted my (brief) results from my tests on Archer here, with some discussion of the cause (see the pure MPI_gather_all test towards the end):
https://gist.github.com/erlendd/c236f393ed597187c612599cb472cd4b