WIP: async GPU force/torque back-transfer, launch kernels earlier #2637
base: python
Conversation
@@ -336,26 +326,45 @@ void copy_part_data_to_gpu(ParticleRange particles) {
  }
}

std::unique_ptr<PinnedVectorHost<float>> particle_forces_host{
I really don't like this construction. In a way this is worse than manual memory management, because it pretends to be RAII, but really isn't. As far as I can see there is no reason for these buffers to exist statically; couldn't you just create them, e.g., when the integration starts, and release them after? Depending on the cost of the allocation, it may even be feasible to create them in the force calculation.
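A minimal sketch of that suggestion, assuming the thrust pinned allocator from the diff below; `integrate`, `n_steps`, and `n_particles` are illustrative names, not the actual ESPResSo interface:

```cpp
#include <cstddef>

#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

// Buffers are created when the integration starts and destroyed when it
// returns; the pinned pages are released while the CUDA runtime is still
// guaranteed to be loaded, so no exit handler is needed.
void integrate(int n_steps, std::size_t n_particles) {
  thrust::host_vector<
      float, thrust::system::cuda::experimental::pinned_allocator<float>>
      forces_host(3 * n_particles), torques_host(3 * n_particles);

  for (int step = 0; step < n_steps; ++step) {
    // ... launch GPU kernels and copy forces/torques back asynchronously ...
  }
} // forces_host / torques_host (and their pinned pages) are released here
```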
#include "thrust/host_vector.h" | ||
#include "thrust/system/cuda/experimental/pinned_allocator.h" | ||
template <class T> | ||
using PinnedVectorHost = thrust::host_vector< |
This should be called PinnedHostVector to keep it closer to thrust's name.
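For reference, a sketch of the complete alias with the suggested name; the trailing template arguments are cut off in the quoted hunk above, so the second argument being the pinned allocator is an assumption based on the includes:

```cpp
#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

// Host vector whose storage is page-locked ("pinned"), which allows
// asynchronous device-to-host copies into it.
template <class T>
using PinnedHostVector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;
```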
It is unfortunate that one cannot convince Python to actually call the destructor of the system class. For 1k particles, the allocation takes approximately as long as an MD time step with LJ at a 0.1 volume fraction (200 µs). It would probably surprise users if […]. The inner-most place one could move the vectors to is […].
Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
Also you seem to have timed the allocation, could you please share your results and how you have measured that? It's not clear to me why a single allocation (or two) would be so expensive; my understanding is that the pinned allocation is just a normal allocation followed by a `mlock` call. This should not be expensive (if you are not out of RAM).
On Tue, Apr 02, 2019 at 04:07:08AM -0700, Florian Weik wrote:
> Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
For the evaluation of observables, the main integration function, integrate_vv(), is interrupted, because ES switches to master-slave mode.
If there are auto-update accumulators, integrate_vv is called n times with 1 step.
On Tue, Apr 02, 2019 at 04:16:05AM -0700, Florian Weik wrote:
> Also you seem to have timed the allocation, could you please share your results and how you have measured that? It's not clear to me why a single allocation (or two) would be so expensive; my understanding is that the pinned allocation is just a normal allocation followed by a `mlock` call. This should not be expensive (if you are not out of RAM).
I used nvprof.
Percentages are with regard to the total time of all CUDA API calls on the host.
Time(%)  Time      Calls  Avg       Min       Max       Name
0.14%    471.65us  2      235.83us  8.4520us  463.20us  cudaMallocHost
0.09%    321.79us  2      160.89us  22.721us  299.07us  cudaFreeHost
So the total time for allocation and de-allocation is roughly 0.8 ms.
One time step (without alloc/free) takes 0.28 ms (with LB, without LJ).
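For completeness, a small standalone sketch (not from the PR) of how such numbers could be cross-checked without nvprof, by timing `cudaMallocHost`/`cudaFreeHost` directly:

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>

#include <cuda_runtime.h>

int main() {
  const std::size_t bytes = 3 * 1000 * sizeof(float); // roughly 1k particles

  cudaFree(nullptr); // force context creation so it is not counted below

  void *ptr = nullptr;
  auto t0 = std::chrono::steady_clock::now();
  cudaMallocHost(&ptr, bytes); // pinned (page-locked) host allocation
  auto t1 = std::chrono::steady_clock::now();
  cudaFreeHost(ptr);
  auto t2 = std::chrono::steady_clock::now();

  using us = std::chrono::microseconds;
  std::cout << "cudaMallocHost: "
            << std::chrono::duration_cast<us>(t1 - t0).count() << " us, "
            << "cudaFreeHost: "
            << std::chrono::duration_cast<us>(t2 - t1).count() << " us\n";
}
```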
Again: Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
On Tue, Apr 02, 2019 at 06:43:22AM -0700, Florian Weik wrote:
> Again: Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
I'm not sure I understand the question.
If you are referring to where the need for the exit handler arises:
The thrust allocator used in the vectors calls cudaFree at destruction to release the pinned memory.
That fails if the CUDA runtime has been unloaded already.
AFAIK, we cannot control the order in which shared objects are unloaded, so CUDA might be unloaded before libEspressoCore.
Then, the cudaFree() call releasing the vectors' memory throws an exception.
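To illustrate the point, a hedged sketch (not the PR's actual exit handler) of the workaround: empty the global vectors explicitly, e.g. from the Python-level exit handler, while the CUDA runtime is still loaded, so nothing is left for the static destructors to free.

```cpp
#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

template <class T>
using PinnedHostVector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;

// Globals as in the PR; names are illustrative.
static PinnedHostVector<float> particle_forces_host;
static PinnedHostVector<float> particle_torques_host;

// Called from the exit handler before the CUDA runtime can be unloaded.
// Swapping with empty temporaries releases the pinned pages immediately,
// so the static destructors that run later have nothing left to free.
void release_pinned_buffers() {
  PinnedHostVector<float>().swap(particle_forces_host);
  PinnedHostVector<float>().swap(particle_torques_host);
}
```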
Wouldn't you have the same problem with […]?
It is not clear to me.
It seems to me that I forgot linking to cudart. From:

```cmake
function(add_gpu_library)
  cuda_add_library(${ARGV})
  set_property(TARGET ${ARGV0} PROPERTY CUDA_SEPARABLE_COMPILATION ON)
  target_link_libraries(${ARGV0} PRIVATE ${CUDA_CUFFT_LIBRARIES})
endfunction()
```

Could you please add it (like for the clang case) and check if this improves matters? (I probably thought that […])
So libcudart was linked statically already. Linking it dynamically did not help. It is possible to avoid the exit handler by storing the vectors containing the pinned memory in a wrapper class which explicitly creates a CUDA context and releases it after destructing the vectors. This solution still needs work, e.g., for switching GPUs. It also needs to be moved to / integrated with the CUDA init and device-switching code.
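A rough sketch of what such a wrapper could look like; this is an assumption-laden illustration (it uses the driver API's primary-context functions to pin the context lifetime, and it does not handle device switching), not the code from this PR.

```cpp
#include <cuda.h>

#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

template <class T>
using PinnedHostVector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;

// Holds the pinned transfer buffers together with a retained CUDA context,
// so the context is guaranteed to outlive the pinned allocations.
class PinnedTransferBuffers {
public:
  explicit PinnedTransferBuffers(int device = 0) : m_device(device) {
    cuInit(0);
    cuDevicePrimaryCtxRetain(&m_context, m_device);
  }
  ~PinnedTransferBuffers() {
    // Free the pinned memory first, while the retained context is alive ...
    PinnedHostVector<float>().swap(forces);
    PinnedHostVector<float>().swap(torques);
    // ... then release the context.
    cuDevicePrimaryCtxRelease(m_device);
  }

  PinnedHostVector<float> forces;
  PinnedHostVector<float> torques;

private:
  CUdevice m_device;
  CUcontext m_context = nullptr;
};
```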
Ok, I think we can go forward with this. In general I think we should seek to remove all the non-trivial globals sooner rather than later. We've had multiple issues with them.
@RudolfWeeber it seems like there is a file missing here (where the feature detection in Python went), could you please add that? I am currently working on the async forward communication and want to integrate this, but it is not working.
@RudolfWeeber are you still looking into this?
> @RudolfWeeber are you still looking into this?
IIRC, you said you wanted to integrate this into the non-blocking MPI-node -> GPU communication.
If that’s not the case, I will look into it once I’m back from Paris.
It also depends on the decision we take with regard to LBGPU.
Even if we don’t do the full thing, the early starting of the GPU methods could be merged independently.
It seems to me that you forgot to check in some files.
> It seems to me that you forgot to check in some files.
Even including commit afded8?
What is still missing?
Ah, maybe there is just a guard missing. I'll have a look.
@fweik is this obsolete by now?
No. Let's keep this open for now.
This uses the thrust pinned memory allocator for the vectors receiving GPU forces/torques on the host.
Then, the back-transfer can be asynchronous.
The host force/torque vectors have to remain globals, because the allocation of pinned memory takes at least as long as the data transfer itself.
A Python-level exit handler makes sure the vectors are de-allocated before CUDA is unloaded (because the custom allocator calls cudaFree()).
The performance benefits apply mostly to dense systems and probably also to systems with long-range interactions.
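A minimal sketch of the back-transfer pattern this description refers to, with illustrative names and signatures rather than the actual ESPResSo GPU interface: because the destination is pinned host memory, the device-to-host copy can be enqueued asynchronously and the host only waits when the forces are actually consumed.

```cpp
#include <cstddef>

#include <cuda_runtime.h>

// Assumed to enqueue the force/torque kernels on the given stream.
void launch_force_kernels(const float *positions_device, float *forces_device,
                          std::size_t n_particles, cudaStream_t stream);

void compute_and_gather_forces(const float *positions_device,
                               float *forces_device,
                               float *forces_host_pinned, // page-locked buffer
                               std::size_t n_particles, cudaStream_t stream) {
  launch_force_kernels(positions_device, forces_device, n_particles, stream);

  // Returns immediately; the copy overlaps with whatever the host does next.
  cudaMemcpyAsync(forces_host_pinned, forces_device,
                  3 * n_particles * sizeof(float), cudaMemcpyDeviceToHost,
                  stream);

  // ... CPU-side work (e.g. short-range forces) can run here ...

  // Block only at the point where the GPU forces are actually needed.
  cudaStreamSynchronize(stream);
}
```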