WIP: async GPU force/torque back-transfer, launch kernels earlier #2637
base: python
Conversation
@@ -336,26 +326,45 @@ void copy_part_data_to_gpu(ParticleRange particles) {
  }
}

std::unique_ptr<PinnedVectorHost<float>> particle_forces_host{
I really don't like this construction. In a way this is worse than manual memory management, because it pretends to be RAII, but really isn't. As far as I can see there is no reason for these buffers to exist statically; couldn't you just create them, e.g., when the integration starts, and release them after? Depending on the cost of the allocation, it may even be feasible to create them in the force calculation.
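A minimal sketch of that suggestion, assuming the thrust pinned allocator from the diff below; `integrate`, `n_steps`, and `n_particles` are illustrative names, not the actual ESPResSo interface:

```cpp
#include <cstddef>

#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

// Buffers are created when the integration starts and destroyed when it
// returns; the pinned pages are released while the CUDA runtime is still
// guaranteed to be loaded, so no exit handler is needed.
void integrate(int n_steps, std::size_t n_particles) {
  thrust::host_vector<
      float, thrust::system::cuda::experimental::pinned_allocator<float>>
      forces_host(3 * n_particles), torques_host(3 * n_particles);

  for (int step = 0; step < n_steps; ++step) {
    // ... launch GPU kernels and copy forces/torques back asynchronously ...
  }
} // forces_host / torques_host (and their pinned pages) are released here
```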
#include "thrust/host_vector.h" | ||
#include "thrust/system/cuda/experimental/pinned_allocator.h" | ||
template <class T> | ||
using PinnedVectorHost = thrust::host_vector< |
This should be called PinnedHostVector to keep it closer to thrust's name.
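For reference, a sketch of the complete alias with the suggested name; the trailing template arguments are cut off in the quoted hunk above, so the second argument being the pinned allocator is an assumption based on the includes:

```cpp
#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

// Host vector whose storage is page-locked ("pinned"), which allows
// asynchronous device-to-host copies into it.
template <class T>
using PinnedHostVector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;
```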
It is unfortunate that one cannot convince Python to actually call the destructor of the system class. For 1k particles, the allocation takes approximately as long as an MD time step with LJ at a 0.1 volume fraction (200 µs). It would probably surprise users if […]. The inner-most place one could move the vectors to is […].
Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
Also you seem to have timed the allocation, could you please share your results and how you have measured that? It's not clear to me why a single allocation (or two) would be so expensive; my understanding is that the pinned allocation is just a normal allocation followed by a `mlock` call. This should not be expensive (if you are not out of RAM).
On Tue, Apr 02, 2019 at 04:07:08AM -0700, Florian Weik wrote:
> Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
For the evaluation of observables, the main integration function, integrate_vv(), is interrupted, because ES switches to master-slave mode.
If there are auto-update accumulators, integrate_vv is called n times with 1 step.
On Tue, Apr 02, 2019 at 04:16:05AM -0700, Florian Weik wrote:
> Also you seem to have timed the allocation, could you please share your results and how you have measured that? It's not clear to me why a single allocation (or two) would be so expensive; my understanding is that the pinned allocation is just a normal allocation followed by a `mlock` call. This should not be expensive (if you are not out of RAM).
I used nvprof.
Percentages are with regard to the total time of all CUDA API calls on the host.
Time(%)  Time      Calls  Avg       Min       Max       Name
0.14%    471.65us  2      235.83us  8.4520us  463.20us  cudaMallocHost
0.09%    321.79us  2      160.89us  22.721us  299.07us  cudaFreeHost
So the total time for allocation and de-allocation is roughly 0.8 ms.
One time step (without alloc/free) takes 0.28 ms (with LB, without LJ).
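For completeness, a small standalone sketch (not from the PR) of how such numbers could be cross-checked without nvprof, by timing `cudaMallocHost`/`cudaFreeHost` directly:

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>

#include <cuda_runtime.h>

int main() {
  const std::size_t bytes = 3 * 1000 * sizeof(float); // roughly 1k particles

  cudaFree(nullptr); // force context creation so it is not counted below

  void *ptr = nullptr;
  auto t0 = std::chrono::steady_clock::now();
  cudaMallocHost(&ptr, bytes); // pinned (page-locked) host allocation
  auto t1 = std::chrono::steady_clock::now();
  cudaFreeHost(ptr);
  auto t2 = std::chrono::steady_clock::now();

  using us = std::chrono::microseconds;
  std::cout << "cudaMallocHost: "
            << std::chrono::duration_cast<us>(t1 - t0).count() << " us, "
            << "cudaFreeHost: "
            << std::chrono::duration_cast<us>(t2 - t1).count() << " us\n";
}
```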
Again: Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
On Tue, Apr 02, 2019 at 06:43:22AM -0700, Florian Weik wrote:
> Again: Can you please explain in more detail what the lifetime issue is? It's not totally clear to me.
I'm not sure I understand the question.
If you are referring to where the need for the exit handler arises:
The thrust allocator used in the vectors calls cudaFree at destruction to release the pinned memory.
That fails if the CUDA runtime has been unloaded already.
AFAIK, we cannot control the order in which shared objects are unloaded, so CUDA might be unloaded before libEspressoCore.
Then, the cudaFree() call releasing the vectors' memory throws an exception.
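To illustrate the point, a hedged sketch (not the PR's actual exit handler) of the workaround: empty the global vectors explicitly, e.g. from the Python-level exit handler, while the CUDA runtime is still loaded, so nothing is left for the static destructors to free.

```cpp
#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

template <class T>
using PinnedHostVector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;

// Globals as in the PR; names are illustrative.
static PinnedHostVector<float> particle_forces_host;
static PinnedHostVector<float> particle_torques_host;

// Called from the exit handler before the CUDA runtime can be unloaded.
// Swapping with empty temporaries releases the pinned pages immediately,
// so the static destructors that run later have nothing left to free.
void release_pinned_buffers() {
  PinnedHostVector<float>().swap(particle_forces_host);
  PinnedHostVector<float>().swap(particle_torques_host);
}
```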
Wouldn't you have the same problem with […]?
It is not clear to me.
It seems to me that I forgot linking to cudart. From:

```cmake
function(add_gpu_library)
  cuda_add_library(${ARGV})
  set_property(TARGET ${ARGV0} PROPERTY CUDA_SEPARABLE_COMPILATION ON)
  target_link_libraries(${ARGV0} PRIVATE ${CUDA_CUFFT_LIBRARIES})
endfunction()
```

Could you please add it (like for the clang case) and check if this improves matters? (I probably thought that […])
So libcudart was linked statically already. Linking it dynamically did not help. It is possible to avoid the exit handler by storing the vectors containing the pinned memory in a wrapper class which explicitly creates a CUDA context and releases it after destructing the vectors. This solution still needs work, e.g., for switching GPUs. It also needs to be moved to / integrated with the CUDA init and device-switching code.
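A rough sketch of what such a wrapper could look like; this is an assumption-laden illustration (it uses the driver API's primary-context functions to pin the context lifetime, and it does not handle device switching), not the code from this PR.

```cpp
#include <cuda.h>

#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

template <class T>
using PinnedHostVector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;

// Holds the pinned transfer buffers together with a retained CUDA context,
// so the context is guaranteed to outlive the pinned allocations.
class PinnedTransferBuffers {
public:
  explicit PinnedTransferBuffers(int device = 0) : m_device(device) {
    cuInit(0);
    cuDevicePrimaryCtxRetain(&m_context, m_device);
  }
  ~PinnedTransferBuffers() {
    // Free the pinned memory first, while the retained context is alive ...
    PinnedHostVector<float>().swap(forces);
    PinnedHostVector<float>().swap(torques);
    // ... then release the context.
    cuDevicePrimaryCtxRelease(m_device);
  }

  PinnedHostVector<float> forces;
  PinnedHostVector<float> torques;

private:
  CUdevice m_device;
  CUcontext m_context = nullptr;
};
```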
Ok, I think we can go forward with this. In general I think we should seek to remove all the non-trivial globals sooner rather than later. We've had multiple issues with them.
@RudolfWeeber it seems like there is a file missing here (where the feature detection in Python went), could you please add that? I am currently working on the async forward communication and want to integrate this, but it is not working.
@RudolfWeeber are you still looking into this?
> @RudolfWeeber are you still looking into this?
IIRC, you said you wanted to integrate this into the non-blocking MPI-node -> GPU communication.
If that’s not the case, I will look into it once I’m back from Paris.
It also depends on the decision we take with regard to LBGPU.
Even if we don’t do the full thing, the early starting of the GPU methods could be merged independently.
It seems to me that you forgot to check in some files.
> It seems to me that you forgot to check in some files.
Even including commit afded8?
What is still missing?
Ah, maybe there is just a guard missing. I'll have a look.
@fweik is this obsolete by now?
No. Let's keep this open for now.
This uses the thrust pinned memory allocator for the vectors receiving GPU forces/torques on the host.
Then, the back-transfer can be asynchronous.
The host force/torque vectors have to remain globals, because the allocation of pinned memory takes at least as long as the data transfer itself.
A Python-level exit handler makes sure the vectors are de-allocated before CUDA is unloaded (because the custom allocator calls cudaFree()).
The performance benefits apply mostly to dense systems and probably also to systems with long-range interactions.
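A minimal sketch of the back-transfer pattern this description refers to, with illustrative names and signatures rather than the actual ESPResSo GPU interface: because the destination is pinned host memory, the device-to-host copy can be enqueued asynchronously and the host only waits when the forces are actually consumed.

```cpp
#include <cstddef>

#include <cuda_runtime.h>

// Assumed to enqueue the force/torque kernels on the given stream.
void launch_force_kernels(const float *positions_device, float *forces_device,
                          std::size_t n_particles, cudaStream_t stream);

void compute_and_gather_forces(const float *positions_device,
                               float *forces_device,
                               float *forces_host_pinned, // page-locked buffer
                               std::size_t n_particles, cudaStream_t stream) {
  launch_force_kernels(positions_device, forces_device, n_particles, stream);

  // Returns immediately; the copy overlaps with whatever the host does next.
  cudaMemcpyAsync(forces_host_pinned, forces_device,
                  3 * n_particles * sizeof(float), cudaMemcpyDeviceToHost,
                  stream);

  // ... CPU-side work (e.g. short-range forces) can run here ...

  // Block only at the point where the GPU forces are actually needed.
  cudaStreamSynchronize(stream);
}
```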