Dot product of a complex CuArray with a real CuArray performance #668
This is expected. Memory allocations are cached in a pool, so they can be fulfilled asynchronously. However, calling …

Note 1: running under the profiler adds some overhead, so small timings can be different. Without the profiler, I'm getting 19 vs 21 us, so the complex case is still faster. Looking at the trace though, I realized the …

Note 2: if you bump your problem size to, say, N=1000000, the real case becomes faster: 65us vs 107us. The reason is twofold: the launch overhead is now dwarfed by kernel execution time, and the time it takes to execute the …

Note 3: if you insist on the small problem size, it's better to use …

Note 4: you can always implement your own:

```julia
function dot_mapreduce(a::CuArray{Float32}, b_re::CuArray{Float32}, b_im::CuArray{Float32})
    mapreduce(+, a, b_re, b_im) do x, y, z
        complex(x*y, x*z)
    end
end
```

On N=1000000, that results in an execution time of 56us here (vs 65us for the real case, 107us for the complex case). Again, the dynamics here aren't straightforward, but in general this approach avoids both a (costly) …
This should be fixed once #2616 is in, right?
I don't think so: we still synchronize before returning the scalar. It's up to the user to call the asynchronous version of that API, which we'll provide in #2616, but which was also possible already with the suggestions I posted above. So I think we can close this.
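To illustrate the point about synchronization: a scalar-returning reduction like `dot` must copy its result back to the host, which synchronizes. A hedged sketch of the asynchronous alternative hinted at above (not the #2616 API; assumes CUDA.jl's multi-array `mapreduce` with `dims` and `CUDA.@allowscalar`):

```julia
using CUDA

N = 1_000_000
a = CUDA.rand(Float32, N)
b = CUDA.rand(ComplexF32, N)

# dims=1 keeps the reduction result on the device as a 1-element CuArray
# instead of copying a scalar back to the host, so this call does not
# synchronize with the GPU.
partial = mapreduce((x, y) -> x * y, +, a, b; dims=1)

# ... more GPU work can be enqueued here, still asynchronously ...

# Only this scalar read forces synchronization.
result = CUDA.@allowscalar partial[1]
```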
Describe the bug
Dot product of a complex CuArray with a real CuArray: the function using pre-allocated memory is slower than the one that allocates memory at runtime.
To reproduce
The Minimal Working Example (MWE) for this bug:
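The MWE itself was not captured in this extract. A hypothetical reconstruction consistent with the rest of the report (all function names, array names, and sizes are assumptions, except `dot_real`, which is mentioned below):

```julia
using CUDA, LinearAlgebra, BenchmarkTools

N = 10_000
a    = CUDA.rand(Float32, N)       # real vector
b    = CUDA.rand(ComplexF32, N)    # complex vector
b_re = real.(b)                    # pre-allocated real part
b_im = imag.(b)                    # pre-allocated imaginary part

# Allocating version: dot of a real with a complex CuArray.
dot_complex(a, b) = dot(a, b)

# "Pre-allocated" version: two real dots, no complex temporaries,
# but two kernel launches and two synchronizing scalar copies.
dot_real(a, b_re, b_im) = complex(dot(a, b_re), dot(a, b_im))

@btime CUDA.@sync dot_complex($a, $b)
@btime CUDA.@sync dot_real($a, $b_re, $b_im)
```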
Manifest.toml
Expected behavior
I would expect dot_real to be faster, since no memory is allocated at runtime.
Version info
Details on Julia:
Details on CUDA:
Additional context
Originally posted as #667
Please excuse the double issue.
I have reduced the problem to pure CUDA.jl, without any StructArrays.jl considerations.