Dot product of a complex CuArray with a real CuArray performance #668

Closed
coezmaden opened this issue Jan 20, 2021 · 3 comments


@coezmaden

Describe the bug

Taking the dot product of a complex CuArray with a real CuArray: the function that uses pre-allocated memory is slower than the one that allocates memory at runtime.

To reproduce

A minimal working example (MWE) for this bug:

using CUDA, BenchmarkTools, LinearAlgebra

N = 10000
a = CUDA.ones(Float32, N)
b = CUDA.ones(ComplexF32, N)
b_re = real.(b)
b_im = imag.(b)

function dot_complex(a::CuArray{Float32}, b::CuArray{ComplexF32})
    # promote `a` to a complex array (allocates), then a single dot
    dot(complex.(a, CUDA.zeros(length(a))), b)
end

function dot_real(a::CuArray{Float32}, b_re::CuArray{Float32}, b_im::CuArray{Float32})
    # two real dot products, no intermediate array allocations
    complex(dot(a, b_re), dot(a, b_im))
end

@btime CUDA.@sync dot_complex($a, $b)        # 60.400 μs (45 allocations: 1.02 KiB)
@btime CUDA.@sync dot_real($a, $b_re, $b_im) # 76.700 μs (17 allocations: 288 bytes)

Manifest.toml

[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "Libdl", "LinearAlgebra", "Logging", "MacroTools", "NNlib", "Pkg", "Printf", "Random", "Reexport", "Requires", "SparseArrays", "Statistics", "TimerOutputs"]
git-tree-sha1 = "39f6f584bec264ace76f924d1c8637c85617697e"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "2.4.0"

[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "f99a25fe0313121f2f9627002734c7d63b4dd3bd"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.2.0"

[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "d0d99629d6ae4a3e211ae83d8870907bd842c811"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.5.2"

Expected behavior

I would expect dot_real to be faster, since it allocates no memory at runtime.

Version info

Details on Julia:

Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Program Files\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS =

Details on CUDA:

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 456.71.0

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+451.22
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: GeForce GTX 1050 Ti (sm_61, 3.216 GiB / 4.000 GiB available)

Additional context
Originally posted as #667
Apologies for the duplicate issue.
I have reduced the problem to pure CUDA.jl without StructArrays.jl considerations.

@maleadt (Member) commented Jan 21, 2021

This is expected. Memory allocations are cached in a pool, so they can be fulfilled asynchronously. However, calling dot returns a scalar, which requires synchronizing the GPU. As a result, the complex case only needs to launch and synchronize once, while the real one performs two dot kernel launches, each followed by a synchronization. This is clear under the profiler:

[profiler trace screenshot]
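
To spell out the pattern described above (an annotated restatement of the explanation, not new API; `a`, `b`, `b_re`, `b_im` as in the MWE):

# Each `dot` returns a host scalar, so each call must synchronize:
r = dot(a, b_re)   # kernel launch, scalar copied back -> sync #1
i = dot(a, b_im)   # kernel launch, scalar copied back -> sync #2

# The complex path queues its allocation and broadcast asynchronously
# and only synchronizes on the single final `dot`:
c = dot(complex.(a, CUDA.zeros(length(a))), b)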

Note 1: running under the profiler adds some overhead, so small timings can differ. Without the profiler, I'm getting 19 vs 21 μs, so the complex case is still faster. Looking at the trace, though, I realized the memset could be asynchronous too, which makes the complex case even faster (17 μs).

Note 2: if you bump your problem size to, say, N=1000000, the real case becomes faster: 65 μs vs 107 μs. The reason is twofold: the launch overhead is now dwarfed by kernel execution time, and the time it takes to execute the broadcast kernel (complex.(...)) also becomes nontrivial.
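
For reference, re-running the MWE at that size looks like this (the timings are the ones quoted above, from this machine; yours will differ):

N = 1_000_000
a = CUDA.ones(Float32, N)
b = CUDA.ones(ComplexF32, N)
b_re = real.(b); b_im = imag.(b)

@btime CUDA.@sync dot_real($a, $b_re, $b_im)   # ~65 μs
@btime CUDA.@sync dot_complex($a, $b)          # ~107 μs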

Note 3: if you insist on the small problem size, it's better to use CUDA.@sync blocking=false.
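
For example, applied to the small-N MWE benchmarks:

@btime CUDA.@sync blocking=false dot_complex($a, $b)
@btime CUDA.@sync blocking=false dot_real($a, $b_re, $b_im)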

Note 4: you can always implement your own dot to benefit from your specific problem here. For example:

function dot_mapreduce(a::CuArray{Float32}, b_re::CuArray{Float32}, b_im::CuArray{Float32})
    # fuse both multiplications and the reduction into a single kernel
    mapreduce(+, a, b_re, b_im) do x, y, z
        complex(x*y, x*z)
    end
end

On N=1000000, that results in an execution time of 56 μs here (vs 65 μs for the real case and 107 μs for the complex case). Again, the dynamics aren't straightforward, but in general this approach avoids both the (costly) broadcast to convert the input and multiple (synchronizing) invocations of dot:

[profiler trace screenshot]
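
Usage mirrors the earlier benchmarks (the timing is the 56 μs figure quoted above; hardware-dependent):

@btime CUDA.@sync dot_mapreduce($a, $b_re, $b_im)   # ~56 μs at N = 1_000_000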

maleadt removed the "bug: Something isn't working" label on Jan 21, 2021
@kshyatt (Contributor) commented Feb 6, 2025

This should be fixed once #2616 is in, right?

@maleadt (Member) commented Feb 7, 2025

I don't think so; we still synchronize before returning the scalar. It's up to the user to call the asynchronous version of that API, which #2616 will provide, but which was already possible with the suggestions I posted above. So I think we can close this.
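
(One way to keep the result asynchronous today, as a sketch of the general idea rather than the #2616 API: reduce into device memory instead of returning a host scalar. The function name below is illustrative.)

# Sketch: `sum(...; dims=1)` returns a 1-element CuArray rather than a
# host scalar, so no device-to-host copy (and hence no synchronization)
# is forced; materialize the value later, when it is actually needed.
function dot_device(a::CuArray{Float32}, b::CuArray{ComplexF32})
    sum(a .* b; dims=1)
end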

maleadt closed this as completed on Feb 7, 2025.