Dot product of a complex CuArray with a real CuArray performance #668

Closed
coezmaden opened this issue Jan 20, 2021 · 3 comments


@coezmaden

Describe the bug

Taking the dot product of a complex CuArray with a real CuArray: the function that uses pre-allocated memory is slower than the one that allocates memory at runtime.

To reproduce

A minimal working example (MWE) for this bug:

using CUDA, BenchmarkTools, LinearAlgebra

N = 10000
a = CUDA.ones(Float32, N)
b = CUDA.ones(ComplexF32, N)
b_re = real.(b)
b_im = imag.(b)

function dot_complex(a::CuArray{Float32}, b::CuArray{ComplexF32})
    # promote `a` to a complex array (allocates), then a single dot
    dot(complex.(a, CUDA.zeros(length(a))), b)
end

function dot_real(a::CuArray{Float32}, b_re::CuArray{Float32}, b_im::CuArray{Float32})
    # two real dot products, no intermediate array allocations
    complex(dot(a, b_re), dot(a, b_im))
end

@btime CUDA.@sync dot_complex($a, $b)        # 60.400 μs (45 allocations: 1.02 KiB)
@btime CUDA.@sync dot_real($a, $b_re, $b_im) # 76.700 μs (17 allocations: 288 bytes)

Manifest.toml

[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "Libdl", "LinearAlgebra", "Logging", "MacroTools", "NNlib", "Pkg", "Printf", "Random", "Reexport", "Requires", "SparseArrays", "Statistics", "TimerOutputs"]
git-tree-sha1 = "39f6f584bec264ace76f924d1c8637c85617697e"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "2.4.0"

[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "f99a25fe0313121f2f9627002734c7d63b4dd3bd"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.2.0"

[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "d0d99629d6ae4a3e211ae83d8870907bd842c811"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.5.2"

Expected behavior

I would expect dot_real to be faster, since it allocates no memory at runtime.

Version info

Details on Julia:

Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Program Files\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS =

Details on CUDA:

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 456.71.0

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+451.22
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: GeForce GTX 1050 Ti (sm_61, 3.216 GiB / 4.000 GiB available)

Additional context
Originally posted as #667
Apologies for the duplicate issue.
I have reduced the problem to pure CUDA.jl without StructArrays.jl considerations.

@maleadt (Member) commented Jan 21, 2021

This is expected. Memory allocations are cached in a pool, so they can be fulfilled asynchronously. However, calling dot returns a scalar, which requires synchronizing the GPU. As a result, the complex case only needs to launch and synchronize once, while the real one performs two dot kernel launches, each followed by a synchronization. This is clear under the profiler:

[profiler trace screenshot]
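
To spell out the pattern described above (an annotated restatement of the explanation, not new API; `a`, `b`, `b_re`, `b_im` as in the MWE):

# Each `dot` returns a host scalar, so each call must synchronize:
r = dot(a, b_re)   # kernel launch, scalar copied back -> sync #1
i = dot(a, b_im)   # kernel launch, scalar copied back -> sync #2

# The complex path queues its allocation and broadcast asynchronously
# and only synchronizes on the single final `dot`:
c = dot(complex.(a, CUDA.zeros(length(a))), b)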

Note 1: running under the profiler adds some overhead, so small timings can differ. Without the profiler, I'm getting 19 vs 21 μs, so the complex case is still faster. Looking at the trace, though, I realized the memset could be asynchronous too, which makes the complex case even faster (17 μs).

Note 2: if you bump your problem size to, say, N=1000000, the real case becomes faster: 65 μs vs 107 μs. The reason is twofold: the launch overhead is now dwarfed by kernel execution time, and the time it takes to execute the broadcast kernel (complex.(...)) also becomes nontrivial.
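
For reference, re-running the MWE at that size looks like this (the timings are the ones quoted above, from this machine; yours will differ):

N = 1_000_000
a = CUDA.ones(Float32, N)
b = CUDA.ones(ComplexF32, N)
b_re = real.(b); b_im = imag.(b)

@btime CUDA.@sync dot_real($a, $b_re, $b_im)   # ~65 μs
@btime CUDA.@sync dot_complex($a, $b)          # ~107 μs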

Note 3: if you insist on the small problem size, it's better to use CUDA.@sync blocking=false.
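
For example, applied to the small-N MWE benchmarks:

@btime CUDA.@sync blocking=false dot_complex($a, $b)
@btime CUDA.@sync blocking=false dot_real($a, $b_re, $b_im)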

Note 4: you can always implement your own dot to benefit from your specific problem here. For example:

function dot_mapreduce(a::CuArray{Float32}, b_re::CuArray{Float32}, b_im::CuArray{Float32})
    # fuse both multiplications and the reduction into a single kernel
    mapreduce(+, a, b_re, b_im) do x, y, z
        complex(x*y, x*z)
    end
end

On N=1000000, that results in an execution time of 56 μs here (vs 65 μs for the real case and 107 μs for the complex case). Again, the dynamics aren't straightforward, but in general this approach avoids both the (costly) broadcast to convert the input and multiple (synchronizing) invocations of dot:

[profiler trace screenshot]
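
Usage mirrors the earlier benchmarks (the timing is the 56 μs figure quoted above; hardware-dependent):

@btime CUDA.@sync dot_mapreduce($a, $b_re, $b_im)   # ~56 μs at N = 1_000_000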

maleadt removed the "bug: Something isn't working" label on Jan 21, 2021
@kshyatt (Contributor) commented Feb 6, 2025

This should be fixed once #2616 is in, right?

@maleadt (Member) commented Feb 7, 2025

I don't think so; we still synchronize before returning the scalar. It's up to the user to call the asynchronous version of that API, which #2616 will provide, but which was already possible with the suggestions I posted above. So I think we can close this.
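
(One way to keep the result asynchronous today, as a sketch of the general idea rather than the #2616 API: reduce into device memory instead of returning a host scalar. The function name below is illustrative.)

# Sketch: `sum(...; dims=1)` returns a 1-element CuArray rather than a
# host scalar, so no device-to-host copy (and hence no synchronization)
# is forced; materialize the value later, when it is actually needed.
function dot_device(a::CuArray{Float32}, b::CuArray{ComplexF32})
    sum(a .* b; dims=1)
end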

maleadt closed this as completed on Feb 7, 2025.