Replies: 2 comments 6 replies
-
It is not going to work out of the box. If m and n are huge, I don't think you can completely fuse the two kernels. However, every threadblock can compute its own part of the dot product, and a second kernel can then do the final reduction.
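A rough NumPy stand-in for that structure (the function and tile names here are illustrative, not CUTLASS API): stage 1 plays the role of the per-threadblock work, writing out one partial sum per tile, and stage 2 plays the role of the second reduction kernel.

```python
import numpy as np

# Stage 1 (per "threadblock"): each (row-tile, column-tile) pair computes its
# partial contribution to A @ v and writes it out instead of reducing in place.
def stage1_partial_products(f, x_tiles, y_tiles, v_tiles):
    # partials[i][j] has shape (rows in tile i,): the j-th partial sum for row tile i.
    return [[f(xt, yt) @ vt for yt, vt in zip(y_tiles, v_tiles)]
            for xt in x_tiles]

# Stage 2 (the second kernel): reduce the partial sums over the column tiles
# and stitch the row tiles back together.
def stage2_reduce(partials):
    return np.concatenate([np.sum(row, axis=0) for row in partials])
```

Writing the per-tile partial sums out explicitly is what lets the reduction run as a separate launch instead of requiring one fully fused kernel.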
-
Thanks a lot! If you don't mind, I have a few more questions. Here is pythonic pseudo-code for the algorithm:

```python
import numpy as np

# Compute A @ v block by block, without materializing A.
x_aggregate = []
for x_slice in x_slices:                        # tiles of x along the n dimension
    y_aggregate = []
    for j, y_slice in enumerate(y_slices):      # tiles of y along the m dimension
        A_block = f(x_slice, y_slice)           # one (slice_size, slice_size) tile of A
        v_block = v[j * slice_size:(j + 1) * slice_size]
        y_aggregate.append(A_block @ v_block)   # partial product for this tile
    x_aggregate.append(sum(y_aggregate))        # reduce over the m dimension
full_A_v = np.concatenate(x_aggregate)          # the full A @ v, of length n
```
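As a sanity check, here is a small self-contained version of the same idea (illustrative only: it assumes NumPy, uses a squared-exponential kernel for f, and picks sizes small enough that the dense A can still be formed for comparison):

```python
import numpy as np

def f(x_slice, y_slice, length_scale=1.0):
    # Squared-exponential kernel: K[i, j] = exp(-||x_i - y_j||^2 / (2 * l^2)).
    sq_dists = ((x_slice[:, None, :] - y_slice[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
n, m, d, slice_size = 64, 96, 8, 16          # small enough to materialize A for checking
x = rng.standard_normal((n, d))
y = rng.standard_normal((m, d))
v = rng.standard_normal(m)

x_slices = [x[i:i + slice_size] for i in range(0, n, slice_size)]
y_slices = [y[j:j + slice_size] for j in range(0, m, slice_size)]

# Blocked product, as in the pseudo-code above.
x_aggregate = []
for x_slice in x_slices:
    partials = [f(x_slice, y_slice) @ v[j * slice_size:(j + 1) * slice_size]
                for j, y_slice in enumerate(y_slices)]
    x_aggregate.append(sum(partials))
blocked = np.concatenate(x_aggregate)

dense = f(x, y) @ v                          # reference; only feasible at toy sizes
assert np.allclose(blocked, dense)
```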
-
Hi,
I'm curious whether CUTLASS allows implicit matrix-vector multiplication. Specifically, given a kernel function f: ℝᵈ × ℝᵈ → ℝ, can two inputs x ∈ ℝⁿˣᵈ and y ∈ ℝᵐˣᵈ be used to produce an output matrix A ∈ ℝⁿˣᵐ and then perform a matrix-vector multiplication with a vector v ∈ ℝᵐ, without materializing the matrix A? In use cases like the Conjugate Gradient algorithm, where a large matrix (n and m > 1e6) is produced and multiplied by a vector v, it would be impractical to compute the matrix before the matrix-vector operation. An alternative approach would be to split or tile one of the inputs, x, into blocks of size b and compute the b × m parts of A while performing the matrix-vector multiplication.
Does CUTLASS support these operations directly, or would I need to use its primitives to implement them? The challenge is that the function f could be anything, though currently I'm looking at the Squared Exponential kernel.
Thanks!
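As a rough illustration of the scale involved (assuming fp32 storage): with n = m = 10⁶, a dense A would occupy n · m · 4 bytes = 4 × 10¹² bytes ≈ 4 TB, while a single b × m block with b = 256 is only about 1 GB, which is why the tiled formulation is the practical one.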