Replies: 2 comments 6 replies
-
It is not going to work out of the box. If m and n are huge, I don't think you can completely fuse the two kernels. However, every threadblock can compute its own part of the dot product, and a second kernel can then do the final reduction.
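A rough NumPy stand-in for that structure (the function and tile names here are illustrative, not CUTLASS API): stage 1 plays the role of the per-threadblock work, writing out one partial sum per tile, and stage 2 plays the role of the second reduction kernel.

```python
import numpy as np

# Stage 1 (per "threadblock"): each (row-tile, column-tile) pair computes its
# partial contribution to A @ v and writes it out instead of reducing in place.
def stage1_partial_products(f, x_tiles, y_tiles, v_tiles):
    # partials[i][j] has shape (rows in tile i,): the j-th partial sum for row tile i.
    return [[f(xt, yt) @ vt for yt, vt in zip(y_tiles, v_tiles)]
            for xt in x_tiles]

# Stage 2 (the second kernel): reduce the partial sums over the column tiles
# and stitch the row tiles back together.
def stage2_reduce(partials):
    return np.concatenate([np.sum(row, axis=0) for row in partials])
```

Writing the per-tile partial sums out explicitly is what lets the reduction run as a separate launch instead of requiring one fully fused kernel.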
-
Thanks a lot! If you don't mind, I have a few more questions. Here is pythonic pseudo-code for the algorithm:

```python
import numpy as np

# Compute A @ v block by block, without materializing A.
x_aggregate = []
for x_slice in x_slices:                        # tiles of x along the n dimension
    y_aggregate = []
    for j, y_slice in enumerate(y_slices):      # tiles of y along the m dimension
        A_block = f(x_slice, y_slice)           # one (slice_size, slice_size) tile of A
        v_block = v[j * slice_size:(j + 1) * slice_size]
        y_aggregate.append(A_block @ v_block)   # partial product for this tile
    x_aggregate.append(sum(y_aggregate))        # reduce over the m dimension
full_A_v = np.concatenate(x_aggregate)          # the full A @ v, of length n
```
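As a sanity check, here is a small self-contained version of the same idea (illustrative only: it assumes NumPy, uses a squared-exponential kernel for f, and picks sizes small enough that the dense A can still be formed for comparison):

```python
import numpy as np

def f(x_slice, y_slice, length_scale=1.0):
    # Squared-exponential kernel: K[i, j] = exp(-||x_i - y_j||^2 / (2 * l^2)).
    sq_dists = ((x_slice[:, None, :] - y_slice[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

rng = np.random.default_rng(0)
n, m, d, slice_size = 64, 96, 8, 16          # small enough to materialize A for checking
x = rng.standard_normal((n, d))
y = rng.standard_normal((m, d))
v = rng.standard_normal(m)

x_slices = [x[i:i + slice_size] for i in range(0, n, slice_size)]
y_slices = [y[j:j + slice_size] for j in range(0, m, slice_size)]

# Blocked product, as in the pseudo-code above.
x_aggregate = []
for x_slice in x_slices:
    partials = [f(x_slice, y_slice) @ v[j * slice_size:(j + 1) * slice_size]
                for j, y_slice in enumerate(y_slices)]
    x_aggregate.append(sum(partials))
blocked = np.concatenate(x_aggregate)

dense = f(x, y) @ v                          # reference; only feasible at toy sizes
assert np.allclose(blocked, dense)
```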
-
Hi,
I'm curious whether CUTLASS allows implicit matrix-vector multiplication. Specifically, given a kernel function f: ℝᵈ × ℝᵈ → ℝ, can two inputs x ∈ ℝⁿˣᵈ and y ∈ ℝᵐˣᵈ be used to produce an output matrix A ∈ ℝⁿˣᵐ and then perform a matrix-vector multiplication with a vector v ∈ ℝᵐ, without materializing the matrix A? In use cases like the Conjugate Gradient algorithm, where a large matrix (n and m > 1e6) is produced and multiplied by a vector v, it would be impractical to compute the matrix before the matrix-vector operation. An alternative approach would be to split or tile one of the inputs, x, into blocks of size b and compute the b × m parts of A while performing the matrix-vector multiplication.
Does CUTLASS support these operations directly, or would I need to use its primitives to implement them? The challenge is that the function f could be anything, though currently I'm looking at the Squared Exponential kernel.
Thanks!
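As a rough illustration of the scale involved (assuming fp32 storage): with n = m = 10⁶, a dense A would occupy n · m · 4 bytes = 4 × 10¹² bytes ≈ 4 TB, while a single b × m block with b = 256 is only about 1 GB, which is why the tiled formulation is the practical one.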