Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GEMM device function #880

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

AGonzales-amd
Copy link
Contributor

@AGonzales-amd AGonzales-amd commented Jan 13, 2025

This adds a gemm device function which is callable in other kernels. The function is designed to be called by an entire wavefront and to compute a block of the output matrix. Therefore, problems can be decomposed into chunks that are operated on by individual wavefronts. Currently, it is used to implement an alternative rocsolver_gemm kernel.

  • Uses mfma instructions
  • Limited to __gfx90a__, __gfx940__, __gfx941__, and __gfx942__ architectures
  • Supports both complex data types and transpose matrix operations

@AGonzales-amd AGonzales-amd marked this pull request as ready for review January 15, 2025 22:20
I p,
T alpha,
const T *A,
I inc_A,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the inc_A, inc_B, ... etc the same as "stride" like strideA? If not, perhaps some description would be helpful. Just a suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added function documentation, thanks for the suggestion

Copy link
Contributor

@EdDAzevedo EdDAzevedo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if there are tests scripts or updates to rocsolver-bench and/or rocsolver-test to check for correctness (and perhaps performance). Note check for m,n,k to be nb, nb+1, nb-1 corner cases as well.

Will other BLAS operations, say TRMM, or SYR2K, also take advantage of the new GEMM? One may consider a conceptually recursive formulation so that it can use GEMM.

@AGonzales-amd AGonzales-amd added the noOptimizations Disable optimized kernels for small sizes for some routines label Jan 16, 2025
@jmachado-amd
Copy link
Collaborator

Hi @AGonzales-amd, I second Ed's suggestion: you should update rocsolver-test and -bench clients to support the internal gemm. I can help you with the tests if you want (this would also be a good opportunity to provide a concrete answer to the question you asked in #879).

@AGonzales-amd
Copy link
Contributor Author

Hi @AGonzales-amd, I second Ed's suggestion: you should update rocsolver-test and -bench clients to support the internal gemm. I can help you with the tests if you want (this would also be a good opportunity to provide a concrete answer to the question you asked in #879).

Thanks @jmachado-amd, I could use your help. I did consider updating the client programs but I had trouble exporting the function or making it visible in the clients.

@AGonzales-amd
Copy link
Contributor Author

Hi @jmachado-amd and @EdDAzevedo, the client programs have been updated to support internal gemm. One thing I'm not sure about is the tolerance for error checking.

@jmachado-amd jmachado-amd added the ci:no-ccache Disable ccache label Jan 23, 2025
@jmachado-amd
Copy link
Collaborator

Hi @AGonzales-amd, as long as the input matrices are "small" to "medium" sized and have positive, relatively small integer entries, the current test tolerance will work just fine! Let me know if you want to generalize the tests or just understand how those bounds work, and I can explain the important bits of the theory to you.

On another topic, I see that there are many gemm tests failing in Windows, you probably want to have a look at that sooner rather than later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci:no-ccache Disable ccache noOptimizations Disable optimized kernels for small sizes for some routines
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants