-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GEMM device function #880
base: develop
Are you sure you want to change the base?
Conversation
I p, | ||
T alpha, | ||
const T *A, | ||
I inc_A, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the inc_A, inc_B, ... etc the same as "stride" like strideA? If not, perhaps some description would be helpful. Just a suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added function documentation, thanks for the suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if there are tests scripts or updates to rocsolver-bench and/or rocsolver-test to check for correctness (and perhaps performance). Note check for m,n,k to be nb, nb+1, nb-1 corner cases as well.
Will other BLAS operations, say TRMM, or SYR2K, also take advantage of the new GEMM? One may consider a conceptually recursive formulation so that it can use GEMM.
Hi @AGonzales-amd, I second Ed's suggestion: you should update |
Thanks @jmachado-amd, I could use your help. I did consider updating the client programs but I had trouble exporting the function or making it visible in the clients. |
Hi @jmachado-amd and @EdDAzevedo, the client programs have been updated to support internal gemm. One thing I'm not sure about is the tolerance for error checking. |
Hi @AGonzales-amd, as long as the input matrices are "small" to "medium" sized and have positive, relatively small integer entries, the current test tolerance will work just fine! Let me know if you want to generalize the tests or just understand how those bounds work, and I can explain the important bits of the theory to you. On another topic, I see that there are many |
This adds a gemm device function which is callable in other kernels. The function is designed to be called by an entire wavefront and to compute a block of the output matrix. Therefore, problems can be decomposed into chunks that are operated on by individual wavefronts. Currently, it is used to implement an alternative rocsolver_gemm kernel.
__gfx90a__
,__gfx940__
,__gfx941__
, and__gfx942__
architectures