-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
should not remove predicate for 1d TMA #3899
base: main
Are you sure you want to change the base?
Conversation
…zation' into llu/tma_predicate
Review updated until commit 5b2a40d Description
Changes walkthrough 📝
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
!test |
…nto llu/tma_predicate
!test |
What?
CpAsyncBulkTensorTile
), not 1d TMACpAsyncBulk
.Why?
CpAsyncBulkTensorTile
handles out of bound accesses automatically in hardware, no need to predicate it.However,
CpAsyncBulk
still needs predicate.ref
the memory range [srcMem, srcMem + size - 1] must not overflow the source memory space. Otherwise, the behavior is undefined
Difficulties to handle circualr buffer
We need to predicate both
CpAsyncBulk
andwaitParity
to avoid out of bound access and deadlock. For pre-log (prefetch 2), we can change to the following, whereb26
is a newly added predicate to avoid out of bound access.For main-loop, the change is similar except for the extra predicate of
wait parity
. The predicate forwait parity
is assuming the split by stages is divisible, otherwise, it is not correct. This is the first challenge of adding predicate.For epilog, there is no
cpAsyncBulkG2S
, we need to create the predicate using other reference tensors, e.g. output tensor. This is another challenge of adding predicate.Other approaches:
The WAR in #3917 is much simplier. It uses a mod op to aviod out of bound access, the redundant tma load only happens for the tail SMs in the last iteration.