Activity
[AMD] Update smem size for cdna4
* moved Membar to allocate-shared-memory
Force push
[AMD] Use warp shuffle for fp8 MFMA to dot operand layout conversion (t…
Force warpsPerCTA={1, numWarps} when BLOCK_M=mDim
Deleted branch
Skip scalar and 1D tensor load for sinkSecondLoad
Force push
Skip scalar and 1D tensor load for sinkSecondLoad
Merge pull request #721 from ROCm/dtanner/dev-refine-ops
Pull request merge
[AMD] enhanced dep-graph printing
Add workaround for pytorch device selection issue (#711)
[AMD] re-worked scheduling loop in the machine model
[AMD] Addressed some comments from PR #724 (fork)
fix
20 hours ago
fix
20 hours ago
persistent approach brings benefit with larger batch sizes with this …
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Deleted branch
Addressing review improvements.
Merge remote-tracking branch 'origin/main' into sjw/global-local-pref…
Force push