Attention to better MLPerf and beyond #108

Open
2 of 8 tasks
raikonenfnu opened this issue Nov 19, 2024 · 0 comments

  1. General Attention Health
  • Modifying kWidth to maximize reads from shared memory
  • Modifying kWidth such that FP8 does not need a trip through shared memory
  • Enable attention transposeV when possible (in progress)
  • Dot slicing for better instruction scheduling
  • Buffer loads for free masking, and moving K and V directly from global to shared memory
  • Instruction scheduling / software pipelining to overlap MMA and softmax
  • Prefetch/MultiBuffering
  • Try dot3d / single-kernel split-K for faster attention in the decode phase (see the sketch after this list)
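
The split-K item above is the most self-contained of these, so here is a minimal NumPy sketch of the idea: split the K/V cache along the sequence axis, let each split produce a partial output plus its own softmax statistics, then combine the partials with online-softmax rescaling. The function name, shapes, and two-phase structure are illustrative assumptions, not the actual kernels this issue tracks.

```python
import numpy as np

def split_k_decode_attention(q, K, V, num_splits=4):
    """Decode-phase attention: one query q (d,) against a cached K, V (seq, d)."""
    scale = 1.0 / np.sqrt(q.shape[0])
    chunks = np.array_split(np.arange(K.shape[0]), num_splits)

    # Phase 1: each split computes an unnormalized partial output and its
    # local softmax statistics (max m, normalizer l). On a GPU these splits
    # would run as parallel workgroups instead of this sequential loop.
    partials = []
    for idx in chunks:
        s = (K[idx] @ q) * scale       # attention scores for this split
        m = s.max()                    # local max, for numerical stability
        p = np.exp(s - m)              # unnormalized probabilities
        partials.append((m, p.sum(), p @ V[idx]))

    # Phase 2: a small reduction rescales every partial to the global max
    # (the standard online-softmax correction) and sums them.
    m_glob = max(m for m, _, _ in partials)
    l_glob = sum(l * np.exp(m - m_glob) for m, l, _ in partials)
    out = sum(o * np.exp(m - m_glob) for m, _, o in partials)
    return out / l_glob

# Sanity check against plain softmax attention.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
q = rng.standard_normal(64)
s = (K @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(split_k_decode_attention(q, K, V), ref)
```

The same rescaling trick is what lets the softmax be computed incrementally per tile, which is the piece the "overlap MMA and softmax" pipelining item builds on.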