Attention to better MLPerf and beyond #108

Open
2 of 8 tasks
raikonenfnu opened this issue Nov 19, 2024 · 0 comments

  1. General Attention Health
  • Modifying kWidth to maximize reads from shared memory
  • Modifying kWidth such that FP8 does not need a trip through shared memory
  • Enable attention transposeV when possible (in progress)
  • Dot slicing for better instruction scheduling
  • Buffer loads for free masking, and moving K and V directly from global to shared memory
  • Instruction scheduling / software pipelining to overlap MMA and softmax
  • Prefetch/MultiBuffering
  • Try dot3d / single-kernel split-K for faster attention in the decode phase (see the sketch after this list)
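
The split-K item above is the most self-contained of these, so here is a minimal NumPy sketch of the idea: split the K/V cache along the sequence axis, let each split produce a partial output plus its own softmax statistics, then combine the partials with online-softmax rescaling. The function name, shapes, and two-phase structure are illustrative assumptions, not the actual kernels this issue tracks.

```python
import numpy as np

def split_k_decode_attention(q, K, V, num_splits=4):
    """Decode-phase attention: one query q (d,) against a cached K, V (seq, d)."""
    scale = 1.0 / np.sqrt(q.shape[0])
    chunks = np.array_split(np.arange(K.shape[0]), num_splits)

    # Phase 1: each split computes an unnormalized partial output and its
    # local softmax statistics (max m, normalizer l). On a GPU these splits
    # would run as parallel workgroups instead of this sequential loop.
    partials = []
    for idx in chunks:
        s = (K[idx] @ q) * scale       # attention scores for this split
        m = s.max()                    # local max, for numerical stability
        p = np.exp(s - m)              # unnormalized probabilities
        partials.append((m, p.sum(), p @ V[idx]))

    # Phase 2: a small reduction rescales every partial to the global max
    # (the standard online-softmax correction) and sums them.
    m_glob = max(m for m, _, _ in partials)
    l_glob = sum(l * np.exp(m - m_glob) for m, l, _ in partials)
    out = sum(o * np.exp(m - m_glob) for m, _, o in partials)
    return out / l_glob

# Sanity check against plain softmax attention.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
q = rng.standard_normal(64)
s = (K @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(split_k_decode_attention(q, K, V), ref)
```

The same rescaling trick is what lets the softmax be computed incrementally per tile, which is the piece the "overlap MMA and softmax" pipelining item builds on.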