Faster Custom Paged Attention kernels #372

Open · wants to merge 14 commits into base: main

Conversation

sanyalington commented on January 21, 2025:

Faster CPA kernel based on mfma16x16x16 instructions.
Performance is the same as the numbers reported in #347.
Fixed the tests/kernels/test_attention.py::test_paged_attention unit test for the CPA ROCm kernel.
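For context: mfma16x16x16 refers to the matrix-core instructions on AMD CDNA GPUs (e.g. v_mfma_f32_16x16x16f16 for fp16 inputs), where a single 64-lane wavefront computes a full 16x16x16 tile product with fp32 accumulation in one instruction. Below is a minimal, illustrative HIP sketch of the underlying intrinsic, not the PR's kernel; the kernel name, typedefs, and the lane-to-element load pattern are assumptions for illustration:

```cpp
#include <hip/hip_runtime.h>

typedef _Float16 half4 __attribute__((ext_vector_type(4)));
typedef float floatx4 __attribute__((ext_vector_type(4)));

// Illustrative only: one wavefront multiplying two 16x16 fp16 tiles with
// fp32 accumulation via a single MFMA instruction. The exact lane-to-element
// mapping is hardware-defined; the loads below are placeholders.
__global__ void mfma_16x16x16_demo(const _Float16* a, const _Float16* b,
                                   float* d) {
  int lane = threadIdx.x & 63;  // lane id within the 64-wide wavefront
  half4 a_frag, b_frag;
  floatx4 acc = {0.f, 0.f, 0.f, 0.f};
  // Each lane holds 4 fp16 elements of A and B (64 lanes * 4 = 256 = 16x16).
  for (int i = 0; i < 4; ++i) {
    a_frag[i] = a[lane * 4 + i];
    b_frag[i] = b[lane * 4 + i];
  }
  // D(16x16, fp32) = A(16x16, fp16) * B(16x16, fp16) + C, in one instruction.
  acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc, 0, 0, 0);
  for (int i = 0; i < 4; ++i) d[lane * 4 + i] = acc[i];
}
```

Presumably the CPA kernel tiles queries and paged K/V blocks into fragments of this shape; the sketch only shows the core instruction, not the paging or softmax logic.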

sanyalington requested a review from gshtras on January 21, 2025, 16:08.
tests/kernels/test_attention.py:

```diff
@@ -117,7 +117,7 @@ def ref_single_query_cached_kv_attention(

 @pytest.mark.parametrize(
     "version",
-    ["v1", "v2"] if not current_platform.is_rocm() else ["v1", "v2", "rocm"])
+    ["v1", "v2"] if not current_platform.is_rocm() else ["rocm"])
```
gshtras (Collaborator):

Don't we still want to run unit tests on v1/v2? These backends are still supported.

sanyalington (Author):

@gshtras Yes, but the v1/v2 unit tests are broken; I saw errors when I tried running them.

Another commenter:

We have suggested fixes to the v1/v2 unit tests in PR #383.

Collaborator:

Let's re-target #383 to this branch and combine them.

csrc/rocm/attention.cu:

```diff
 }

 #define CALL_CUSTOM_LAUNCHER(T, KVT, KV_DTYPE, BLK_SIZE, HEAD_SIZE, OUTT, \
-                             PSIZE)                                       \
+                             PSIZE, ALIBI_ENABLED)                        \
```
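The new ALIBI_ENABLED macro parameter suggests the launcher now specializes the kernel on whether ALiBi slopes are present, so the non-ALiBi path pays no cost for the feature. A minimal sketch of that pattern, with assumed names (attn_bias_demo, launch_attn_bias_demo) rather than the PR's actual kernel:

```cpp
#include <hip/hip_runtime.h>

// Kernel specialized on whether an ALiBi bias is applied. The `false`
// instantiation compiles the bias code out entirely.
template <bool ALIBI_ENABLED>
__global__ void attn_bias_demo(const float* alibi_slopes, float* scores,
                               int seq_len) {
  const int head = blockIdx.x;
  const int pos = threadIdx.x;
  if (pos >= seq_len) return;
  float bias = 0.0f;
  if constexpr (ALIBI_ENABLED) {
    // ALiBi: a per-head linear penalty by distance from the last token.
    bias = alibi_slopes[head] * (pos - seq_len + 1);
  }
  scores[head * seq_len + pos] += bias;
}

// Host-side dispatch: the runtime "do we have slopes?" check becomes a
// compile-time template argument, mirroring what the new macro parameter
// appears to do for CALL_CUSTOM_LAUNCHER.
void launch_attn_bias_demo(const float* alibi_slopes, float* scores,
                           int num_heads, int seq_len, hipStream_t stream) {
  const dim3 grid(num_heads);
  const dim3 block(seq_len);  // demo only: assumes seq_len fits in one block
  if (alibi_slopes != nullptr) {
    attn_bias_demo<true><<<grid, block, 0, stream>>>(alibi_slopes, scores,
                                                     seq_len);
  } else {
    attn_bias_demo<false><<<grid, block, 0, stream>>>(alibi_slopes, scores,
                                                      seq_len);
  }
}
```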
Collaborator:

This tells me that the CI linters are broken...

sanyalington (Author):

I missed running clang-format; let me do that.

Collaborator:

On the plus side, this exposed the CI linters issue :)
We'll hopefully get them back soon.

tjtanaa pushed a commit to EmbeddedLLM/vllm that referenced this pull request on January 23, 2025:

Ported ROCm/vllm changes to upstream vLLM

This commit manually ports changes from ROCm/vllm (ROCm#372) to upstream vLLM. The original work was done by sanyalington.

Co-authored-by: sanyalington <[email protected]>
Signed-off-by: vllmellm <[email protected]>
shajrawi (Collaborator) previously approved these changes on January 24, 2025:

Approving based on latest accuracy numbers.

* [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`, `csrc/attention/paged_attention_v2.cu`, and `csrc/rocm/attention.cu`.

* Improve code documentation.

* Lint.

---------

Co-authored-by: vllmellm <[email protected]>