Faster Custom Paged Attention kernels #372

Open · wants to merge 14 commits into base: main

Conversation

sanyalington commented on January 21, 2025:

Faster CPA kernel based on mfma16x16x16 instructions.
Performance is the same as the numbers reported in #347.
Fixed the tests/kernels/test_attention.py::test_paged_attention unit test for the CPA ROCm kernel.
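For context: mfma16x16x16 refers to the matrix-core instructions on AMD CDNA GPUs (e.g. v_mfma_f32_16x16x16f16 for fp16 inputs), where a single 64-lane wavefront computes a full 16x16x16 tile product with fp32 accumulation in one instruction. Below is a minimal, illustrative HIP sketch of the underlying intrinsic, not the PR's kernel; the kernel name, typedefs, and the lane-to-element load pattern are assumptions for illustration:

```cpp
#include <hip/hip_runtime.h>

typedef _Float16 half4 __attribute__((ext_vector_type(4)));
typedef float floatx4 __attribute__((ext_vector_type(4)));

// Illustrative only: one wavefront multiplying two 16x16 fp16 tiles with
// fp32 accumulation via a single MFMA instruction. The exact lane-to-element
// mapping is hardware-defined; the loads below are placeholders.
__global__ void mfma_16x16x16_demo(const _Float16* a, const _Float16* b,
                                   float* d) {
  int lane = threadIdx.x & 63;  // lane id within the 64-wide wavefront
  half4 a_frag, b_frag;
  floatx4 acc = {0.f, 0.f, 0.f, 0.f};
  // Each lane holds 4 fp16 elements of A and B (64 lanes * 4 = 256 = 16x16).
  for (int i = 0; i < 4; ++i) {
    a_frag[i] = a[lane * 4 + i];
    b_frag[i] = b[lane * 4 + i];
  }
  // D(16x16, fp32) = A(16x16, fp16) * B(16x16, fp16) + C, in one instruction.
  acc = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc, 0, 0, 0);
  for (int i = 0; i < 4; ++i) d[lane * 4 + i] = acc[i];
}
```

Presumably the CPA kernel tiles queries and paged K/V blocks into fragments of this shape; the sketch only shows the core instruction, not the paging or softmax logic.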

sanyalington requested a review from gshtras on January 21, 2025, 16:08.
tests/kernels/test_attention.py:

```diff
@@ -117,7 +117,7 @@ def ref_single_query_cached_kv_attention(

 @pytest.mark.parametrize(
     "version",
-    ["v1", "v2"] if not current_platform.is_rocm() else ["v1", "v2", "rocm"])
+    ["v1", "v2"] if not current_platform.is_rocm() else ["rocm"])
```
gshtras (Collaborator):

Don't we still want to run unit tests on v1/v2? These backends are still supported.

sanyalington (Author):

@gshtras Yes, but the v1/v2 unit tests are broken; I saw errors when I tried running them.

Another commenter:

We have suggested fixes to the v1/v2 unit tests in PR #383.

Collaborator:

Let's re-target #383 to this branch and combine them.

csrc/rocm/attention.cu:

```diff
 }

 #define CALL_CUSTOM_LAUNCHER(T, KVT, KV_DTYPE, BLK_SIZE, HEAD_SIZE, OUTT, \
-                             PSIZE)                                       \
+                             PSIZE, ALIBI_ENABLED)                        \
```
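The new ALIBI_ENABLED macro parameter suggests the launcher now specializes the kernel on whether ALiBi slopes are present, so the non-ALiBi path pays no cost for the feature. A minimal sketch of that pattern, with assumed names (attn_bias_demo, launch_attn_bias_demo) rather than the PR's actual kernel:

```cpp
#include <hip/hip_runtime.h>

// Kernel specialized on whether an ALiBi bias is applied. The `false`
// instantiation compiles the bias code out entirely.
template <bool ALIBI_ENABLED>
__global__ void attn_bias_demo(const float* alibi_slopes, float* scores,
                               int seq_len) {
  const int head = blockIdx.x;
  const int pos = threadIdx.x;
  if (pos >= seq_len) return;
  float bias = 0.0f;
  if constexpr (ALIBI_ENABLED) {
    // ALiBi: a per-head linear penalty by distance from the last token.
    bias = alibi_slopes[head] * (pos - seq_len + 1);
  }
  scores[head * seq_len + pos] += bias;
}

// Host-side dispatch: the runtime "do we have slopes?" check becomes a
// compile-time template argument, mirroring what the new macro parameter
// appears to do for CALL_CUSTOM_LAUNCHER.
void launch_attn_bias_demo(const float* alibi_slopes, float* scores,
                           int num_heads, int seq_len, hipStream_t stream) {
  const dim3 grid(num_heads);
  const dim3 block(seq_len);  // demo only: assumes seq_len fits in one block
  if (alibi_slopes != nullptr) {
    attn_bias_demo<true><<<grid, block, 0, stream>>>(alibi_slopes, scores,
                                                     seq_len);
  } else {
    attn_bias_demo<false><<<grid, block, 0, stream>>>(alibi_slopes, scores,
                                                      seq_len);
  }
}
```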
Collaborator:

This tells me that the CI linters are broken...

sanyalington (Author):

I missed running clang-format; let me do that.

Collaborator:

On the plus side, this exposed the CI linters issue :)
We'll hopefully get them back soon.

tjtanaa pushed a commit to EmbeddedLLM/vllm that referenced this pull request on January 23, 2025:

Ported ROCm/vllm changes to upstream vLLM

This commit manually ports changes from ROCm/vllm (ROCm#372) to upstream vLLM. The original work was done by sanyalington.

Co-authored-by: sanyalington <[email protected]>
Signed-off-by: vllmellm <[email protected]>
shajrawi (Collaborator) previously approved these changes on January 24, 2025:

Approving based on latest accuracy numbers.

* [Bugfix]: fix paged attention tests based on the updated kernels in `csrc/attention/paged_attention_v1.cu`, `csrc/attention/paged_attention_v2.cu`, and `csrc/rocm/attention.cu`.

* Improve code documentation.

* Lint.

---------

Co-authored-by: vllmellm <[email protected]>