
[Kernel][Model] PagedAttention: Support custom attention bias for T5 model (1/2) #11334

Open · wants to merge 7 commits into main from custom-attn-bias-signed

Conversation

NickLucche (Contributor) commented on Dec 19, 2024

This contribution is the first of two PRs adding support for the T5 model. It builds on the great work done in #7366 (comment) and #3117.

It implements:

$$ Attention(Q, K, V) = \text{softmax}\left(\frac{QK^\top + A}{\sqrt{d_k}}\right)V $$

In particular, this aims to be a generalized addition (not tied to T5), enabling any custom attention bias in the PagedAttention kernel.

While I am aware this introduces a fully materialized bias matrix (akin to what's done in xformers), currently padded to max_seq_len, I believe the flexibility of allowing any custom bias still pays for itself by providing initial compatibility for as-yet-unsupported models (such as T5). I'd be happy to focus on performance optimization later (e.g., a variable-length bias matrix or FlashAttention-like IO optimization), or even add yet another model-specific feature to the kernel so that T5's relative positional embeddings are computed inline.
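
For illustration, here is a minimal sketch (not the actual kernel interface introduced by this PR) of how an optional, fully materialized bias padded to max_seq_len could be consumed per key token inside a PagedAttention-style inner loop; the names attn_bias, num_seqs and padded_max_seq_len, and the [num_heads, num_seqs, padded_max_seq_len] layout, are assumptions made for this example:

// Illustrative sketch only -- not the exact kernel signature of this PR.
// Assumes a bias materialized as [num_heads, num_seqs, padded_max_seq_len],
// padded up to max_seq_len so the key-token index can be used directly.
__device__ __forceinline__ float apply_custom_bias_sketch(
    float qk,                    // scaled Q·K^T value for one key token
    const float* attn_bias,      // nullptr when no custom bias is provided
    int head_idx, int seq_idx, int token_idx,
    int num_seqs, int padded_max_seq_len) {
  if (attn_bias != nullptr) {
    const float* bias_row =
        attn_bias +
        (static_cast<long long>(head_idx) * num_seqs + seq_idx) * padded_max_seq_len;
    qk += bias_row[token_idx];   // padding guarantees token_idx < padded_max_seq_len
  }
  return qk;
}

In the kernel itself the add sits next to the existing ALiBi term, as the review snippet further down shows.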

TODO, left for future contributions (i.e., extending support to all platforms):

  • ROCm attn bias support
  • CPU attn bias support
  • ..?


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

qk += (alibi_slope != 0) ? alibi_slope * (token_idx - seq_len + 1) : 0;
qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;

T5 uses relative bias, so I believe attn_bias_vec should instead be indexed with seq_len - token_idx - 1.

Suggested change:
- qk += (attn_bias_vec != nullptr) ? attn_bias_vec[token_idx] : 0;
+ const int custom_bias_idx = max(min(seq_len - token_idx - 1, padded_max_seq_len), 0);
+ qk += (attn_bias_vec != nullptr) ? attn_bias_vec[custom_bias_idx] : 0;

I have tested this with T5 using a similar implementation.
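
To make the indexing difference concrete, here is a toy host-side walk-through for a single decode step (seq_len and padded_max_seq_len are made-up illustrative values, not taken from the kernel):

#include <algorithm>
#include <cstdio>

// Compare the original absolute indexing (token_idx) with the suggested
// relative indexing (seq_len - token_idx - 1, clamped) for one decode step
// where the query is the last token of a 5-token sequence.
int main() {
  const int seq_len = 5;
  const int padded_max_seq_len = 8;  // illustrative clamp bound
  for (int token_idx = 0; token_idx < seq_len; ++token_idx) {
    const int absolute_idx = token_idx;  // key position in the sequence
    const int relative_idx =
        std::max(std::min(seq_len - token_idx - 1, padded_max_seq_len), 0);
    std::printf("key %d -> absolute bias idx %d, relative bias idx %d\n",
                token_idx, absolute_idx, relative_idx);
  }
  return 0;
}

The first key token ends up at relative index 4 (farthest from the query) and the current token at relative index 0, which matches the ordering the reviewer describes for T5's relative bias.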

sfc-gh-phalama commented:

> While I am aware this introduces a fully materialized matrix (akin to what's done in xformers), currently padded to max_seq_len

@NickLucche In the case of T5, the bias can be limited to the maximum relative distance, which for T5 Large is 128. Anything beyond 128 just uses the last value from the bias tensor.
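
A hedged sketch of the size reduction described here, assuming the per-head bias is stored by relative distance and that distances past the cap reuse the last entry (T5's log-spaced bucketing is omitted for brevity; relative_bias and max_distance are illustrative names, not part of this PR):

// Illustrative only: look up a per-head bias stored by relative distance,
// capped at max_distance entries (128 for T5 Large, per the comment above).
// Keys farther than max_distance - 1 from the query reuse the last stored value,
// so the materialized bias needs max_distance columns instead of max_seq_len.
__host__ __device__ inline float relative_bias_lookup_sketch(
    const float* relative_bias,  // [max_distance] bias values for one head
    int query_pos, int key_pos, int max_distance) {
  int dist = query_pos - key_pos;  // decoder self-attention: key_pos <= query_pos
  if (dist < 0) dist = 0;
  if (dist > max_distance - 1) dist = max_distance - 1;
  return relative_bias[dist];
}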

NickLucche (Contributor, Author) commented on Dec 20, 2024

Thanks for the quick input @sfc-gh-phalama. Personally, I'd rather not have the attention bias implementation depend on or be tied to T5's relative positional encoding mechanism; otherwise, we could probably just compute that formula inline in the kernel.

NickLucche force-pushed the custom-attn-bias-signed branch from 7c8dd7e to 5c47f43 on January 9, 2025, 14:04.
joerunde added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Jan 13, 2025.
NickLucche force-pushed the custom-attn-bias-signed branch from e858a1f to f37d776 on January 15, 2025, 18:38.
NickLucche (Contributor, Author) commented on Jan 15, 2025

Fixed the failing kernel tests and rebased; not sure how related the remaining errors are, tbh:
https://buildkite.com/vllm/ci/builds/12022#01946b42-da66-4512-919d-c2115515eb6d/147-444

AssertionError: please set tensor_parallel_size (4) to less than max local gpu count (1)

This looks more like a CI/node issue.

joerunde (Collaborator) commented:

@NickLucche the GPU count issue might be fixed by this PR: #12111

Signed-off-by: NickLucche <[email protected]>
NickLucche force-pushed the custom-attn-bias-signed branch from f37d776 to ca9562b on January 21, 2025, 15:32.
NickLucche (Contributor, Author) commented:

CI failures still look unrelated to me :/

NickLucche (Contributor, Author) commented:

I believe this PR could also be useful in places like https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/mllama.py#L863, where we could pass the custom attention mask in as the bias term A while getting rid of the KV-cache handling logic in the model.
