[V1][WIP] Hybrid allocator for full attention & sliding window attention interleaved models (Reference PR, do not merge) #11938
base: main
Conversation
This PR implements step 1 of #11382: a hybrid KV cache allocator for models that interleave full attention and sliding window attention layers.

Benchmark result: it accelerates hybrid models while adding very little overhead on standard models with only full attention.
Key modifications
```python
num_gpu_blocks, _ = self.model_executor.determine_num_available_blocks()
self.model_executor.initialize(num_gpu_blocks)  # allocate kv cache
```

I plan to split it into the following PRs:
- `HybridKVCacheManager`, a pluggable alternative to `KVCacheManager` that won't touch the code path for standard models with only full-attention layers (see the sketch below).
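
As a rough illustration of what a hybrid allocator buys, here is a minimal, hypothetical sketch. It is not vLLM's actual `KVCacheManager`/`HybridKVCacheManager` API; the helper `blocks_needed`, the block size, and the window length are made up for the example. The point is that a full-attention layer must keep every KV block of the prefix resident, while a sliding-window layer only needs the blocks overlapping its window, so the two layer types can be sized and freed independently.

```python
import math


def blocks_needed(num_tokens: int, block_size: int, sliding_window: int | None) -> int:
    """Blocks one layer must keep resident for a sequence of `num_tokens` tokens.

    sliding_window=None models a full-attention layer (the whole prefix stays
    cached); otherwise only blocks overlapping the last `sliding_window`
    tokens must stay resident.
    """
    if sliding_window is None:
        return math.ceil(num_tokens / block_size)
    # In the worst case the window is not block-aligned and straddles one
    # extra block, so bound the live span by sliding_window + block_size - 1.
    live_tokens = min(num_tokens, sliding_window + block_size - 1)
    return math.ceil(live_tokens / block_size)


if __name__ == "__main__":
    # Example: 16K-token context, 16-token blocks, 4K sliding window.
    print(blocks_needed(16_384, 16, None))   # 1024 blocks (full-attention layer)
    print(blocks_needed(16_384, 16, 4_096))  # 257 blocks (sliding-window layer)
```

Under these made-up numbers, the sliding-window layers keep roughly a quarter of the blocks that the full-attention layers do, which is the kind of saving a per-layer-type allocator can exploit without changing anything for models that use full attention everywhere.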