-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[V1] Move more control of kv cache initialization from model_executor to EngineCore #11960
Conversation
Signed-off-by: Chen Zhang <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Most comments are style changes. One important thing: Please docstring all functions and files in the desired format:
def func_name(arg1, arg2):
"""What does this function do?
Args:
arg1: ...
arg2: ...
Returns:
... (skip if return None) ...
"""
vllm/v1/core/kv_cache_utils.py
Outdated
return kv_cache_config, num_gpu_blocks | ||
|
||
|
||
def is_same_key(kv_cache_spec: KVCacheSpec) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we try to make this name more informative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KVCacheSpecBase.key
-> KVCacheSpecBase.type_id
is_same_key
-> is_same_type
vllm/v1/utils.py
Outdated
def bind_kv_cache( | ||
ctx: Dict[str, Any], | ||
runner_kv_caches: List[torch.Tensor], | ||
kv_caches: Dict[str, torch.Tensor], | ||
) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be in the kv_cache_utils.py as it is kv cache related?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bind_kv_cache
is called by GPUModelRunner
. I think it is strange to let GPUModelRunner
call a function inside core
. These two parts should be independent.
vllm/v1/kv_cache_interface.py
Outdated
# [group_id][layer_name in the group]. One group containing all | ||
# layer_names if the Spec for kv_cache of all layers are the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand [group_id][layer_name in the group]
. What's group ID? It might be better to just show an example.
# A list of kv-cache groups. Each group includes a set of layers with
# the same kv-cache spec. For example: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments updated. Do you feel better now?
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Cody Yu <[email protected]> Signed-off-by: Chen Zhang <[email protected]>
Co-authored-by: Cody Yu <[email protected]> Signed-off-by: Chen Zhang <[email protected]>
b3ca6f2
to
97176da
Compare
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
@comaniac Thank you for the review. I've updated the code based on your suggestions. Can you take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be the last batch of comments. Approve to unblock the PR first.
vllm/v1/core/kv_cache_utils.py
Outdated
f"initializing the engine.") | ||
|
||
|
||
def is_same_type(kv_cache_spec: KVCacheSpec) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this function name is a bit unclear. It actually checks whether the "kv cache specs" of "all" layers are the same, so it should be informative about "spec", "all" and "same". Maybe "is_uniformed_kv_cache_type" or something like that would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to is_kv_cache_type_uniform
vllm/v1/core/kv_cache_utils.py
Outdated
def get_kv_cache_config(vllm_config: VllmConfig, kv_cache_spec: KVCacheSpec, | ||
available_memory: int) -> Tuple[KVCacheConfig, int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again this function name (also the callee _get_kv_cache_config_same_type
) doesn't illustrate the number of GPU blocks. It's weird to see config, num_blocks = get_kv_cache_config(...)
.
One possibility:
def get_kv_cache_config_and_available_blocks(...):
check_enough_kv_cache_memory(...)
# Later maybe you can introduce a registry when you have more policies.
if is_uniformed_kv_cache_type(...):
return _get_kv_cache_config_and_blocks_for_unifiemd_type(...)
return _get_kv_cache_config_and_blocks_for_xxx(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed num_blocks
to an attribute of KVCacheConfig
kv_cache_specs = self.collective_rpc("get_kv_cache_spec") | ||
assert all(lc == kv_cache_specs[0] for lc in kv_cache_specs) | ||
return kv_cache_specs[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would this be extended later if you have different specs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't be extended. kv_cache_spec[i] is for all layers of one GPU. Different TP GPUs always have the same spec though the spec of each layer inside one GPU can be different.
PP executors of different stages can have different specs.
vllm/v1/utils.py
Outdated
Bind kv_caches to the forward context and model_runner's kv_cache. | ||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please elaborate more, because the concept of "binding" kv-cache is not common for most people. For example, bind what kv-cache to what, and what's the purpose.
vllm/v1/utils.py
Outdated
assert all(kv_caches[n] is kv_caches[layer_name] | ||
for n in layer_names[1:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment if you're going to extend this logic for xxx; otherwise it looks a bit weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to raise an error if multiple attention layers have the same layer_index.
Layer_name and layer_index are defined by model implementation instead of this pr. There will be no conflict of layer_index in decoder-only models.
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please also sync the main branch. The SPMD PR may have incompatible changes to this one
vllm/v1/core/kv_cache_utils.py
Outdated
Args: | ||
kv_cache_spec (KVCacheSpec): The KVCacheSpec of the model | ||
|
||
Returns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indent
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
@comaniac Thanks for the review. I've update this PR. |
Trying to fix the CI failure with #12138 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not really v1-specific, I think we should just make it for v0 and v1 directly. I really hate code duplication, it makes later bugfix and v0-v1 agnostic features difficult to develop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will move it to v0 & v1 common code if we need these interfaces for v0. But as I don't plan to support Hybrid allocator on v0, I prefer not to introduce these concepts in v0 at this moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is heavy code duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running the same test as v0, but the detail workflow is very different due to the dict->list and list->dict difference.
self._run_workers("compile_or_warm_up_model") | ||
|
||
def get_kv_cache_spec(self) -> KVCacheSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this executor is not used. you should add it in vllm/executor/executor_base.py
, and do not use _run_workers
. use collective_rpc
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This have been done in your PR #12171 . Thanks for it.
… to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]> Signed-off-by: ice-tong <[email protected]>
… to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]>
… to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]>
… to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]> Signed-off-by: Bowen Wang <[email protected]>
… to EngineCore (vllm-project#11960) Signed-off-by: Chen Zhang <[email protected]> Co-authored-by: Cody Yu <[email protected]>
This pr changes the workflow of
EngineCore._initialize_kv_caches
to enable more flexible control of kv cache format in the future.It is splitted from #11938 and is a preparation for #11382
Original workflow:
New workflow:
This pr introduces 2 new concepts:
KVCacheSpec
, a data structure to represent the kv cache needed by each attention layer, which is constructed by asking the model runner to analyze all Attention modules. Will add more types of Spec in the future, e.g.,SlidingWindowSpec
,MLASpec
KVCacheConfig
, a class to represent how to allocate the kv cache Tensor. It is quite simple now, i.e., tensors with the same size. But it may be extended to the following cases: