Enhance cache service metrics #13
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Commits (force-pushed from f93fa67 to bc4f074):
- change cache metrics to maintain fixed-sized queue
- Adding metrics to llm engine stats and exposed through prometheus gauge
- clear metrics counters upon update
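The first commit mentions keeping the cache metrics in a fixed-size queue. A minimal sketch of that pattern, assuming the metrics are plain timing samples; the class and field names here are illustrative, not the PR's actual code:

    from collections import deque

    class CacheServiceMetrics:
        """Illustrative sketch: bounded buffers for cache-service timing samples."""

        def __init__(self, maxlen: int = 1000):
            # deque(maxlen=...) drops the oldest entry once full, so the
            # buffers stay fixed-size no matter how long the engine runs.
            self.time_query = deque(maxlen=maxlen)
            self.time_unload = deque(maxlen=maxlen)

        def add_time_query(self, seconds: float) -> None:
            self.time_query.append(seconds)

        def add_time_unload(self, seconds: float) -> None:
            self.time_unload.append(seconds)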
@happyandslow please help address the comments as well. We need to ack the reviewer's comments.
vllm/engine/metrics.py
Outdated
@@ -190,6 +190,89 @@ def __init__(self, labelnames: List[str], max_model_len: int):
            labelnames=labelnames,
            multiprocess_mode="sum",
        )

        self.gauge_cache_service_tokens_hit_rate = self._gauge_cls(
            name="vllm:gauge_cache_service_tokens_hit_rate",
The naming is kind of inconsistent here. Other metric strings don't include the metric type; for example, "vllm:gauge_cache_service_tokens_hit_rate" should be "vllm:cache_service_tokens_hit_rate". Can you update the name to be consistent?
Check the existing metric names -- all metric attributes start with the metric type they use (e.g., vllm-project#81: self.gauge_gpu_prefix_cache_hit_rate = self._gauge_cls(...)).
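For context, the convention being cited looks roughly like this (a sketch using prometheus_client directly; the real code goes through the self._gauge_cls wrapper, and the documentation string here is made up):

    from prometheus_client import Gauge

    # The Python attribute name carries the metric type as a prefix
    # ("gauge_..."), while the exported Prometheus string does not repeat it.
    gauge_cache_service_tokens_hit_rate = Gauge(
        name="vllm:cache_service_tokens_hit_rate",  # no "gauge_" in the string
        documentation="Token hit rate of the cache service.",
        labelnames=["model_name"],
    )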
@happyandslow Could you look at the review comments? It seems the new commit didn't address them.
addressing comments
Overall looks good to me. Let's merge it and run some testing.
vllm/engine/llm_engine.py
Outdated
        self.cache_service_metrics.time_load.clear()
        self.cache_service_metrics.time_reshape.clear()
        self.cache_service_metrics.time_unload.clear()
        self.cache_service_metrics.time_update.clear()
Can we simply swap the corresponding lists, e.g., cache_service_time_query (initially empty) with self.cache_service_metrics.time_query, instead of copying the whole list of elements?
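A self-contained sketch of the suggested swap, assuming the metrics fields are plain Python lists as in the diff above (CacheServiceMetrics here is a stand-in, not the PR's class):

    class CacheServiceMetrics:
        """Stand-in container, assuming plain list fields."""
        def __init__(self):
            self.time_query = []

    metrics = CacheServiceMetrics()
    metrics.time_query.extend([0.01, 0.02, 0.03])

    # Swap references instead of copying every element and then clearing:
    cache_service_time_query, metrics.time_query = metrics.time_query, []

    assert metrics.time_query == []                        # accumulates anew
    assert cache_service_time_query == [0.01, 0.02, 0.03]  # caller owns samples

The swap is O(1) regardless of how many samples have accumulated, and it makes the separate .clear() calls unnecessary.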
        duration = copy_start.elapsed_time(copy_end) / 1000.0
        self.metrics.add_time_unload(duration)

        torch.cuda.synchronize()
Is this torch.cuda.synchronize() being used to measure execution time accurately, or is it intended as a synchronization barrier between the copy_ and reshape_and_cache_flash?
I kind of feel this is redundant.
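For reference, the standard CUDA-event timing pattern the snippet appears to follow (a sketch, not the PR's code; y = x * 2.0 stands in for the measured op):

    import torch

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    x = torch.randn(1 << 20, device="cuda")
    start.record()      # enqueued on the current stream
    y = x * 2.0         # stand-in for the copy_ / reshape_and_cache_flash op
    end.record()

    # elapsed_time() requires both events to have completed, so synchronize
    # (or call end.synchronize()) before reading it; it returns milliseconds.
    torch.cuda.synchronize()
    duration = start.elapsed_time(end) / 1000.0  # seconds

Since ops issued on the same CUDA stream already execute in order, a synchronize() used purely as an ordering barrier between two same-stream kernels adds nothing, which is the redundancy being pointed out.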