Hierarchical Caching for SGLang #2693
Conversation
It's amazing! Happy New Year!
Force-pushed from 2bd500a to 1853cf2: vanilla and selective write through … transfer overhead
Great work. However, UVM only provides an extension to CPU memory, and going to disk will cause significant performance regressions. The right design for KV cache offloading is explicit management. Pruning is a different concept and should not be attached to this PR; it should be a separate technique. Please understand that pruning concerns how the KV cache is organized and stored, not how it is managed, while this PR targets the management side of the KV cache! @zhyncs @ByronHsu Hey fellas - now you know what I was up to :P
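To make the "explicit management" point concrete, here is a minimal sketch (not this PR's code; the page layout and the `offload_page`/`load_page` helpers are invented for illustration) of KV pages moved between a device pool and pinned host buffers on a dedicated copy stream, so the cache controller decides when data moves rather than relying on UVM page faults:

```python
# Illustrative sketch of explicit KV-page offloading, not the PR's code.
import torch

NUM_PAGES, PAGE_SIZE, HEAD_DIM = 1024, 16, 128  # hypothetical layout
device_kv = torch.empty(NUM_PAGES, PAGE_SIZE, HEAD_DIM,
                        dtype=torch.float16, device="cuda")
host_kv = torch.empty_like(device_kv, device="cpu").pin_memory()
copy_stream = torch.cuda.Stream()  # copies overlap with compute

def offload_page(page_id: int) -> torch.cuda.Event:
    """Write one KV page back to pinned host memory (D2H)."""
    with torch.cuda.stream(copy_stream):
        host_kv[page_id].copy_(device_kv[page_id], non_blocking=True)
        done = torch.cuda.Event()
        done.record()
    return done  # free the device page only after done.query() is True

def load_page(page_id: int) -> torch.cuda.Event:
    """Bring one KV page back to the device (H2D)."""
    with torch.cuda.stream(copy_stream):
        device_kv[page_id].copy_(host_kv[page_id], non_blocking=True)
        done = torch.cuda.Event()
        done.record()
    return done
```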
@msharmavikram Thanks for the advice! I was planning to make a new PR (adding HiP attention, supporting training-free context extension, supporting UVM KV cache offloading for decoding), but I just wanted to ask the author of this PR whether integrating Hierarchical Caching with my method is a good idea before I try it. I think hierarchical caching and UVM caching should be integrated because, in long-context scenarios, GPU memory is severely limited in many use cases (consumer GPUs such as the 4090 only have 24GB, which is enough to handle 64~128K tokens, but I want to handle around 1M tokens to match Gemini). We can extend the single-sequence length by using UVM, but then we run out of CPU-GPU memory and cannot utilize radix caching properly. That is why I am looking into this PR. However, I am currently working on other things (paper writing), so my new PR is getting delayed; I am sorry about that. In addition, @xiezhq-hermann, I have concerns about the license of my HiP attention in my future PR. Can you check the new #3042 discussion I just made?
Hierarchical caching and UVM caching are not the same. Hierarchical caching can use UVM as a mechanism or can do without it. What I am trying to say is: hierarchical caching is a superset, and it can be achieved by many mechanisms, UVM among them. This is why I strongly recommended a separate PR that extends this work so that both mechanisms are supported (with UVM and without).
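One way to read the "superset" argument is as an interface boundary: the hierarchical cache holds the policy, and the second tier is a pluggable mechanism. A hedged sketch, with all names (`SecondTier`, `ExplicitCopyTier`) invented for illustration:

```python
# Hedged sketch: hierarchical caching as the policy layer, with the second
# tier as a pluggable mechanism. All names here are illustrative.
from abc import ABC, abstractmethod
import torch

class SecondTier(ABC):
    """Where evicted KV pages go; the radix-cache policy does not care how."""
    @abstractmethod
    def write_back(self, page_id: int, page: torch.Tensor) -> None: ...
    @abstractmethod
    def load(self, page_id: int, out: torch.Tensor) -> None: ...

class ExplicitCopyTier(SecondTier):
    """Mechanism 1: pinned host tensors with explicit copies."""
    def __init__(self, num_pages: int, page_shape: tuple):
        self.store = torch.empty(num_pages, *page_shape).pin_memory()
    def write_back(self, page_id, page):
        self.store[page_id].copy_(page, non_blocking=True)
    def load(self, page_id, out):
        out.copy_(self.store[page_id], non_blocking=True)

# Mechanism 2 would wrap a managed (UVM) allocation behind the same
# interface; the policy layer above stays unchanged either way.
```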
Now I understand that the concept of hierarchical coaching is trying to aim more general framework than what I thought. I will keep watching this PR, and I will implement this hierarchical caching for my attention mechanism in the future by following the proposed implementation in this PR. Thanks! |
I think UVM relies on page faults to fetch data, which has much higher overhead than writing a cache controller and can cause thrashing? You can indeed use `cudaMemAdvise` to keep the pages mostly on the CPU.
@Edenzzzz Yes, I used `cudaMemAdvise` to make the pages stay mostly in the CPU. So, if my understanding is correct, the latency should be stable from the CPU side. However, I am not sure whether that speed is enough, because I have never tested the CPU read latency after a few iterations of decoding requests. I think your concern is quite important, and I will check this issue when I start integrating the UVM cache into Hierarchical Caching as a cache layer. Thanks for the comment.
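For reference, a hedged sketch of the advice pattern discussed above, assuming NVIDIA's `cuda-python` runtime bindings (`cuda.cudart`); the binding and enum names are taken from those bindings as I understand them, and the allocation size is illustrative:

```python
# Hedged sketch, assuming NVIDIA's cuda-python runtime bindings.
from cuda import cudart

nbytes = 1 << 30  # 1 GiB managed region for KV pages (illustrative)

err, ptr = cudart.cudaMallocManaged(nbytes, cudart.cudaMemAttachGlobal)
assert err == cudart.cudaError_t.cudaSuccess

# Prefer host residency: GPU reads then go over the interconnect at a
# predictable cost instead of migrating (and potentially thrashing) pages.
(err,) = cudart.cudaMemAdvise(
    ptr, nbytes,
    cudart.cudaMemoryAdvise.cudaMemAdviseSetPreferredLocation,
    -1,  # cudaCpuDeviceId is defined as -1 in the CUDA runtime headers
)
assert err == cudart.cudaError_t.cudaSuccess
```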
After code cleaning and a basic performance benchmark, this PR is ready to merge. You can add `--enable-hierarchical-cache` to the server arguments to enable it.
Motivation
While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.
To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.
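A minimal sketch of the L1/L2 relationship described above, assuming a pinned host pool that backs up evicted KV pages; `HostKVPool` and its methods are illustrative names, not the PR's classes:

```python
# Illustrative sketch of the mechanism: GPU pool is L1, a pinned host pool
# is L2, and evicted KV pages are backed up so they can be reloaded into
# GPU memory instead of being recomputed.
import torch

class HostKVPool:
    """L2: pinned host copies of KV pages, keyed by a logical page id."""
    def __init__(self, num_pages: int, page_shape, dtype=torch.float16):
        self.store = torch.empty(num_pages, *page_shape,
                                 dtype=dtype).pin_memory()
        self.valid = [False] * num_pages

    def backup(self, page_id: int, device_page: torch.Tensor) -> None:
        """Write-through: keep a host copy so the GPU page is evictable."""
        self.store[page_id].copy_(device_page, non_blocking=True)
        self.valid[page_id] = True

    def reload(self, page_id: int, device_page: torch.Tensor) -> None:
        """Reload a previously backed-up page into GPU memory."""
        assert self.valid[page_id], "page was never written through"
        device_page.copy_(self.store[page_id], non_blocking=True)
```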
Modifications
- A new data structure `HiRadixCache` that extends `RadixCache` with host memory addresses and synchronization mechanisms (a hedged sketch follows the todo list).

Todo:
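A hedged sketch of what "host memory addresses and synchronization mechanisms" on a radix node could look like; the field and class names are invented for illustration, not taken from the PR:

```python
# Illustrative sketch: each radix-tree node tracks where its KV slice lives
# (device, host, or both) and whether an in-flight write-through copy must
# finish before the node's GPU memory can be reclaimed.
from dataclasses import dataclass, field
from typing import Dict, Optional
import torch

@dataclass
class HiRadixNode:
    children: Dict[int, "HiRadixNode"] = field(default_factory=dict)
    device_indices: Optional[torch.Tensor] = None  # slots in the GPU pool (L1)
    host_indices: Optional[torch.Tensor] = None    # slots in the host pool (L2)
    write_through_event: Optional[torch.cuda.Event] = None  # D2H in flight

    def evictable(self) -> bool:
        # Safe to drop the GPU copy only once the host backup has landed.
        return self.host_indices is not None and (
            self.write_through_event is None
            or self.write_through_event.query()
        )
```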
Checklist