
Hierarchical Caching for SGLang #2693

Open · wants to merge 54 commits into main
Conversation

@xiezhq-hermann (Collaborator) commented Jan 1, 2025

Motivation

While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.

To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.

Modifications

  • A HiRadixCache that extends RadixCache with host memory addresses and synchronization mechanisms.
  • A host memory pool that synchronizes with the device memory pool of KV caches.
  • A memory controller that implements efficient data transfer between host and device, and handles various cache write policies for hierarchical caching (a rough sketch of how these pieces fit together follows this list).
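
A minimal sketch of the bookkeeping described above; the class and field names here are illustrative assumptions, not the PR's actual code:

```python
# Illustrative sketch only: names and fields are hypothetical and do not
# necessarily match the PR's actual HiRadixCache implementation.
from dataclasses import dataclass, field
from typing import Dict, Optional

import torch


@dataclass
class HiRadixNode:
    """A radix-tree node whose KV cache may live on device, on host, or both."""
    key: tuple                                     # token ids covered by this node
    device_indices: Optional[torch.Tensor] = None  # slots in the GPU KV pool (L1)
    host_indices: Optional[torch.Tensor] = None    # slots in the host KV pool (L2)
    children: Dict[int, "HiRadixNode"] = field(default_factory=dict)
    lock_ref: int = 0                              # >0 while an in-flight batch uses it

    @property
    def backed_up(self) -> bool:
        # True if the KV data also exists in host memory, so the GPU copy
        # can be dropped without losing the cached prefix.
        return self.host_indices is not None

    def evict_device(self) -> Optional[torch.Tensor]:
        """Free the GPU copy (only safe if backed up and not referenced)."""
        if self.lock_ref == 0 and self.backed_up and self.device_indices is not None:
            freed = self.device_indices
            self.device_indices = None
            return freed  # caller returns these slots to the device pool
        return None
```

The key invariant in such a design is that a node's GPU copy is only evicted once a host copy exists, so the prefix can later be reloaded instead of recomputed.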

Todo:

  • Update benchmark results.
  • Remove deprecated design and implementation.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Jan 1, 2025

It's amazing! Happy new year!

@zhyncs added the "enhancement" label on Jan 1, 2025
@msharmavikram commented Jan 22, 2025

> Hi,
>
> Thanks for the great work. I am leaving a comment because my team is working on something similar.
>
> As I understand your PR, hierarchical radix caching treats a single sequence as the unit of offloading. Is this correct? That is, it offloads entire sequences (except the current in-progress one) at the sequence level.
>
> In contrast, our work-in-progress PR works closely with our block-sparse attention mechanism: pruned KV pages are dynamically offloaded during decoding, and the necessary KV pages are dynamically fetched from the CPU using CUDA UVM. In other words, our offloading is done at the KV-page level.
>
> We are almost done implementing our dynamic KV cache offloading during decoding and chunked prefill for our hierarchically pruned attention (HiP attention). Our HiP attention mechanism can serve sequences longer than the pre-trained model's limit without significant throughput or quality degradation, while providing near-linear attention complexity in prefill and decoding as well as training-free context extension. I have not yet pushed the updated library for the HiP attention kernels, but it is almost done. Hopefully we will send a new PR within the next month.
>
> Currently, we don't support NVMe-level offloading because we rely on CUDA UVM for CPU-GPU communication inside the attention kernel.
>
> I think combining our offloading-aware attention mechanism with this PR could create strong synergy. I wonder if we can integrate the two in the near future.
>
> Thanks. Heejun and @mujjingun

Great work. However, UVM will only provide an extension to CPU memory, and going to disk will cause significant performance regressions. The right design for KV cache offloading is explicit management.

Pruning is a different concept and should not be attached to this PR. It should be a separate technique.

Please understand that pruning concerns how the KV cache is organized and stored, not how it is managed. This PR targets the management side of the KV cache!

@zhyncs @ByronHsu Hey fellas - now you know what I was up to :P

@gmlwns2000

@msharmavikram Thanks for the advice!

I was planning to open a new PR (adding HiP attention, training-free context extension, and UVM KV cache offloading for decoding), but I wanted to ask the author of this PR whether integrating Hierarchical Caching with my method is a good idea before I try it.

I think hierarchical caching and UVM caching should be integrated because, in long-context scenarios, GPU memory is a serious limitation in many use cases (consumer GPUs such as the 4090 only have 24 GB, which is enough for roughly 64-128K tokens, but I want to handle around 1M tokens to match Gemini). We can extend the length of a single sequence by using UVM, but then we run out of CPU and GPU memory and cannot utilize radix caching properly. That is why I am looking into this PR.
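
To make the sizing intuition above concrete, here is a rough back-of-envelope estimate; the model configuration is an assumed example (an 8B-class model with GQA), not a measurement from this thread:

```python
# Back-of-envelope KV-cache sizing for an assumed 8B-class model with GQA.
# All numbers are illustrative assumptions, not results from this PR.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # fp16/bf16

# K and V per token, summed across all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token / 1024, "KiB per token")  # 128.0 KiB

for tokens in (128_000, 1_000_000):
    gib = tokens * kv_bytes_per_token / 1024**3
    print(f"{tokens:>9} tokens -> {gib:6.1f} GiB of KV cache")
# 128K tokens is ~15.6 GiB, already tight next to the weights on a 24 GB card;
# 1M tokens is ~122 GiB, which only fits if most of it is spilled to host memory.
```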

However, I am currently working on other things (paper writing), so my new PR is getting delayed. I am sorry about that.


In addition, @xiezhq-hermann, I have concerns about the license of my HiP attention in my future PR. Could you check the new discussion #3042 I just opened?

@msharmavikram

Hierarchical caching and UVM caching are not the same. Hierarchical caching can use UVM as one mechanism or can work without it. What I am trying to say is that hierarchical caching is a superset, and it can be achieved by many mechanisms, UVM among them. This is why I strongly recommend a separate PR that extends this work so that both mechanisms (with and without UVM) are supported.

@gmlwns2000

Now I understand that hierarchical caching aims to be a more general framework than I thought. I will keep watching this PR, and in the future I will implement hierarchical caching for my attention mechanism by following the implementation proposed here.

Thanks!

@Edenzzzz (Contributor)

> Hierarchical caching and UVM caching are not the same. Hierarchical caching can use UVM as one mechanism or can work without it. What I am trying to say is that hierarchical caching is a superset, and it can be achieved by many mechanisms, UVM among them. This is why I strongly recommend a separate PR that extends this work so that both mechanisms (with and without UVM) are supported.

I think UVM relies on page faults to fetch data, which has much higher overhead than an explicit cache controller and can cause thrashing. You can indeed use cudaMemPrefetchAsync and cudaMemAdvise, but PyTorch does not support them (pytorch/pytorch#106200), probably for the above reasons.
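
For contrast with UVM's page-fault handling, an explicitly managed path can stage KV data in pinned host memory and overlap transfers with compute on a side CUDA stream. The following is a generic PyTorch sketch of that idea; names, shapes, and layout are illustrative and not code from this PR:

```python
# Generic sketch of explicit host<->device KV movement with PyTorch.
# Not taken from this PR; sizes and names are illustrative assumptions.
import torch

num_tokens, kv_dim = 4096, 1024

# L2: pinned host buffer so asynchronous DMA copies are possible
host_pool = torch.empty(num_tokens, kv_dim, dtype=torch.float16, pin_memory=True)
# L1: device KV pool
device_pool = torch.empty(num_tokens, kv_dim, dtype=torch.float16, device="cuda")

copy_stream = torch.cuda.Stream()

def write_back(start: int, end: int) -> torch.cuda.Event:
    """Back up a contiguous range of device rows to host on the side stream."""
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        # slicing returns views, so this writes into the pools in place
        host_pool[start:end].copy_(device_pool[start:end], non_blocking=True)
        done.record(copy_stream)
    return done  # wait on this event before evicting the device rows

def load(start: int, end: int) -> torch.cuda.Event:
    """Reload a previously evicted range from host back to device."""
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        device_pool[start:end].copy_(host_pool[start:end], non_blocking=True)
        done.record(copy_stream)
    return done

# Example: overlap a reload with compute, then block the compute stream on it.
# A real controller also has to order copy_stream after any prior compute-stream
# writes to these rows; that synchronization is omitted here for brevity.
evt = load(0, 1024)
torch.cuda.current_stream().wait_stream(copy_stream)  # or evt.wait()
```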

@gmlwns2000

@Edenzzzz Yes, I used cudaMemAdvise to keep the pages mostly in CPU memory, so if my understanding is correct, the latency from the CPU side should be stable. However, I am not sure that is fast enough, because I have never measured the CPU read latency after a few iterations of decoding requests. Your concern is quite important, and I will check this issue when I start integrating the UVM cache into Hierarchical Caching as a cache layer.

Thanks for the comment.

@xiezhq-hermann (Collaborator, Author)

After code cleanup and a basic performance benchmark, this PR is ready to merge. You can add the --enable-hierarchical-cache option when starting an SGLang server to turn the feature on. The feature will remain under active development in the coming months, and your feedback is greatly welcome : )
Below is a throughput vs. median TTFT curve that demonstrates the benefit of hierarchical caching on a synthetic multi-turn benchmark; you can reproduce it with Qwen/Qwen2.5-14B-Instruct on an A100-80G GPU as explained here:

[Figure: throughput vs. median TTFT curve on the multi-turn benchmark]
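
For reference, a usage sketch is shown below. Only the --enable-hierarchical-cache flag and the model name come from this PR; the launch module, port, and endpoint are assumptions based on typical SGLang usage:

```python
# Hypothetical usage sketch: the launch module, default port, and /generate
# endpoint are assumptions from standard SGLang usage; only the
# --enable-hierarchical-cache flag and model name come from this PR.
#
# Start the server (shell):
#   python -m sglang.launch_server \
#       --model-path Qwen/Qwen2.5-14B-Instruct \
#       --enable-hierarchical-cache
#
# Then reuse a long shared prefix across turns, so evicted KV can be reloaded
# from host memory instead of being recomputed:
import requests

prompt_prefix = "You are a helpful assistant. " * 200  # long shared context

for turn in ("First question?", "Second question?"):
    resp = requests.post(
        "http://127.0.0.1:30000/generate",  # assumed default port/endpoint
        json={
            "text": prompt_prefix + turn,
            "sampling_params": {"max_new_tokens": 32},
        },
    )
    print(resp.json())
```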
