
Hierarchical Caching for SGLang #2693

Open · wants to merge 54 commits into main
Conversation

@xiezhq-hermann (Collaborator) commented Jan 1, 2025

Motivation

While RadixTree-based context caching provides significant performance benefits, these gains are not always fully realized. A key bottleneck is the capacity limit of GPU memory. Currently, SGLang stores historical KV caches exclusively in GPU memory; whenever more memory is required for batch execution, existing caches are discarded.

To address this issue, we propose a hierarchical caching mechanism for LLM serving, treating GPU memory as an L1 cache, host memory as an L2 cache, and disk as an L3 cache (future). This PR introduces such a mechanism in SGLang through a separate host memory pool that backs up KV caches, allowing them to be reloaded into GPU memory when needed.

Modifications

  • A HiRadixCache that extends RadixCache with host memory addresses and synchronization mechanisms.
  • A host memory pool that synchronizes with the device memory pool of KV caches.
  • A memory controller that implements efficient data transfer between host and device, and handles various cache write policies for hierarchical caching (a rough sketch of how these pieces fit together follows this list).
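
A minimal sketch of the bookkeeping described above; the class and field names here are illustrative assumptions, not the PR's actual code:

```python
# Illustrative sketch only: names and fields are hypothetical and do not
# necessarily match the PR's actual HiRadixCache implementation.
from dataclasses import dataclass, field
from typing import Dict, Optional

import torch


@dataclass
class HiRadixNode:
    """A radix-tree node whose KV cache may live on device, on host, or both."""
    key: tuple                                     # token ids covered by this node
    device_indices: Optional[torch.Tensor] = None  # slots in the GPU KV pool (L1)
    host_indices: Optional[torch.Tensor] = None    # slots in the host KV pool (L2)
    children: Dict[int, "HiRadixNode"] = field(default_factory=dict)
    lock_ref: int = 0                              # >0 while an in-flight batch uses it

    @property
    def backed_up(self) -> bool:
        # True if the KV data also exists in host memory, so the GPU copy
        # can be dropped without losing the cached prefix.
        return self.host_indices is not None

    def evict_device(self) -> Optional[torch.Tensor]:
        """Free the GPU copy (only safe if backed up and not referenced)."""
        if self.lock_ref == 0 and self.backed_up and self.device_indices is not None:
            freed = self.device_indices
            self.device_indices = None
            return freed  # caller returns these slots to the device pool
        return None
```

The key invariant in such a design is that a node's GPU copy is only evicted once a host copy exists, so the prefix can later be reloaded instead of recomputed.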

Todo:

  • Update benchmark results.
  • Remove deprecated design and implementation.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Jan 1, 2025

It's amazing! Happy new year!

@zhyncs added the "enhancement" label on Jan 1, 2025
@msharmavikram commented Jan 22, 2025

> Hi,
>
> Thanks for the great work. I am leaving a comment because my team is working on something similar.
>
> As I understand your PR, hierarchical radix caching treats a single sequence as the unit of offloading. Is this correct? That is, it offloads entire sequences (except the current in-progress one) at the sequence level.
>
> In contrast, our work-in-progress PR works closely with our block-sparse attention mechanism: pruned KV pages are dynamically offloaded during decoding, and the necessary KV pages are dynamically fetched from the CPU using CUDA UVM. In other words, our offloading is done at the KV-page level.
>
> We are almost done implementing our dynamic KV cache offloading during decoding and chunked prefill for our hierarchically pruned attention (HiP attention). Our HiP attention mechanism can serve sequences longer than the pre-trained model's limit without significant throughput or quality degradation, while providing near-linear attention complexity in prefill and decoding as well as training-free context extension. I have not yet pushed the updated library for the HiP attention kernels, but it is almost done. Hopefully we will send a new PR within the next month.
>
> Currently, we don't support NVMe-level offloading because we rely on CUDA UVM for CPU-GPU communication inside the attention kernel.
>
> I think combining our offloading-aware attention mechanism with this PR could create strong synergy. I wonder if we can integrate the two in the near future.
>
> Thanks. Heejun and @mujjingun

Great work. However, UVM will only provide an extension to CPU memory, and going to disk will cause significant performance regressions. The right design for KV cache offloading is explicit management.

Pruning is a different concept and should not be attached to this PR. It should be a separate technique.

Please understand that pruning concerns how the KV cache is organized and stored, not how it is managed. This PR targets the management side of the KV cache!

@zhyncs @ByronHsu Hey fellas - now you know what I was up to :P

@gmlwns2000

@msharmavikram Thanks for the advice!

I was planning to open a new PR (adding HiP attention, training-free context extension, and UVM KV cache offloading for decoding), but I wanted to ask the author of this PR whether integrating Hierarchical Caching with my method is a good idea before I try it.

I think hierarchical caching and UVM caching should be integrated because, in long-context scenarios, GPU memory is a serious limitation in many use cases (consumer GPUs such as the 4090 only have 24 GB, which is enough for roughly 64-128K tokens, but I want to handle around 1M tokens to match Gemini). We can extend the length of a single sequence by using UVM, but then we run out of CPU and GPU memory and cannot utilize radix caching properly. That is why I am looking into this PR.
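
To make the sizing intuition above concrete, here is a rough back-of-envelope estimate; the model configuration is an assumed example (an 8B-class model with GQA), not a measurement from this thread:

```python
# Back-of-envelope KV-cache sizing for an assumed 8B-class model with GQA.
# All numbers are illustrative assumptions, not results from this PR.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # fp16/bf16

# K and V per token, summed across all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token / 1024, "KiB per token")  # 128.0 KiB

for tokens in (128_000, 1_000_000):
    gib = tokens * kv_bytes_per_token / 1024**3
    print(f"{tokens:>9} tokens -> {gib:6.1f} GiB of KV cache")
# 128K tokens is ~15.6 GiB, already tight next to the weights on a 24 GB card;
# 1M tokens is ~122 GiB, which only fits if most of it is spilled to host memory.
```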

However, I am currently working on other things (paper writing), so my new PR is getting delayed. I am sorry about that.


In addition, @xiezhq-hermann, I have concerns about the license of my HiP attention in my future PR. Could you check the new discussion #3042 I just opened?

@msharmavikram

Hierarchical caching and UVM caching are not the same. Hierarchical caching can use UVM as one mechanism or can work without it. What I am trying to say is that hierarchical caching is a superset, and it can be achieved by many mechanisms, UVM among them. This is why I strongly recommend a separate PR that extends this work so that both mechanisms (with and without UVM) are supported.

@gmlwns2000

Now I understand that hierarchical caching aims to be a more general framework than I thought. I will keep watching this PR, and in the future I will implement hierarchical caching for my attention mechanism by following the implementation proposed here.

Thanks!

@Edenzzzz (Contributor)

> Hierarchical caching and UVM caching are not the same. Hierarchical caching can use UVM as one mechanism or can work without it. What I am trying to say is that hierarchical caching is a superset, and it can be achieved by many mechanisms, UVM among them. This is why I strongly recommend a separate PR that extends this work so that both mechanisms (with and without UVM) are supported.

I think UVM relies on page faults to fetch data, which has much higher overhead than an explicit cache controller and can cause thrashing. You can indeed use cudaMemPrefetchAsync and cudaMemAdvise, but PyTorch does not support them (pytorch/pytorch#106200), probably for the above reasons.
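
For contrast with UVM's page-fault handling, an explicitly managed path can stage KV data in pinned host memory and overlap transfers with compute on a side CUDA stream. The following is a generic PyTorch sketch of that idea; names, shapes, and layout are illustrative and not code from this PR:

```python
# Generic sketch of explicit host<->device KV movement with PyTorch.
# Not taken from this PR; sizes and names are illustrative assumptions.
import torch

num_tokens, kv_dim = 4096, 1024

# L2: pinned host buffer so asynchronous DMA copies are possible
host_pool = torch.empty(num_tokens, kv_dim, dtype=torch.float16, pin_memory=True)
# L1: device KV pool
device_pool = torch.empty(num_tokens, kv_dim, dtype=torch.float16, device="cuda")

copy_stream = torch.cuda.Stream()

def write_back(start: int, end: int) -> torch.cuda.Event:
    """Back up a contiguous range of device rows to host on the side stream."""
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        # slicing returns views, so this writes into the pools in place
        host_pool[start:end].copy_(device_pool[start:end], non_blocking=True)
        done.record(copy_stream)
    return done  # wait on this event before evicting the device rows

def load(start: int, end: int) -> torch.cuda.Event:
    """Reload a previously evicted range from host back to device."""
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        device_pool[start:end].copy_(host_pool[start:end], non_blocking=True)
        done.record(copy_stream)
    return done

# Example: overlap a reload with compute, then block the compute stream on it.
# A real controller also has to order copy_stream after any prior compute-stream
# writes to these rows; that synchronization is omitted here for brevity.
evt = load(0, 1024)
torch.cuda.current_stream().wait_stream(copy_stream)  # or evt.wait()
```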

@gmlwns2000

@Edenzzzz Yes, I used cudaMemAdvise to keep the pages mostly in CPU memory, so if my understanding is correct, the latency from the CPU side should be stable. However, I am not sure that is fast enough, because I have never measured the CPU read latency after a few iterations of decoding requests. Your concern is quite important, and I will check this issue when I start integrating the UVM cache into Hierarchical Caching as a cache layer.

Thanks for the comment.

@xiezhq-hermann (Collaborator, Author)

After code cleanup and a basic performance benchmark, this PR is ready to merge. You can add the --enable-hierarchical-cache option when starting an SGLang server to turn the feature on. The feature will remain under active development in the coming months, and your feedback is greatly welcome : )
Below is a throughput vs. median TTFT curve that demonstrates the benefit of hierarchical caching on a synthetic multi-turn benchmark; you can reproduce it with Qwen/Qwen2.5-14B-Instruct on an A100-80G GPU as explained here:

[Figure: throughput vs. median TTFT curve on the multi-turn benchmark]
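
For reference, a usage sketch is shown below. Only the --enable-hierarchical-cache flag and the model name come from this PR; the launch module, port, and endpoint are assumptions based on typical SGLang usage:

```python
# Hypothetical usage sketch: the launch module, default port, and /generate
# endpoint are assumptions from standard SGLang usage; only the
# --enable-hierarchical-cache flag and model name come from this PR.
#
# Start the server (shell):
#   python -m sglang.launch_server \
#       --model-path Qwen/Qwen2.5-14B-Instruct \
#       --enable-hierarchical-cache
#
# Then reuse a long shared prefix across turns, so evicted KV can be reloaded
# from host memory instead of being recomputed:
import requests

prompt_prefix = "You are a helpful assistant. " * 200  # long shared context

for turn in ("First question?", "Second question?"):
    resp = requests.post(
        "http://127.0.0.1:30000/generate",  # assumed default port/endpoint
        json={
            "text": prompt_prefix + turn,
            "sampling_params": {"max_new_tokens": 32},
        },
    )
    print(resp.json())
```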
