[Feature]: Support for 4-bit KV Cache in paged-attention op #4025

Closed
yukavio opened this issue Apr 12, 2024 · 8 comments


yukavio commented Apr 12, 2024

🚀 The feature, motivation and pitch

Summary

We would like to support a 4-bit KV cache for the decoding phase. The purpose of this feature is to reduce the GPU memory usage of the KV cache when processing long texts. A 4-bit KV cache would allow us to handle more and longer texts in situations where GPU memory is limited. Although vLLM currently has an fp8 implementation, int4 can further reduce GPU memory usage and can be used on devices that do not support the fp8 data format, such as the A100.
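For a rough sense of the savings, here is a back-of-the-envelope sizing sketch; the model dimensions, sequence length, and batch size below are illustrative assumptions, not tied to a specific checkpoint:

```python
# KV-cache bytes per token = 2 (K and V) * num_layers * num_kv_heads * head_size * bytes_per_element
num_layers, num_kv_heads, head_size = 80, 8, 128   # assumed 70B-class model with GQA
seq_len, batch_size = 8192, 4

def kv_cache_gib(bits_per_element: float) -> float:
    bytes_per_token = 2 * num_layers * num_kv_heads * head_size * bits_per_element / 8
    return bytes_per_token * seq_len * batch_size / 2**30

print(f"fp16 KV cache: {kv_cache_gib(16):.1f} GiB")  # 10.0 GiB
print(f"int4 KV cache: {kv_cache_gib(4):.1f} GiB")   # 2.5 GiB, plus a small scale/zero-point overhead
```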

Methods

Regarding the specific implementation, we propose developing three operations (a rough sketch of the quantization/dequantization logic follows this list):

  1. Develop an operation that calculates the scale and zero point required to quantize the KV cache and converts the existing fp16/bf16 KV cache to the int4 format.
  2. Provide support for storing the 4-bit KV cache in the "write_to_paged_cache" operation.
  3. Enhance the paged-attention operation to support calculations with an int4 KV cache:
    • Add optional inputs k_scale, k_zeropoint, v_scale, and v_zeropoint to the paged-attention operation.
    • In the paged-attention kernel, if the quantization-related parameters are provided, read the int4 KV cache from the GPU's global memory, convert it to an fp16/bf16 representation, and perform the subsequent calculations.
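For illustration, here is a minimal PyTorch sketch of the quantization/dequantization arithmetic described above. It assumes asymmetric per-(token, head) int4 quantization over the head dimension and a simple two-values-per-byte packing; the function names and packing layout are assumptions made for this sketch, not existing vLLM ops, and the real operations would be CUDA kernels alongside write_to_paged_cache and paged-attention.

```python
import torch

def quantize_kv_int4(kv: torch.Tensor, group_dim: int = -1):
    """Asymmetric int4 quantization of a KV-cache tensor.

    Computes a per-group scale and zero point along `group_dim`, maps values
    into [0, 15], and packs two 4-bit values into each uint8 element.
    Assumes the size of the last dimension is even.
    """
    mn = kv.amin(dim=group_dim, keepdim=True)
    mx = kv.amax(dim=group_dim, keepdim=True)
    scale = (mx - mn).clamp(min=1e-6) / 15.0
    zero_point = mn
    q = ((kv - zero_point) / scale).round().clamp(0, 15).to(torch.uint8)
    # Pack neighbouring pairs along the last dim: even index -> low nibble.
    packed = q[..., 0::2] | (q[..., 1::2] << 4)
    return packed, scale, zero_point

def dequantize_kv_int4(packed: torch.Tensor, scale: torch.Tensor,
                       zero_point: torch.Tensor, dtype=torch.float16):
    """Unpack the int4 values and reconstruct an approximate fp16/bf16 tensor.

    This mirrors what the paged-attention kernel would do on the fly (step 3)
    before computing attention in fp16/bf16.
    """
    low = packed & 0x0F
    high = packed >> 4
    q = torch.stack((low, high), dim=-1).flatten(-2)  # restore original layout
    return q.to(dtype) * scale.to(dtype) + zero_point.to(dtype)

# Round-trip check on a toy key tensor: [num_tokens, num_kv_heads, head_size]
k = torch.randn(2, 8, 128)
packed, k_scale, k_zeropoint = quantize_kv_int4(k)
k_approx = dequantize_kv_int4(packed, k_scale, k_zeropoint, dtype=torch.float32)
print((k - k_approx).abs().max())  # quantization error, roughly bounded by scale / 2
```

Under this kind of scheme, the optional k_scale/k_zeropoint and v_scale/v_zeropoint inputs from step 3 would simply be the scale and zero_point tensors produced at quantization time, stored in fp16/fp32 alongside the packed int4 cache.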

Alternatives

No response

Additional context

No response

@smallsunsun1

Any update on this new feature? I am looking for it. In my scenario I mainly use A10 cards, and an int4 KV cache would effectively increase my KV cache size.


yukavio commented Apr 19, 2024

I am currently focused primarily on developing this issue, and I plan to start working on the int4 KV cache in May.


yukavio commented May 7, 2024

I'm very sorry, but for various reasons I need to put the development of this issue on hold. If anyone else is interested in this feature, we can reopen this issue.

yukavio closed this as completed on May 7, 2024

houmie commented May 7, 2024

The problem is that vLLM doesn't support exl2, which would have given us many more options.
So we are pretty much stuck with AWQ quantisation. Currently, vLLM with Llama 3 70B doesn't fit properly on a 48 GB GPU despite 4-bit AWQ quantisation. I even enabled enforce_eager, but it still runs out of memory sometimes; it is not stable enough.

Having this Q4 cache would reduce VRAM usage a bit more, which would be very helpful.

Of course, the best solution would be to support exl2, in which case this feature could be de-prioritised. But right now it's difficult to justify vLLM when Aphrodite supports exl2 out of the box and fits properly on a 48 GB GPU.


houmie commented May 7, 2024

Sorry, I forgot to tag you @yukavio.
Thanks

@SherrySwift

Hi, is there any plan to support 4-bit KV Cache recently?

I am currently focused primarily on developing this issue, and I plan to start working on the int4 KV cache in May.


fzyzcjy commented Nov 26, 2024

Hi, is there any updates? Thanks!

@puppetm4st3r

Any news on this feature? 🤘
