[Feature]: Support for 4-bit KV Cache in paged-attention op #4025

Closed
yukavio opened this issue Apr 12, 2024 · 8 comments


yukavio commented Apr 12, 2024

🚀 The feature, motivation and pitch

Summary

We would like to support a 4-bit KV cache for the decoding phase. The purpose of this feature is to reduce the GPU memory usage of the KV cache when processing long texts. A 4-bit KV cache would allow us to handle more and longer texts in situations where GPU memory is limited. Although vLLM currently has an fp8 implementation, int4 can further reduce GPU memory usage and can be used on devices that do not support the fp8 data format, such as the A100.
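For a rough sense of the savings, here is a back-of-the-envelope sizing sketch; the model dimensions, sequence length, and batch size below are illustrative assumptions, not tied to a specific checkpoint:

```python
# KV-cache bytes per token = 2 (K and V) * num_layers * num_kv_heads * head_size * bytes_per_element
num_layers, num_kv_heads, head_size = 80, 8, 128   # assumed 70B-class model with GQA
seq_len, batch_size = 8192, 4

def kv_cache_gib(bits_per_element: float) -> float:
    bytes_per_token = 2 * num_layers * num_kv_heads * head_size * bits_per_element / 8
    return bytes_per_token * seq_len * batch_size / 2**30

print(f"fp16 KV cache: {kv_cache_gib(16):.1f} GiB")  # 10.0 GiB
print(f"int4 KV cache: {kv_cache_gib(4):.1f} GiB")   # 2.5 GiB, plus a small scale/zero-point overhead
```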

Methods

Regarding the specific implementation, we propose developing three operations (a rough sketch of the quantization/dequantization logic follows this list):

  1. Develop an operation that calculates the scale and zero point required to quantize the KV cache and converts the existing fp16/bf16 KV cache to the int4 format.
  2. Provide support for storing the 4-bit KV cache in the "write_to_paged_cache" operation.
  3. Enhance the paged-attention operation to support calculations with an int4 KV cache:
    • Add optional inputs k_scale, k_zeropoint, v_scale, and v_zeropoint to the paged-attention operation.
    • In the paged-attention kernel, if the quantization-related parameters are provided, read the int4 KV cache from the GPU's global memory, convert it to an fp16/bf16 representation, and perform the subsequent calculations.
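For illustration, here is a minimal PyTorch sketch of the quantization/dequantization arithmetic described above. It assumes asymmetric per-(token, head) int4 quantization over the head dimension and a simple two-values-per-byte packing; the function names and packing layout are assumptions made for this sketch, not existing vLLM ops, and the real operations would be CUDA kernels alongside write_to_paged_cache and paged-attention.

```python
import torch

def quantize_kv_int4(kv: torch.Tensor, group_dim: int = -1):
    """Asymmetric int4 quantization of a KV-cache tensor.

    Computes a per-group scale and zero point along `group_dim`, maps values
    into [0, 15], and packs two 4-bit values into each uint8 element.
    Assumes the size of the last dimension is even.
    """
    mn = kv.amin(dim=group_dim, keepdim=True)
    mx = kv.amax(dim=group_dim, keepdim=True)
    scale = (mx - mn).clamp(min=1e-6) / 15.0
    zero_point = mn
    q = ((kv - zero_point) / scale).round().clamp(0, 15).to(torch.uint8)
    # Pack neighbouring pairs along the last dim: even index -> low nibble.
    packed = q[..., 0::2] | (q[..., 1::2] << 4)
    return packed, scale, zero_point

def dequantize_kv_int4(packed: torch.Tensor, scale: torch.Tensor,
                       zero_point: torch.Tensor, dtype=torch.float16):
    """Unpack the int4 values and reconstruct an approximate fp16/bf16 tensor.

    This mirrors what the paged-attention kernel would do on the fly (step 3)
    before computing attention in fp16/bf16.
    """
    low = packed & 0x0F
    high = packed >> 4
    q = torch.stack((low, high), dim=-1).flatten(-2)  # restore original layout
    return q.to(dtype) * scale.to(dtype) + zero_point.to(dtype)

# Round-trip check on a toy key tensor: [num_tokens, num_kv_heads, head_size]
k = torch.randn(2, 8, 128)
packed, k_scale, k_zeropoint = quantize_kv_int4(k)
k_approx = dequantize_kv_int4(packed, k_scale, k_zeropoint, dtype=torch.float32)
print((k - k_approx).abs().max())  # quantization error, roughly bounded by scale / 2
```

Under this kind of scheme, the optional k_scale/k_zeropoint and v_scale/v_zeropoint inputs from step 3 would simply be the scale and zero_point tensors produced at quantization time, stored in fp16/fp32 alongside the packed int4 cache.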

Alternatives

No response

Additional context

No response

@smallsunsun1

Any update on this new feature? I am looking for it. In my scenario I mainly use A10 cards, and an int4 KV cache would effectively increase my KV cache size.


yukavio commented Apr 19, 2024

I am currently focused primarily on developing this issue, and I plan to start working on the int4 KV cache in May.


yukavio commented May 7, 2024

I'm very sorry, but for various reasons I need to put the development of this issue on hold. If anyone else is interested in this feature, we can reopen this issue.

yukavio closed this as completed on May 7, 2024

houmie commented May 7, 2024

The problem is that vLLM doesn't support exl2, which would have given us many more options.
So we are pretty much stuck with AWQ quantisation. Currently, vLLM with Llama 3 70B doesn't fit properly on a 48 GB GPU despite 4-bit AWQ quantisation. I even enabled enforce_eager, but it still runs out of memory sometimes; it is not stable enough.

Having this Q4 cache would reduce VRAM usage a bit more, which would be very helpful.

Of course, the best solution would be to support exl2, in which case this feature could be de-prioritised. But right now it's difficult to justify vLLM when Aphrodite supports exl2 out of the box and fits properly on a 48 GB GPU.


houmie commented May 7, 2024

Sorry, I forgot to tag you @yukavio.
Thanks

@SherrySwift

Hi, is there any plan to support 4-bit KV Cache recently?

I am currently focused primarily on developing this issue, and I plan to start working on the int4 KV cache in May.


fzyzcjy commented Nov 26, 2024

Hi, is there any updates? Thanks!

@puppetm4st3r

Any news on this feature? 🤘
