[Feature]: Support for 4-bit KV Cache in paged-attention op #4025
Comments
Any update on this feature? I am looking for it. In my scenario I mainly use A10 cards, and an int4 KV cache would effectively increase my available KV cache size.
I am currently focused primarily on other development, and I plan to start work on the int4 KV cache in May.
I'm very sorry, but for various reasons I need to put development of this feature on hold. If others are interested in it, we can reopen this issue.
The problem is that vLLM doesn't support exl2, which would have given us many more options. Having this Q4 cache would reduce VRAM usage a bit more, which would be very helpful. Of course the best solution is supporting exl2; then this feature could be de-prioritised. But right now it's difficult to justify vLLM when Aphrodite supports exl2 out of the box and fits properly on a 48 GB GPU.
Sorry forgot to tag you @yukavio |
Hi, is there any plan to support a 4-bit KV cache soon?
Hi, are there any updates? Thanks!
Any news on this feature? 🤘
🚀 The feature, motivation and pitch
Summary
We would like to support a 4-bit KV cache for the decoding phase. The purpose of this feature is to reduce the GPU memory used by the KV cache when processing long texts; a 4-bit KV cache would let us handle more, and longer, sequences when GPU memory is limited. Although vLLM already has an fp8 implementation, int4 reduces memory usage further and also works on devices that do not support the fp8 data format, such as the A100.
Methods
Regarding the specific implementation, we propose the development of three operations:
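The core round trip such operations would build on can be sketched in NumPy: quantize floats to unsigned 4-bit with a per-row scale and zero-point, pack two 4-bit values into each byte (halving the footprint versus fp8), and unpack on read. All names here are hypothetical illustrations; vLLM's real kernels would operate on the paged KV-cache layout in CUDA.

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    """Quantize floats to uint4 with a per-row scale/zero-point, then pack.

    Hypothetical helper for illustration only; not vLLM's actual API.
    """
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    scale = (x_max - x_min) / 15.0              # 4-bit range: 0..15
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid divide-by-zero rows
    q = np.clip(np.round((x - x_min) / scale), 0, 15).astype(np.uint8)
    # Pack two 4-bit codes per byte: high nibble = even index, low = odd.
    packed = (q[..., 0::2] << 4) | q[..., 1::2]
    return packed, scale, x_min

def dequantize_int4(packed: np.ndarray, scale: np.ndarray, zero: np.ndarray):
    """Unpack nibbles and map codes back to floats."""
    hi = (packed >> 4) & 0xF
    lo = packed & 0xF
    q = np.stack([hi, lo], axis=-1).reshape(*packed.shape[:-1], -1)
    return q.astype(np.float32) * scale + zero
```

The per-element round-trip error is bounded by half a quantization step (scale / 2), which is the trade-off this feature accepts in exchange for roughly half the KV-cache memory of fp8.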
Alternatives
No response
Additional context
No response