
Question on Quest implementation in KV cache selection #4

Open
Monstertail opened this issue Dec 21, 2024 · 3 comments

Comments

@Monstertail

Monstertail commented Dec 21, 2024

Thanks for your great work and the accompanying baseline implementations! They will benefit future work a lot.
I have a question about the implementation of KV cache selection in Quest.

It looks like in this repo, the Quest cache always keeps all of the generated tokens (see

self.selected_key_cache[layer_idx] = torch.cat([self.selected_key_cache[layer_idx], key_states], dim=-2)

), so page-based selection is effectively limited to the prompt tokens, because the KV pages are built only during prefill.
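In case it is useful, here is a minimal sketch of the behaviour described above, under my own assumptions (this is not the repo's actual cache class; `build_prompt_pages`, `append_generated_keys`, and `PAGE_SIZE` are hypothetical names): page min/max summaries are built once from the prompt at prefill, while every decode-step key is concatenated to the selected cache unconditionally.

```python
import torch

PAGE_SIZE = 16  # hypothetical page size


def build_prompt_pages(prompt_keys: torch.Tensor, page_size: int = PAGE_SIZE):
    """Build per-page min/max key summaries from the prompt keys only (prefill).

    prompt_keys: [batch, heads, prompt_len, head_dim]
    Returns (page_min, page_max), each of shape [batch, heads, num_pages, head_dim].
    """
    pages = prompt_keys.split(page_size, dim=-2)
    page_min = torch.stack([p.min(dim=-2).values for p in pages], dim=-2)
    page_max = torch.stack([p.max(dim=-2).values for p in pages], dim=-2)
    return page_min, page_max


def append_generated_keys(selected_key_cache: torch.Tensor, key_states: torch.Tensor):
    """Decode-time keys are concatenated directly, bypassing page-level selection,
    so every generated token is always attended to."""
    return torch.cat([selected_key_cache, key_states], dim=-2)
```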

It looks like the original Quest implementation (https://github.com/mit-han-lab/Quest/blob/main/evaluation/quest_attention.py) dynamically updates the KV pages during decoding.

Will you consider implementing dynamic KV page updating? I implemented a very simple, but not perfect, version here: https://github.com/Monstertail/MagicPIG/blob/b635d06ae2c68c1d2949f2e95f358fb5746f6108/RULER/RULER/scripts/pred/quest_cache.py#L253 . If you are interested, we could work together on improving it.
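In case it helps the discussion, here is a rough sketch of what dynamic page maintenance during decoding could look like. It is neither the original Quest code nor the version linked above; the class name, page size, and method names are hypothetical. The idea is that each newly generated key goes into an open page whose min/max summary is refreshed every step, and a full page is frozen so the next token starts a new page.

```python
import torch


class DynamicPageTracker:
    """Hypothetical helper that keeps per-page min/max key summaries up to date
    during decoding, so page selection can also cover generated tokens."""

    def __init__(self, page_size: int = 16):
        self.page_size = page_size
        self.page_min = []        # frozen pages: list of [heads, head_dim] tensors
        self.page_max = []
        self.open_page_keys = []  # keys of the page still being filled

    def append(self, key_state: torch.Tensor):
        """key_state: [heads, head_dim] key of the newly generated token."""
        self.open_page_keys.append(key_state)
        keys = torch.stack(self.open_page_keys, dim=-2)  # [heads, n, head_dim]
        open_min = keys.min(dim=-2).values               # refreshed every step
        open_max = keys.max(dim=-2).values
        if len(self.open_page_keys) == self.page_size:
            # The page is full: freeze its summary and start a new open page.
            self.page_min.append(open_min)
            self.page_max.append(open_max)
            self.open_page_keys = []
        return open_min, open_max
```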

@dreaming-panda
Contributor

Dynamically maintaining the Quest pages may take some effort. Currently, I simply select all the generated tokens, since most of the evaluated tasks do not generate many tokens. Thank you for your efforts; I will carefully examine your code. You are also welcome to submit a PR if you want!

Thank you again!

@Monstertail
Author

Hi, sorry for the late reply. Congratulations on MagicPIG's acceptance at ICLR! I agree that having all methods keep the generated tokens is a fair way to show the effectiveness of MagicPIG.

Since you target long-prompt, short-generation scenarios, I would like to ask whether you will consider short-prompt, long-generation scenarios in the future.

Thanks for your great work; it's my favorite KV compression paper!

@dreaming-panda
Contributor

> short prompt long generation

It might become popular as reasoning models gain attention. However, I have not yet figured out the most efficient and accurate way to design algorithms for it. Also, reproducing the results of long-generation models (e.g., R1 generates 32K tokens to solve some math problems) seems very time-consuming, and I am not even sure what would be a good metric.
