
Question on Quest implementation in KV cache selection #4

Open
Monstertail opened this issue Dec 21, 2024 · 3 comments

Comments

@Monstertail

Monstertail commented Dec 21, 2024

Thanks for your great work and the accompanying baseline implementations! They will benefit future work a lot.
I have a question about the implementation of KV cache selection in Quest.

It looks like in this repo, the Quest cache always keeps all of the generated tokens (see

self.selected_key_cache[layer_idx] = torch.cat([self.selected_key_cache[layer_idx], key_states], dim=-2)

), so page-based selection is effectively limited to the prompt tokens, because the KV pages are built only during prefill.
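In case it is useful, here is a minimal sketch of the behaviour described above, under my own assumptions (this is not the repo's actual cache class; `build_prompt_pages`, `append_generated_keys`, and `PAGE_SIZE` are hypothetical names): page min/max summaries are built once from the prompt at prefill, while every decode-step key is concatenated to the selected cache unconditionally.

```python
import torch

PAGE_SIZE = 16  # hypothetical page size


def build_prompt_pages(prompt_keys: torch.Tensor, page_size: int = PAGE_SIZE):
    """Build per-page min/max key summaries from the prompt keys only (prefill).

    prompt_keys: [batch, heads, prompt_len, head_dim]
    Returns (page_min, page_max), each of shape [batch, heads, num_pages, head_dim].
    """
    pages = prompt_keys.split(page_size, dim=-2)
    page_min = torch.stack([p.min(dim=-2).values for p in pages], dim=-2)
    page_max = torch.stack([p.max(dim=-2).values for p in pages], dim=-2)
    return page_min, page_max


def append_generated_keys(selected_key_cache: torch.Tensor, key_states: torch.Tensor):
    """Decode-time keys are concatenated directly, bypassing page-level selection,
    so every generated token is always attended to."""
    return torch.cat([selected_key_cache, key_states], dim=-2)
```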

It looks like the original Quest implementation (https://github.com/mit-han-lab/Quest/blob/main/evaluation/quest_attention.py) dynamically updates the KV pages during decoding.

Will you consider implementing dynamic KV page updating? I implemented a very simple, but not perfect, version here: https://github.com/Monstertail/MagicPIG/blob/b635d06ae2c68c1d2949f2e95f358fb5746f6108/RULER/RULER/scripts/pred/quest_cache.py#L253 . If you are interested, we could work together on improving it.
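In case it helps the discussion, here is a rough sketch of what dynamic page maintenance during decoding could look like. It is neither the original Quest code nor the version linked above; the class name, page size, and method names are hypothetical. The idea is that each newly generated key goes into an open page whose min/max summary is refreshed every step, and a full page is frozen so the next token starts a new page.

```python
import torch


class DynamicPageTracker:
    """Hypothetical helper that keeps per-page min/max key summaries up to date
    during decoding, so page selection can also cover generated tokens."""

    def __init__(self, page_size: int = 16):
        self.page_size = page_size
        self.page_min = []        # frozen pages: list of [heads, head_dim] tensors
        self.page_max = []
        self.open_page_keys = []  # keys of the page still being filled

    def append(self, key_state: torch.Tensor):
        """key_state: [heads, head_dim] key of the newly generated token."""
        self.open_page_keys.append(key_state)
        keys = torch.stack(self.open_page_keys, dim=-2)  # [heads, n, head_dim]
        open_min = keys.min(dim=-2).values               # refreshed every step
        open_max = keys.max(dim=-2).values
        if len(self.open_page_keys) == self.page_size:
            # The page is full: freeze its summary and start a new open page.
            self.page_min.append(open_min)
            self.page_max.append(open_max)
            self.open_page_keys = []
        return open_min, open_max
```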

@dreaming-panda
Contributor

Dynamically maintaining the Quest pages may take some effort. Currently, I simply select all the generated tokens, since most of the evaluated tasks do not generate many tokens. Thank you for your efforts; I will carefully examine your code. You are also welcome to submit a PR if you want!

Thank you again!

@Monstertail
Author

Hi, sorry for the late reply. Congratulations on MagicPIG's acceptance at ICLR! I agree that having all methods keep the generated tokens is a fair way to show the effectiveness of MagicPIG.

Since you target long-prompt, short-generation scenarios, I would like to ask whether you will consider short-prompt, long-generation scenarios in the future.

Thanks for your great work; it's my favorite KV compression paper!

@dreaming-panda
Contributor

> short prompt long generation

It might become popular as reasoning models gain attention. However, I have not yet figured out the most efficient and accurate way to design algorithms for it. Also, reproducing the results of long-generation models (e.g., R1 generates 32K tokens to solve some math problems) seems very time-consuming, and I am not even sure what would be a good metric.
