I've had a question about ngram speculative decoding for a while: in RAG, more than one document is usually relevant to the answer. So why don't we first let the large model generate 3 tokens, and then take every continuation that follows a matching n-gram as a candidate, rather than only one?
In simpler terms, imagine you're looking for an object in several rooms but can only carry three things at once. You might want to pick up some important items now so you won't forget them when carrying more stuff later. This way, you make sure your search is efficient and effective.
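A minimal sketch of the proposal, to make it concrete. This is not vLLM's implementation; the function name, parameters, and defaults are hypothetical. It assumes the context (prompt plus generated tokens) is a flat list of token IDs, takes the last `ngram_size` tokens as the lookup key, and returns the continuation after every earlier occurrence of that key instead of only one.

```python
from typing import List, Sequence, Tuple


def all_ngram_continuations(
    token_ids: Sequence[int],
    ngram_size: int = 3,       # hypothetical default: the "3 tokens" from the question
    num_speculative: int = 4,  # hypothetical default: tokens proposed per candidate
) -> List[Tuple[int, ...]]:
    """Return every distinct continuation that follows an earlier match of the last n-gram."""
    if len(token_ids) < ngram_size:
        return []

    suffix = tuple(token_ids[-ngram_size:])
    last_start = len(token_ids) - ngram_size  # start index of the suffix itself

    seen = set()
    candidates: List[Tuple[int, ...]] = []
    # Scan every earlier window; the suffix's own position is excluded.
    for start in range(last_start):
        if tuple(token_ids[start:start + ngram_size]) != suffix:
            continue
        end = start + ngram_size
        continuation = tuple(token_ids[end:end + num_speculative])
        # Keep each distinct continuation once, in order of first appearance.
        if continuation and continuation not in seen:
            seen.add(continuation)
            candidates.append(continuation)
    return candidates
```

For example, if the n-gram `1 2 3` appears in two retrieved documents with different follow-ups, both follow-ups come back as candidates, and the target model would then have to verify each of them.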
Alternatives
No response
Additional context
No response
You can do this, and in fact this is something we want to do in vLLM. The primary tradeoff is that verifying additional speculative tokens costs FLOPs, so if you don't pick candidates carefully you can end up hurting overall performance (especially as the number of concurrent requests grows). See #4565 for more details.
In terms of implementing this in vLLM, we need #3960 first.
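As a rough illustration of the tradeoff described above, the number of speculative tokens the target model has to score per step grows with both the number of candidates and the number of concurrent requests. The helper below is a hypothetical back-of-the-envelope count, not a vLLM API; it assumes every candidate is verified in full and that candidates share no prefix.

```python
def tokens_to_verify(
    concurrent_requests: int,
    candidates_per_request: int,
    speculative_len: int,
) -> int:
    """Upper bound on speculative tokens scored in one verification step.

    Assumption: each candidate is scored independently, with no sharing
    of common prefixes between candidates.
    """
    return concurrent_requests * candidates_per_request * speculative_len


# One request, one candidate of 4 tokens.
single = tokens_to_verify(1, 1, 4)
# 32 concurrent requests, 3 candidates each, 4 tokens per candidate.
multi = tokens_to_verify(32, 3, 4)
```

Under these assumptions, going from one candidate to three triples the verification work at every batch size, which is why candidate selection matters more as concurrency grows.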
cadedaniel changed the title from "[Feature]: ngram-spec-decode" to "[Feature]: Evaluate multiple ngram speculations in speculative decoding" on Jul 25, 2024
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!