[Model][Speculative Decoding] DeepSeek MTP spec decode #12755
base: main
Conversation
Otherwise LGTM. It's pretty clean so no concerns.
Do you know how long it takes to run all the tests in this file?
    inputs_embeds: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)
QQ: where did we truncate the input_ids?
For the 1st stage: position 0 is masked for MTP, but that only applies to k=1; I need to change the mask to `positions <= k-1`.
For stage 2+, the tokens from the previous stage are marked as pre-computed; this gets complicated for k>1 across different layers, so I need to look into it.
In short, the current change works for k=1 (which is what the DeepSeek V3 model sets), but k>1 needs more changes.
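For reference, a minimal sketch of what such a step-dependent mask could look like. The tensor names mirror the snippet above; the generalization to `positions <= spec_step_index` is my reading of the comment, not code taken from this PR:

```python
import torch

def mask_precomputed_embeddings(inputs_embeds: torch.Tensor,
                                positions: torch.Tensor,
                                spec_step_index: int) -> torch.Tensor:
    # Zero out embeddings for positions already covered by earlier MTP steps.
    # With spec_step_index == 0 this masks only position 0 (the k=1 case);
    # larger indices mask the additional leading positions as the comment
    # above proposes for k > 1.
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[positions <= spec_step_index] = 0
    return inputs_embeds

# toy example: 4 positions, hidden size 8
embeds = torch.randn(4, 8)
positions = torch.arange(4)
masked = mask_precomputed_embeddings(embeds, positions, spec_step_index=0)
assert torch.all(masked[0] == 0)
```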
QQ: what's the shape of `input_ids` here?
The incremental length; here it's [B*1].
Any way to add an MD file with examples of how to use MTP for speculative decoding?
Especially:
- `num_nextn_predict_layers` is 1; can we specify a speculation length greater than 1, and what are the requirements for formatting the draft model artifacts?
- Is this code compatible with multi-node inference? I assume so, since the draft is loaded on a single GPU?
# max. number of speculative tokens: this corresponds to
# num_heads in the config.json of the speculator model.
MAX_SPEC_TOKENS = 3
`num_nextn_predict_layers` in DeepSeek V3 is only 1. Does that mean you will reuse the MTP head if I specify MAX_SPEC_TOKENS greater than 1?
This is a test file using a dummy model. `num_speculative_tokens` should be <= `num_nextn_predict_layers`, since the transformer blocks differ between steps. I am adding an assertion for the case where the user passes a higher number.
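As a rough illustration, such a guard could look like the sketch below; the function and argument names are hypothetical, not taken from this PR:

```python
def validate_speculative_config(num_speculative_tokens: int,
                                num_nextn_predict_layers: int) -> None:
    # Each speculative step runs a distinct MTP transformer block, so we
    # cannot speculate further than the number of MTP layers the checkpoint
    # actually provides.
    assert num_speculative_tokens <= num_nextn_predict_layers, (
        f"num_speculative_tokens ({num_speculative_tokens}) must not exceed "
        f"num_nextn_predict_layers ({num_nextn_predict_layers}) of the "
        "DeepSeek MTP model")
```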
Is there a way to just reuse the MTP head to predict tokens for k > 1? Essentially they are the same, right?
You could print a warning that this is not expected, though.
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
    super().__init__()
    config = vllm_config.model_config.hf_config
    self.mtp_start_layer_idx = config.num_hidden_layers
Shouldn't `mtp_start_layer_idx` be `num_hidden_layers - 1`? `num_hidden_layers` is 61 in the DeepSeek config, so the index of the last layer is 60.
Per https://huggingface.co/deepseek-ai/DeepSeek-V3/raw/main/model.safetensors.index.json, the last layer is 61, which is the MTP layer.
> Per https://huggingface.co/deepseek-ai/DeepSeek-V3/raw/main/model.safetensors.index.json, the last layer is 61, which is the MTP layer.
I see, thanks for clarifying!
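In other words, the MTP layers are appended after the main decoder stack, so indexing can start at `num_hidden_layers`. A small sketch of that mapping, assuming a config with the two fields discussed here:

```python
class DummyConfig:
    num_hidden_layers = 61          # main decoder layers: indices 0..60
    num_nextn_predict_layers = 1    # MTP layers appended after the main stack

config = DummyConfig()
mtp_start_layer_idx = config.num_hidden_layers  # 61, matching model.layers.61.* in the checkpoint
mtp_layer_indices = range(mtp_start_layer_idx,
                          mtp_start_layer_idx + config.num_nextn_predict_layers)
print(list(mtp_layer_indices))  # [61]
```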
@luccafong Sorry, I have to ask these questions since I hope to use your implementation.
@luccafong I have been working on a similar implementation locally, and have faced a few challenges that I'm not sure are addressed here. Have you validated the acceptance rate for k=1 with real weights? I believe that the final RMSNorm in the DeepSeekV3 main model is not necessary for speculative decoding since the […]. Also, I have noticed the acceptance rate becomes very low (<50%) when I enable the recently added MLA attention. Have you noticed this too? I am not sure what could cause this; maybe it is a bug fixed in recent commits to vLLM. I would like to know if this is an issue for your implementation.
The acceptance rate is around 56% in my testing. MLA attention could take a different branch. Regarding the norm, thanks for pointing it out; let me try adjusting it to see if there is an improvement.
1. Not tested with multi-node settings. 2. We can reuse the head if you do some model processing, e.g. duplicating the weights across layers; a hacky change would not be proper, since with `num_nextn_predict_layers` > 1 and k > `num_nextn_predict_layers` it is difficult to decide which layer to forward multiple times.
This pull request has merge conflicts that must be resolved before it can be merged.
if inputs_embeds is None:
    inputs_embeds = self.embed_tokens(input_ids)
assert inputs_embeds is not None
inputs_embeds[positions <= spec_step_index] = 0
Please excuse my multiple questions.
`inputs_embeds[positions <= spec_step_index] = 0` is for the prefill stage of each MTP head, correct? During the draft-model (MTP head) decoding stage, `inputs_embeds` is a single hidden vector.
That is what I saw in the EAGLE workflow: it first enters the code here with `num_steps` being 1 for prefill (which is where the mask takes effect). Then `num_steps` becomes the speculation length k, and `inputs_embeds` for each forward pass is a single embedding vector.
But I didn't see your logic modified in vllm/spec_decode/draft_model_runner.py (line 271 at 467a96a, the `for step in range(num_steps):` loop). Where do you introduce `spec_step_index`, then?
Just left some comments, will finish review by EOD.
@@ -1925,11 +1946,13 @@ def _verify_and_get_draft_model_tensor_parallel_size(
         # If speculative_draft_tensor_parallel_size is unset then set it
         # appropriately else verify that it is set correctly.
         if speculative_draft_tensor_parallel_size is None:
-            if draft_hf_config.model_type == "mlp_speculator":
+            if draft_hf_config.model_type in ("mlp_speculator",
+                                              "deepseek_mtp"):
                 speculative_draft_tensor_parallel_size = 1
It seems we run the drafting head with TP=1 by default?
It is removed now.
) -> None:
    super().__init__()
    self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    self.head = ParallelLMHead(config.vocab_size,
Is this field used anywhere?
It is used in line 148:
logits = self.logits_processor(mtp_layer.shared_head.head,
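For context, a condensed plain-PyTorch analogue of how the shared head ties into logits computation. This is a sketch only; the real module uses vLLM's RMSNorm, ParallelLMHead and LogitsProcessor, and the stand-in classes below are not vLLM code:

```python
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    # Stand-in for vLLM's RMSNorm.
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SharedHeadSketch(nn.Module):
    # Normalize the MTP hidden states, then project to vocabulary logits,
    # mirroring norm + head in the snippet above.
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.norm = RMSNormSketch(hidden_size)
        self.head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.head(self.norm(hidden_states))

logits = SharedHeadSketch(hidden_size=16, vocab_size=32)(torch.randn(2, 16))
assert logits.shape == (2, 32)
```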
    inputs_embeds: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)
QQ: what's the shape of `input_ids` here?
model_output: List[SamplerOutput] = self.worker.execute_model(
    execute_model_req=expanded_request)
expanded_request.spec_step_idx += 1
QQ: what's the purpose of `spec_step_idx`?
It looks to me like it's used to track which MTP module to call (therefore which layer to dispatch during eval).
Is this implementation compatible with CUDA graphs? If so, how?
This branch will be used for TP > 1, and `spec_step_idx` can only be passed this way since we are using the normal model runner. @benchislett good point; for now k=1 is not a problem, but we need to work out a k>1 solution and make some changes to the CUDA graph build, which I plan to do in luccafong#1.
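To illustrate the kind of dispatch that `spec_step_idx` enables, here is a simplified sketch; the class and names are illustrative stand-ins, not the PR's actual MTP model or multi-step worker:

```python
import torch
import torch.nn as nn

class MTPStackSketch(nn.Module):
    # Toy stand-in: one module per speculative step, indexed starting at
    # num_hidden_layers so the keys line up with the checkpoint layout.
    def __init__(self, num_hidden_layers: int, num_nextn_predict_layers: int):
        super().__init__()
        self.mtp_start_layer_idx = num_hidden_layers
        self.layers = nn.ModuleDict({
            str(self.mtp_start_layer_idx + i): nn.Identity()
            for i in range(num_nextn_predict_layers)
        })

    def forward(self, hidden_states: torch.Tensor, spec_step_idx: int = 0):
        # spec_step_idx selects which MTP block runs on this speculative step.
        layer = self.layers[str(self.mtp_start_layer_idx + spec_step_idx)]
        return layer(hidden_states)

stack = MTPStackSketch(num_hidden_layers=61, num_nextn_predict_layers=1)
out = stack(torch.randn(2, 8), spec_step_idx=0)  # dispatches to "layer 61"

# The worker-side loop then bumps the index after each draft step, roughly:
#   for _ in range(num_speculative_tokens):
#       out = model(hidden_states, spec_step_idx=request.spec_step_idx)
#       request.spec_step_idx += 1
```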
@@ -1649,6 +1649,7 @@ def execute_model(
         kv_caches: List[torch.Tensor],
         intermediate_tensors: Optional[IntermediateTensors] = None,
         num_steps: int = 1,
+        **kwargs,
unused kwargs
Updated to support TP > 1; it is used to pass the previous hidden states in the latest code.
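A rough sketch of how extra keyword arguments can carry previous hidden states through a generic execute_model-style path; the names and signatures here are assumptions for illustration, not the exact vLLM interfaces:

```python
from typing import Optional
import torch
import torch.nn as nn

class TinyDraftModel(nn.Module):
    # Toy model that consumes the previous hidden states when provided.
    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids, positions,
                previous_hidden_states: Optional[torch.Tensor] = None):
        if previous_hidden_states is None:
            previous_hidden_states = torch.zeros(input_ids.shape[0], 8)
        return self.proj(previous_hidden_states)

def execute_model_sketch(model, input_ids, positions, num_steps: int = 1,
                         **kwargs):
    # Extra fields (e.g. the target model's hidden states) travel via
    # **kwargs so the generic runner signature stays unchanged.
    return model(input_ids, positions,
                 previous_hidden_states=kwargs.get("previous_hidden_states"))

out = execute_model_sketch(TinyDraftModel(),
                           torch.zeros(2, dtype=torch.long),
                           torch.zeros(2, dtype=torch.long),
                           previous_hidden_states=torch.randn(2, 8))
```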
LGTM, thanks for the work!!
Signed-off-by: Lu Fang <[email protected]>
Implement DeepSeek MTP (#12181): support DeepSeek MTP layers for next-n prediction.
Online Serving
Add --num-speculative-tokens 1 for DeepSeek V3/R1.
Offline Inference
Set num_speculative_tokens = 1.
Note: This implementation is validated on MTP k=1 models only.
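A minimal offline-inference sketch of the flow described above. Only `num_speculative_tokens=1` is taken from this description; the model path, `tensor_parallel_size`, and whether `speculative_model` must point at the same checkpoint are assumptions that may need adjusting for your vLLM version and setup:

```python
from vllm import LLM, SamplingParams

# Assumption: the MTP weights live in the main DeepSeek checkpoint, so the
# target model path is reused for the draft; check the docs for your vLLM
# version before relying on these exact keyword arguments.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    speculative_model="deepseek-ai/DeepSeek-R1",
    num_speculative_tokens=1,   # MTP k=1, as validated in this PR
    trust_remote_code=True,
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```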
Benchmark Results
The acceptance rate is 81% ~ 82.3% on R1 k=1.
The speedup depends on QPS, with a 1.63x speedup at QPS=1 and improvements for QPS < 8, as shown in the tables below.
Results on various QPS (tables omitted): Draft TP=1, Draft TP=8.
Results on various concurrency (table omitted): Draft TP=8.