
[Model][Speculative Decoding] DeepSeek MTP spec decode #12755

Open
wants to merge 12 commits into main from ds_mtp
Conversation

@luccafong commented Feb 4, 2025

Implements DeepSeek MTP (#12181): support for DeepSeek MTP layers for next-n token prediction.

Online Serving
Add --num-speculative-tokens 1 for DeepSeek V3/R1:

python -m vllm.entrypoints.openai.api_server --disable-log-requests --gpu-memory-utilization 0.8 --max-model-len 65536 --max-num-seqs 128 --seed 0 --tensor-parallel-size 8 --model deepseek-ai/DeepSeek-R1 --trust-remote-code --num-speculative-tokens 1
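
Once the server is up, it exposes the standard OpenAI-compatible API (default port 8000). A minimal client sketch, with an illustrative prompt and token budget:

import requests

# Query the OpenAI-compatible completions endpoint started by the command above.
# The prompt and max_tokens are illustrative only.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1",
        "prompt": "Explain speculative decoding in one sentence.",
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["text"])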

Offline Inference
Set num_speculative_tokens = 1

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    max_model_len=8192,  # can be increased if your hardware has enough memory
    num_speculative_tokens=1,  # only 1 is supported for now
    draft_tensor_parallel_size=8,  # optional; defaults to tensor_parallel_size
)
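
A minimal generation sketch once the LLM above is constructed (the prompt and sampling settings are illustrative):

from vllm import SamplingParams

# Greedy decoding keeps the comparison with the target model simple; adjust as needed.
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one sentence."],
                       sampling_params)
print(outputs[0].outputs[0].text)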

Note: this implementation is validated only for MTP k=1 models.

Benchmark Results

The acceptance rate is 81% to 82.3% on R1 with k=1.
The speedup depends on the QPS: 1.63x at QPS=1, with smaller but consistent improvements for QPS < 8, as shown in the tables below.
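
For reference, the speedup column in the tables below is simply the ratio of baseline TPOT to MTP TPOT; for example, at QPS=1 with draft TP=1, 55.47 / 33.99 ≈ 1.63x.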

Results on various QPS

Draft TP=1

QPS   Baseline TPOT   k=1 TPOT   Speedup
1     55.47           33.99      1.63x
2     57.58           48.8       1.18x
4     64.29           51.02      1.26x
6     122.93          108.15     1.14x
8     120.18          119.14     1.0x

Draft TP=8

QPS   Baseline TPOT   k=1, TP=8 TPOT   Speedup
1     55.47           32.64            1.69x
2     57.58           43.6             1.32x
4     64.29           52.62            1.22x
6     122.93          129.5            < 1.0
8     120.18          139.49           < 1.0

Results on various concurrency
Draft TP=8

Max concurrency   Baseline TPOT   k=1 TPOT   Speedup
1                 23.13           17.24      1.34x
2                 28.10           17.07      1.64x
4                 27.57           21.48      1.28x
8                 38.57           34.62      1.11x
16                50.24           40.89      1.22x
32                70.88           56.63      1.25x


github-actions bot commented Feb 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of tests meant to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@comaniac (Collaborator) left a comment:

Otherwise LGTM. It's pretty clean so no concerns.

Collaborator:

Do you know how long it takes to run all the tests in this file?

Resolved review threads (outdated):

  • vllm/spec_decode/draft_model_runner.py
  • vllm/spec_decode/spec_decode_worker.py
  • vllm/transformers_utils/configs/__init__.py
  • vllm/transformers_utils/configs/deepseek_mtp.py
  • vllm/worker/worker.py
    inputs_embeds: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)
Collaborator:

QQ: where did we truncate the input_ids?

Author:

For the 1st stage, position 0 is masked for MTP, but that only applies to k=1; I need to change the mask to [position <= k-1]. For stage 2+, tokens from the previous stage are marked as pre-computed, which is more complicated for k>1 across different layers and needs more investigation.

In short, the current change works for k=1 (which is what the DeepSeek V3 model sets), but k>1 needs more changes.
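
For illustration, a minimal sketch of the generalized mask described above, assuming a 0-indexed spec_step_index for the current MTP step (this mirrors the inputs_embeds masking quoted later in this conversation; it is not new PR code):

# Positions that have no valid shifted input token at this MTP step are zeroed;
# for the first step (spec_step_index == 0) this masks only position 0.
inputs_embeds[positions <= spec_step_index] = 0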

Collaborator:

QQ: what's the shape of input_ids here?

Author:

It is the incremental length; here it is [B * 1].

@Neo9061 left a comment:

Is there any way to add an MD file with examples showing how to use MTP for speculative decoding?

Especially:

  1. num_nextn_predict_layers is 1; can we specify a speculation length greater than 1, and what are the requirements on formatting the draft model artifacts?
  2. Is this code compatible with multi-node inference? I assume so, since the draft is loaded on a single GPU?


# max. number of speculative tokens: this corresponds to
# num_heads in the config.json of the speculator model.
MAX_SPEC_TOKENS = 3

DeepSeek V3 has num_nextn_predict_layers = 1. Does that mean you will reuse the MTP head if I specify MAX_SPEC_TOKENS greater than 1?

Author:

This is a test file using a dummy model. num_speculative_tokens should be <= num_nextn_predict_layers, since the transformer blocks differ across steps. I am adding an assertion for the case where the user passes a higher number.
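
A minimal sketch of such a check, assuming a hypothetical standalone helper (field names follow the DeepSeek config; this is not the exact PR code):

def validate_mtp_spec_tokens(num_speculative_tokens: int,
                             num_nextn_predict_layers: int) -> None:
    # Each speculative step uses a distinct MTP transformer block, so the
    # requested number of speculative tokens cannot exceed the number of
    # next-n prediction layers in the checkpoint.
    if num_speculative_tokens > num_nextn_predict_layers:
        raise ValueError(
            f"num_speculative_tokens ({num_speculative_tokens}) must be <= "
            f"num_nextn_predict_layers ({num_nextn_predict_layers}).")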

@Neo9061:

Is there a way to just reuse the MTP module to predict tokens for k > 1, since essentially they are the same?

You could print a warning that this is not the expected usage, though.

def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
    super().__init__()
    config = vllm_config.model_config.hf_config
    self.mtp_start_layer_idx = config.num_hidden_layers

Shouldn't mtp_start_layer_idx be num_hidden_layers - 1?

num_hidden_layers is 61 in the DeepSeek config, and the index of the last layer is 60.

Author:

https://huggingface.co/deepseek-ai/DeepSeek-V3/raw/main/model.safetensors.index.json shows that the last layer is 61, which is the MTP layer.

I see, thanks for clarifying!
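
To summarize the indexing discussed above (values taken from the DeepSeek-V3 config; illustrative only):

# Main-model layers occupy indices 0..60, so the single MTP layer in the
# checkpoint sits at index num_hidden_layers == 61.
num_hidden_layers = 61
num_nextn_predict_layers = 1
mtp_layer_indices = list(range(num_hidden_layers,
                               num_hidden_layers + num_nextn_predict_layers))
assert mtp_layer_indices == [61]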

@Neo9061 commented Feb 5, 2025

@luccafong Sorry, I have to ask these questions since I hope to use your implementation.

  1. Have you tested it end-to-end with vLLM's multi-node distributed inference setting? Asking because I can only deploy the model in a multi-node setting.
  2. If I want to reuse the MTP head for speculation length k > 1, what hack would you recommend to just make it work? k=1 is too limited for my application.

@benchislett commented:

@luccafong I have been working on a similar implementation locally and have faced a few challenges that I'm not sure are addressed here. Have you validated the acceptance rate for k=1 with real weights?

I believe the final RMSNorm in the DeepSeek-V3 main model is not necessary for speculative decoding, since the hnorm already normalizes the previous hidden states received from the main model. It's unclear to me how it is classified in the DeepSeek-V3 technical report, but I think the norm might be included in the output head and therefore not applied to the input of the MTP module. Anecdotally, I observe a small increase in acceptance rate with this change.

Also, I have noticed that the acceptance rate becomes very low (<50%) when I enable the recently added MLA attention. Have you noticed this too? I am not sure what could cause this; maybe it is a bug fixed in recent commits to vLLM. I would like to know whether this is an issue for your implementation.
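
For context, a rough sketch of the MTP input path under discussion, using the module names from the DeepSeek-V3 checkpoint layout (enorm, hnorm, eh_proj); this is a simplification for illustration, not the PR code:

import torch
from torch import nn

class MTPInputProj(nn.Module):
    """Sketch: normalize the token embedding and the previous hidden state
    separately, then concatenate and project back to hidden_size
    (requires PyTorch >= 2.4 for nn.RMSNorm)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.enorm = nn.RMSNorm(hidden_size, eps=eps)  # norm on token embedding
        self.hnorm = nn.RMSNorm(hidden_size, eps=eps)  # norm on previous hidden state
        self.eh_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def forward(self, token_embeds: torch.Tensor,
                previous_hidden_states: torch.Tensor) -> torch.Tensor:
        return self.eh_proj(
            torch.cat([self.enorm(token_embeds),
                       self.hnorm(previous_hidden_states)], dim=-1))

Because hnorm already normalizes the incoming hidden state, an extra RMSNorm applied by the main model before handing it over would be redundant, which is the point being raised.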

@luccafong (Author) replied:
> (quoting @benchislett's comment above)

The acceptance rate was around 56% during my testing; MLA attention could take a different branch (https://github.com/luccafong/vllm/blob/ds_mtp/vllm/spec_decode/multi_step_worker.py#L98), which I fixed in a later commit.

Regarding the norm, thanks for pointing it out; let me try adjusting it to see if there is an improvement.

@luccafong (Author) commented Feb 6, 2025

> (quoting @Neo9061's questions above)

1. Not tested with multi-node settings. 2. We can reuse the MTP head if you do some model preprocessing, e.g. duplicating the weights into different layers; a hacky in-code change would not be proper, because when k is larger than the number of predict layers it is difficult to decide which layer to forward multiple times.
Note that, as commented in the other thread, some changes are still needed for k>1; this is work in progress, and I will update you once it works.


mergify bot commented Feb 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 6, 2025
if inputs_embeds is None:
    inputs_embeds = self.embed_tokens(input_ids)
assert inputs_embeds is not None
inputs_embeds[positions <= spec_step_index] = 0

Please excuse my multiple questions.

inputs_embeds[positions <= spec_step_index] = 0 is for the prefill stage of each MTP head, correct? During the draft-model (MTP head) decoding stage, inputs_embeds is a single hidden vector.

That is what I saw in the EAGLE workflow: it first enters the code here with num_steps = 1 for prefill (which is where the mask is effective); then num_steps becomes the speculation length k, and inputs_embeds for each forward pass is a single embedding vector.

But I didn't see your logic modified in

for step in range(num_steps):

to introduce spec_step_index, so where do you introduce it?
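
For context, a minimal sketch of how the step index is threaded through the multi-step loop, based on the expanded_request.spec_step_idx += 1 snippet quoted later in this conversation (the loop body here is simplified, not the exact PR code):

# Each draft step runs the worker once, then bumps the MTP step index so the
# model runner can pick the matching MTP layer (and apply its position mask).
for step in range(num_steps):
    model_output = self.worker.execute_model(
        execute_model_req=expanded_request)
    expanded_request.spec_step_idx += 1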

@luccafong force-pushed the ds_mtp branch 2 times, most recently from cc0410d to f0d3627 (February 13, 2025 00:04)
@LiuXiaoxuanPKU (Collaborator) left a comment:

Just left some comments, will finish review by EOD.

@@ -1925,11 +1946,13 @@ def _verify_and_get_draft_model_tensor_parallel_size(
     # If speculative_draft_tensor_parallel_size is unset then set it
     # appropriately else verify that it is set correctly.
     if speculative_draft_tensor_parallel_size is None:
-        if draft_hf_config.model_type == "mlp_speculator":
+        if draft_hf_config.model_type in ("mlp_speculator",
+                                          "deepseek_mtp"):
             speculative_draft_tensor_parallel_size = 1
Collaborator:

It seems we run the drafting head with TP=1 by default?

Author:

It is removed now.

) -> None:
    super().__init__()
    self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    self.head = ParallelLMHead(config.vocab_size,
Collaborator:

Is this field used anywhere?

Author:

It is used at line 148:

logits = self.logits_processor(mtp_layer.shared_head.head,
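
For context, a rough sketch of how the shared head feeds the logits processor (based on the quoted line; the surrounding variable names are assumptions, not the exact PR code):

# shared_head bundles a final RMSNorm and a ParallelLMHead; the logits
# processor applies the LM head to the (normalized) hidden states.
logits = self.logits_processor(mtp_layer.shared_head.head,
                               mtp_layer.shared_head.norm(hidden_states),
                               sampling_metadata)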

    inputs_embeds: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)
Collaborator:

QQ: what's the shape of input_ids here?

Resolved review thread (outdated): vllm/model_executor/models/deepseek_mtp.py
model_output: List[SamplerOutput] = self.worker.execute_model(
    execute_model_req=expanded_request)
expanded_request.spec_step_idx += 1
Collaborator:

QQ: what's the purpose of spec_step_idx?

@benchislett:

It looks to me like it's used to track which MTP module to call (therefore which layer to dispatch during eval).

Is this implementation compatible with CUDA graphs? If so, how?

Author:

This branch is used when TP > 1, and spec_step_idx can only be passed this way since we are using the normal model runner. @benchislett good point; for now k=1 is not a problem, but we need to work out a k>1 solution and make some changes to the CUDA graph build, which I plan to do in luccafong#1.

Resolved review thread (outdated): vllm/spec_decode/spec_decode_worker.py
@@ -1649,6 +1649,7 @@ def execute_model(
         kv_caches: List[torch.Tensor],
         intermediate_tensors: Optional[IntermediateTensors] = None,
         num_steps: int = 1,
+        **kwargs,
Collaborator:

unused kwargs

@luccafong (Author) replied on Feb 13, 2025:

Updated to support TP > 1; it is now used to pass the previous hidden states in the latest code.
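
For illustration, a rough sketch of what the kwargs-based plumbing could look like on the caller side (the previous_hidden_states keyword name is an assumption, not necessarily the exact PR code):

# The draft worker forwards the target model's hidden states through **kwargs so
# each MTP step can condition on them.
model_output = self.model_runner.execute_model(
    model_input,
    kv_caches,
    intermediate_tensors=None,
    num_steps=1,
    previous_hidden_states=previous_hidden_states,
)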

@luccafong force-pushed the ds_mtp branch 8 times, most recently from db5146b to 511db7b (February 14, 2025 11:23)
@LiuXiaoxuanPKU (Collaborator) left a comment:

LGTM, thanks for the work!!

@simon-mo added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 14, 2025
Labels: ready, speculative-decoding

8 participants