Optimize MoE topk with torch compile #3236

ispobock · 2025-01-31T17:00:04Z

Motivation

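Applying torch.compile to the MoE topk function by default makes decoding about 1.5 token/s faster (median decode throughput 36.78 -> 38.26 token/s in the benchmark below).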
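For readers unfamiliar with the technique, here is a minimal sketch of wrapping a top-k expert-selection routine with `torch.compile`. It is not the code changed in this PR: the function name `softmax_topk`, the softmax-then-renormalize logic, and the `dynamic=True` setting are illustrative assumptions.

```python
import torch


def softmax_topk(gating_logits: torch.Tensor, k: int):
    """Hypothetical MoE router step: pick the top-k experts per token.

    gating_logits has shape (num_tokens, num_experts).
    Returns (weights, expert_ids), each of shape (num_tokens, k).
    """
    # Turn router logits into a probability distribution over experts.
    scores = torch.softmax(gating_logits, dim=-1)
    # Keep only the k highest-scoring experts for each token.
    weights, expert_ids = torch.topk(scores, k=k, dim=-1)
    # Renormalize so the kept weights sum to 1 per token.
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, expert_ids


# Wrapping is cheap; the actual compilation is deferred to the first call.
# dynamic=True (an assumption here) asks the compiler to avoid specializing
# on the number of tokens, which changes between prefill and decode.
compiled_softmax_topk = torch.compile(softmax_topk, dynamic=True)
```

The first call to `compiled_softmax_topk` pays a one-time compilation cost; later calls reuse the generated kernels, which is where a per-step decode gain would come from.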

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8

main branch:
Prefill. latency: 1.82421 s, throughput:     70.17 token/s
Decode.  latency: 1.75760 s, throughput:      0.57 token/s
Decode.  latency: 0.02740 s, throughput:     36.49 token/s
Decode.  latency: 0.02703 s, throughput:     37.00 token/s
Decode.  latency: 0.02711 s, throughput:     36.88 token/s
Decode.  latency: 0.02711 s, throughput:     36.89 token/s
Decode.  median latency: 0.02711 s, median throughput:     36.88 token/s
Total. latency:  3.745 s, throughput:     36.32 token/s
Benchmark ...
Prefill. latency: 0.16921 s, throughput:    756.48 token/s
Decode.  latency: 0.02716 s, throughput:     36.81 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02714 s, throughput:     36.85 token/s
Decode.  latency: 0.02716 s, throughput:     36.82 token/s
Decode.  median latency: 0.02719 s, median throughput:     36.78 token/s
Total. latency:  7.107 s, throughput:     54.03 token/s

this pr:
Prefill. latency: 1.79881 s, throughput:     71.16 token/s
Decode.  latency: 0.93208 s, throughput:      1.07 token/s
Decode.  latency: 0.02643 s, throughput:     37.84 token/s
Decode.  latency: 0.02598 s, throughput:     38.49 token/s
Decode.  latency: 0.02605 s, throughput:     38.38 token/s
Decode.  latency: 0.02609 s, throughput:     38.33 token/s
Decode.  median latency: 0.02605 s, median throughput:     38.38 token/s
Total. latency:  2.887 s, throughput:     47.10 token/s
Benchmark ...
Prefill. latency: 0.17031 s, throughput:    751.59 token/s
Decode.  latency: 0.02589 s, throughput:     38.62 token/s
Decode.  latency: 0.02599 s, throughput:     38.47 token/s
Decode.  latency: 0.02640 s, throughput:     37.88 token/s
Decode.  latency: 0.02594 s, throughput:     38.56 token/s
Decode.  latency: 0.02594 s, throughput:     38.55 token/s
Decode.  median latency: 0.02613 s, median throughput:     38.26 token/s
Total. latency:  6.822 s, throughput:     56.29 token/s
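As a quick sanity check on the logs above, the following sketch computes the absolute and relative change from the reported median decode figures of the second ("Benchmark ...") run on each branch. The variable and function names are illustrative; only the four numbers are taken from the logs.

```python
# Median decode figures copied from the "Benchmark ..." runs above.
MAIN = {"latency_s": 0.02719, "throughput_tok_s": 36.78}
PR = {"latency_s": 0.02613, "throughput_tok_s": 38.26}


def relative_change_pct(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (positive means increase)."""
    return 100.0 * (after - before) / before


# Absolute throughput gain in token/s.
tok_gain = PR["throughput_tok_s"] - MAIN["throughput_tok_s"]
# Relative throughput gain in percent.
tok_gain_pct = relative_change_pct(MAIN["throughput_tok_s"], PR["throughput_tok_s"])
# Relative latency reduction in percent (sign flipped so a drop is positive).
latency_cut_pct = -relative_change_pct(MAIN["latency_s"], PR["latency_s"])

print(f"throughput gain: {tok_gain:.2f} token/s ({tok_gain_pct:.1f}%)")
print(f"latency reduction: {latency_cut_pct:.1f}%")
```

With these inputs the gain works out to roughly 1.5 token/s, or about 4% on both the throughput and latency measures.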

zhyncs · 2025-01-31T17:33:38Z

cc @merrymercy

compile topk function

cde01bc

ispobock requested review from merrymercy, Ying1123 and zhyncs as code owners January 31, 2025 17:00

Merge branch 'main' into topk-compile

5665b37

zhyncs approved these changes Jan 31, 2025

Merge branch 'main' into topk-compile

b9928ec

zhyncs merged commit 1ebe1d6 into sgl-project:main Jan 31, 2025
1 of 14 checks passed
