Fix block wise fp8 torch compile #3232

ispobock · 2025-01-31T11:54:54Z

Motivation

Fix torch.compile for the block-wise FP8 linear layer.

Without torch compile:

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8

Prefill. latency: 1.82421 s, throughput:     70.17 token/s
Decode.  latency: 1.75760 s, throughput:      0.57 token/s
Decode.  latency: 0.02740 s, throughput:     36.49 token/s
Decode.  latency: 0.02703 s, throughput:     37.00 token/s
Decode.  latency: 0.02711 s, throughput:     36.88 token/s
Decode.  latency: 0.02711 s, throughput:     36.89 token/s
Decode.  median latency: 0.02711 s, median throughput:     36.88 token/s
Total. latency:  3.745 s, throughput:     36.32 token/s
Benchmark ...
Prefill. latency: 0.16921 s, throughput:    756.48 token/s
Decode.  latency: 0.02716 s, throughput:     36.81 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02714 s, throughput:     36.85 token/s
Decode.  latency: 0.02716 s, throughput:     36.82 token/s
Decode.  median latency: 0.02719 s, median throughput:     36.78 token/s
Total. latency:  7.107 s, throughput:     54.03 token/s


With torch compile enabled:

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --enable-torch-compile --torch-compile-max-bs 1
Prefill. latency: 1.85489 s, throughput:     69.01 token/s
Decode.  latency: 0.34461 s, throughput:      2.90 token/s
Decode.  latency: 0.02107 s, throughput:     47.46 token/s
Decode.  latency: 0.02078 s, throughput:     48.13 token/s
Decode.  latency: 0.02073 s, throughput:     48.23 token/s
Decode.  latency: 0.02075 s, throughput:     48.20 token/s
Decode.  median latency: 0.02078 s, median throughput:     48.13 token/s
Total. latency:  2.325 s, throughput:     58.50 token/s
Benchmark ...
Prefill. latency: 0.17728 s, throughput:    722.03 token/s
Decode.  latency: 0.02077 s, throughput:     48.15 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02074 s, throughput:     48.22 token/s
Decode.  median latency: 0.02092 s, median throughput:     47.81 token/s
Total. latency:  5.497 s, throughput:     69.86 token/s
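
For reference, the relative decode gain implied by the two logs can be worked out from the reported medians. This is a minimal sketch: the numbers are copied by hand from the run that follows the "Benchmark ..." line in each log, and the helper names are hypothetical, not part of sglang.

```python
# Median decode figures copied from the logs above (run after "Benchmark ...").
# The dict keys and helper names are illustrative assumptions only.
baseline = {"latency_s": 0.02719, "throughput_tok_s": 36.78}  # no torch compile
compiled = {"latency_s": 0.02092, "throughput_tok_s": 47.81}  # --enable-torch-compile


def throughput_speedup(base: dict, new: dict) -> float:
    """Ratio of new to baseline median decode throughput."""
    return new["throughput_tok_s"] / base["throughput_tok_s"]


def latency_reduction(base: dict, new: dict) -> float:
    """Fraction by which median decode latency dropped."""
    return 1.0 - new["latency_s"] / base["latency_s"]


if __name__ == "__main__":
    print(f"decode throughput speedup: {throughput_speedup(baseline, compiled):.2f}x")
    print(f"decode latency reduction:  {latency_reduction(baseline, compiled):.1%}")
```

On these figures the median decode throughput is roughly 1.3x the baseline at batch size 1. Prefill latency is essentially unchanged between the two runs.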

ispobock · 2025-01-31T11:56:34Z

Accuracy:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 1

Accuracy: 0.950
Invalid: 0.000
Latency: 452.187 s
Output throughput: 43.175 token/s

fix block wise fp8 torch compile

441e411

ispobock requested review from merrymercy, Ying1123 and zhyncs as code owners January 31, 2025 11:54

zhyncs approved these changes Jan 31, 2025


zhyncs merged commit c02e313 into sgl-project:main Jan 31, 2025
1 of 14 checks passed

zhyncs mentioned this pull request Jan 31, 2025

chore: bump v0.4.2.post1 #3233

Merged

