Fix block wise fp8 torch compile #3232

Merged (1 commit) on Jan 31, 2025


@ispobock (Collaborator) commented Jan 31, 2025

Motivation

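Fix torch.compile support for the block-wise FP8 linear layer. The two bench_one_batch runs below compare DeepSeek-V3 decoding (TP 8, batch size 1) without and with --enable-torch-compile.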

python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3  --trust-remote-code --tp 8

Prefill. latency: 1.82421 s, throughput:     70.17 token/s
Decode.  latency: 1.75760 s, throughput:      0.57 token/s
Decode.  latency: 0.02740 s, throughput:     36.49 token/s
Decode.  latency: 0.02703 s, throughput:     37.00 token/s
Decode.  latency: 0.02711 s, throughput:     36.88 token/s
Decode.  latency: 0.02711 s, throughput:     36.89 token/s
Decode.  median latency: 0.02711 s, median throughput:     36.88 token/s
Total. latency:  3.745 s, throughput:     36.32 token/s
Benchmark ...
Prefill. latency: 0.16921 s, throughput:    756.48 token/s
Decode.  latency: 0.02716 s, throughput:     36.81 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02713 s, throughput:     36.86 token/s
Decode.  latency: 0.02714 s, throughput:     36.85 token/s
Decode.  latency: 0.02716 s, throughput:     36.82 token/s
Decode.  median latency: 0.02719 s, median throughput:     36.78 token/s
Total. latency:  7.107 s, throughput:     54.03 token/s


python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3  --trust-remote-code --tp 8 --enable-torch-compile --torch-compile-max-bs 1
Prefill. latency: 1.85489 s, throughput:     69.01 token/s
Decode.  latency: 0.34461 s, throughput:      2.90 token/s
Decode.  latency: 0.02107 s, throughput:     47.46 token/s
Decode.  latency: 0.02078 s, throughput:     48.13 token/s
Decode.  latency: 0.02073 s, throughput:     48.23 token/s
Decode.  latency: 0.02075 s, throughput:     48.20 token/s
Decode.  median latency: 0.02078 s, median throughput:     48.13 token/s
Total. latency:  2.325 s, throughput:     58.50 token/s
Benchmark ...
Prefill. latency: 0.17728 s, throughput:    722.03 token/s
Decode.  latency: 0.02077 s, throughput:     48.15 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02075 s, throughput:     48.19 token/s
Decode.  latency: 0.02074 s, throughput:     48.22 token/s
Decode.  median latency: 0.02092 s, median throughput:     47.81 token/s
Total. latency:  5.497 s, throughput:     69.86 token/s
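
To make the comparison between the two runs explicit, the following is a minimal sketch that parses the final "median" decode line of each run and computes the relative improvement. The helper name `parse_median` and the regular expression are illustrative assumptions and are not part of sglang; the two log lines are copied verbatim from the output above.

```python
import re

# Final median decode lines copied from the two benchmark runs above.
BASELINE = "Decode.  median latency: 0.02719 s, median throughput:     36.78 token/s"
COMPILED = "Decode.  median latency: 0.02092 s, median throughput:     47.81 token/s"

# Hypothetical pattern matching the log format shown in this PR.
PATTERN = re.compile(
    r"median latency:\s*([\d.]+) s, median throughput:\s*([\d.]+) token/s"
)


def parse_median(line):
    """Return (latency_seconds, throughput_tokens_per_second) from a median line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a median decode line: {line!r}")
    return float(match.group(1)), float(match.group(2))


base_latency, base_tput = parse_median(BASELINE)
comp_latency, comp_tput = parse_median(COMPILED)

# Ratio > 1 means the torch.compile run is faster.
latency_speedup = base_latency / comp_latency
throughput_gain = comp_tput / base_tput
latency_reduction = 1.0 - comp_latency / base_latency

print(f"decode latency speedup: {latency_speedup:.2f}x")
print(f"decode throughput gain: {throughput_gain:.2f}x")
print(f"decode latency reduction: {latency_reduction:.1%}")
```

On these numbers the median decode step is roughly 1.3x faster with torch.compile enabled, while prefill latency is essentially unchanged.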

@zhyncs merged commit c02e313 into sgl-project:main on Jan 31, 2025
1 of 14 checks passed
@ispobock (Collaborator, Author) commented:

Accuracy:

python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 1

Accuracy: 0.950
Invalid: 0.000
Latency: 452.187 s
Output throughput: 43.175 token/s

@zhyncs mentioned this pull request on Jan 31, 2025