[AMD] [Model] DeepSeek tunings #13199
What is the difference between MI300XHF_OAM and MI300X_OAM? Also, it seems "_OAM" doesn't exist for MI300X in the currently existing configs, so why is it included now?
This filename still has the space
Fixed
Fixed
Still have OAM in it, and XHF too.
vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=AMD_Instinct_MI300XHF_OAM,dtype=fp8_w8a8,block_shape=[128,128].json
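For context, the device segment in these config filenames is generally derived from the GPU name the runtime reports, with spaces replaced by underscores, which is also how a stray space can leak into a filename. A minimal sketch of that convention (the function name and exact format string here are assumptions modeled on vLLM's fused-MoE config loader, not taken from this PR's branch):

```python
# Hedged sketch of how fused-MoE config filenames are typically assembled.
# The function name and format string are assumptions, not this PR's code.
import torch

def get_config_file_name(E: int, N: int, dtype: str,
                         block_shape: list[int] | None = None) -> str:
    # e.g. "AMD Instinct MI300X OAM" -> "AMD_Instinct_MI300X_OAM";
    # a device name that keeps a space would produce the bad filename noted above.
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    block = f",block_shape={block_shape}".replace(" ", "") if block_shape else ""
    return f"E={E},N={N},device_name={device_name},dtype={dtype}{block}.json"

# get_config_file_name(256, 256, "fp8_w8a8", [128, 128]) would yield:
# "E=256,N=256,device_name=AMD_Instinct_MI300X_OAM,dtype=fp8_w8a8,block_shape=[128,128].json"
```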
They don't have Divakar's changes yet.
benchmarks/kernels/benchmark_moe.py
Outdated
     save_configs(best_configs, E, shard_intermediate_size, hidden_size,
-                 topk, dtype, use_fp8_w8a8, use_int8_w8a16)
+                 topk, dtype, use_fp8_w8a8, use_int8_w8a16,
+                 block_quant_shape)
It looks like you only save the block_quant_shape in the config but don't consider it for the kernel tuning - I would think this is important for the tuner to have?
This was added by mistake. I meant to just add the tunings that worked for us.
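For reference, the concern above is that block_shape selects a different fused-MoE kernel variant, so a tuner that only records it would save configs measured against the wrong kernel. A minimal sketch of what threading it through the tuning loop could look like (all names here are hypothetical, not from benchmark_moe.py):

```python
# Hypothetical sketch: block_quant_shape should influence the benchmarked
# kernel, not just the saved config file. Names are illustrative only.
from typing import Callable, Optional

def tune_moe_kernel(
    search_space: list[dict],
    benchmark_config: Callable[[dict, Optional[list[int]]], float],
    block_quant_shape: Optional[list[int]] = None,
) -> dict:
    best_config, best_latency = {}, float("inf")
    for config in search_space:
        # Measure the same block-quantized kernel variant that the saved
        # config (...block_shape=[128,128].json) will later be loaded for.
        latency = benchmark_config(config, block_quant_shape)
        if latency < best_latency:
            best_config, best_latency = config, latency
    return best_config
```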
Turns out we don't need the XHF files, so I removed those.
This PR adds some tunings to improve ROCm performance for DeepSeek.
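The tunings are JSON files like the one named earlier: each maps a token count M to Triton launch parameters for the fused-MoE kernel. An illustrative entry (the values are invented for illustration; the keys follow vLLM's fused_moe config convention):

```python
# Illustrative fused-MoE tuning entry; values invented, keys per the
# fused_moe config convention.
example_entry = {
    "1": {                    # M: number of tokens in this batch bucket
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 2,
    },
}
```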
latency command: VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON=1 VLLM_FP8_PADDING=0 python3 benchmarks/benchmark_latency.py --model "deepseek-ai/DeepSeek-V3" -tp 8 --trust-remote-code --max-model-len 32768 --load-format "dummy" --input-len 128 --output-len 32 --batch-size 32 --num-iters 5 --num-iters-warmup 2
Before:
PPL=3.090357290962561
Avg latency: 3.0797035929979755 seconds
After:
PPL=3.0888253522260
Avg latency: 2.5471135329920798 seconds
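Net effect: roughly a 1.21x speedup (about 17% lower average latency) at essentially unchanged perplexity.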