Tune paged attention parameters for AMD GPU. #3255

whchung · 2025-02-01T16:15:08Z

Motivation

Fine-tune the performance of the SGLang paged attention kernel on AMD MI GPUs for LLM workloads.

Modifications

Changes:

num_kv_splits : 8 -> 16
BLOCK : 64 -> 8
num_warps : 2 -> 1 when kv_group_num is greater than 1
waves_per_eu : 4 -> 1 in the grouped paged attention kernel

These knobs have been tested with a couple of workloads on the AMD ROCm platform.

Changes:
- num_kv_splits
- BLOCK
- num_warps

Changed:
- waves_per_eu

Changed:
- num_stages

Apply the tuned knob values only on the AMD ROCm platform, gated by is_hip checks.
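The platform gating described here can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DecodeKnobs`, `BASELINE`, and `tuned_knobs` are hypothetical names, the baseline uses the "before" values listed in the description, and it is an assumption that the grouped kernel is the one used when kv_group_num is greater than 1 and that BLOCK applies regardless of grouping. num_stages is left out because its values are not stated.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecodeKnobs:
    """Hypothetical container for the launch knobs named in the description."""
    num_kv_splits: int
    block: int
    num_warps: int
    waves_per_eu: int


# "Before" values as listed in the PR description (assumed to be the fallback).
BASELINE = DecodeKnobs(num_kv_splits=8, block=64, num_warps=2, waves_per_eu=4)


def tuned_knobs(on_rocm: bool, kv_group_num: int) -> DecodeKnobs:
    """Return the tuned values on ROCm and the baseline everywhere else."""
    if not on_rocm:
        return BASELINE
    # Changes listed without a grouping condition.
    knobs = replace(BASELINE, num_kv_splits=16, block=8)
    # Changes tied to the grouped case (assumption: grouped means kv_group_num > 1).
    if kv_group_num > 1:
        knobs = replace(knobs, num_warps=1, waves_per_eu=1)
    return knobs
```

In real code the `on_rocm` flag would come from an `is_hip()` style check, as in the `server_args.py` hunk under review; passing it in as a plain boolean keeps the sketch testable without a GPU.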

HaiShaw · 2025-02-02T01:25:33Z

python/sglang/srt/server_args.py

+        # AMD-specific Triton attention KV splits default number
+        if is_hip():
+            self.triton_attention_num_kv_splits = 16
+


@whchung we noticed that the default of 8 works better in most cases. Is 16 slightly better in what you see, or better for the long sequences you are interested in?

This PR boosts grok, etc., as observed.
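For context on the 8-versus-16 question, here is a minimal pure-Python sketch of the general split-KV decoding idea: each split computes a partial softmax over its slice of the KV sequence, and the partials are merged with a log-sum-exp rescaling. It is not SGLang's Triton kernel, and `attend` and `attend_split` are hypothetical names. Under these assumptions the split count does not change the result, so choosing it is a performance trade-off between more parallel work and a larger merge step.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Reference: single-query softmax attention over the whole KV sequence."""
    scores = [_dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attend_split(q, keys, values, num_kv_splits):
    """Split-KV variant: per-split partial results, then a log-sum-exp merge."""
    n = len(keys)
    dim = len(values[0])
    chunk = -(-n // num_kv_splits)  # ceiling division; at least 1 when n >= 1
    partials = []
    for start in range(0, n, chunk):
        ks = keys[start:start + chunk]
        vs = values[start:start + chunk]
        scores = [_dot(q, k) for k in ks]
        m = max(scores)                      # local max for numerical stability
        w = [math.exp(s - m) for s in scores]
        acc = [sum(wi * v[d] for wi, v in zip(w, vs)) for d in range(dim)]
        partials.append((m, sum(w), acc))    # (local max, local sum, weighted values)
    # Merge: rescale every partial to the global max before summing.
    g = max(m for m, _, _ in partials)
    denom = sum(l * math.exp(m - g) for m, l, _ in partials)
    return [sum(acc[d] * math.exp(m - g) for m, _, acc in partials) / denom
            for d in range(dim)]
```

Calling `attend_split(q, keys, values, 8)` and `attend_split(q, keys, values, 16)` on the same inputs should agree with `attend(q, keys, values)` up to floating-point rounding. If there are fewer KV entries than requested splits, the sketch simply produces fewer splits.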

whchung added 2 commits February 1, 2025 10:10

Tune paged attention parameters for AMD GPU.

c145acb

Changes: - num_kv_splits - BLOCK - num_warps

Additional tuning for grouped page attention kernel.

9c5980e

Changed: - waves_per_eu

zhyncs requested a review from HaiShaw February 1, 2025 17:26

whchung and others added 3 commits February 1, 2025 11:30

Additional tuning for grouped paged attention kernel

eb34954

Changed: - num_stages

Segregate AMD-specific tuning with is_hip checks

5c095e5

Make the tuning on the knobs only applicable on AMD ROCm platform via is_hip checks.

Merge branch 'main' into whchung/exp-amd-pa

bf36b5b

whchung marked this pull request as ready for review February 1, 2025 23:09

whchung requested review from merrymercy, Ying1123, zhyncs, ispobock, hnyls2002 and ByronHsu as code owners February 1, 2025 23:09

andyluo7 mentioned this pull request Feb 2, 2025

[Bug] constant errors + hangs using sglang + deepseek v3 + AMD (httpcore.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)) #3198

Open

5 tasks

tot0 pushed a commit to tot0/sglang that referenced this pull request Feb 2, 2025

Apply amd paged attention optimization, ref: sgl-project#3255

f2d173a

HaiShaw reviewed Feb 2, 2025

HaiShaw approved these changes Feb 2, 2025

HaiShaw merged commit d9eb935 into sgl-project:main Feb 2, 2025
10 checks passed

HaiShaw added the good first issue Good for newcomers label Feb 2, 2025
