optimize moe_align_kernel cuda #3347

BBuf · 2025-02-06T14:53:52Z

Thanks to @tim-zou for the help in #3339.

DeepSeek V3 end2end benchmark

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8

After starting the service, I ran benchmark/gsm8k/bench_sglang.py twice. The results shown below are from the second run.

main branch:

Accuracy: 0.953
Invalid: 0.000
Latency: 57.095 s
Output throughput: 2429.956 token/s

pr:

Accuracy: 0.952
Invalid: 0.000
Latency: 56.672 s
Output throughput: 2432.876 token/s

The token counts in this dataset probably never triggered the bad case for the moe_align_block_size kernel (that is, they stayed below 4096), so this offline benchmark shows little difference between the two branches. It does, however, confirm that accuracy stays consistent.
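For readers unfamiliar with the kernel, here is a minimal pure-Python sketch of what an align-by-block-size step computes. It assumes the usual MoE semantics (group flat token indices by expert, pad each group to a multiple of the block size); the function name, the argument names, and the use of `numel` as the padding sentinel are assumptions for illustration, not the sgl-kernel API or the code in this PR.

```python
def moe_align_block_size_reference(topk_ids, num_experts, block_size):
    """Hypothetical reference, for illustration only (not the sgl-kernel API)."""
    numel = len(topk_ids)

    # Bucket flat token indices by the expert each one was routed to.
    buckets = [[] for _ in range(num_experts)]
    for idx, expert in enumerate(topk_ids):
        buckets[expert].append(idx)

    sorted_token_ids = []
    expert_ids = []
    for expert, bucket in enumerate(buckets):
        # Round this expert's token count up to a multiple of block_size.
        padded = -(-len(bucket) // block_size) * block_size
        # Pad with numel as a sentinel (an assumption of this sketch).
        sorted_token_ids.extend(bucket + [numel] * (padded - len(bucket)))
        # One expert id per block of block_size slots.
        expert_ids.extend([expert] * (padded // block_size))

    return sorted_token_ids, expert_ids, len(sorted_token_ids)


# Five routed slots, three experts, block size 2.
ids, experts, total = moe_align_block_size_reference([1, 0, 1, 2, 1], 3, 2)
print(ids, experts, total)
```

The work grows with the number of routed token slots, which is why the total token count is the variable that matters for the bad case.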

Micro benchmark on H200

main branch:

✅ CUDA and Triton implementations match
moe-align-block-size-performance:
     batch_size  seq_len          CUDA        Triton
0           1.0      1.0     22.336001     41.983999
1           1.0      2.0     22.336001     41.951999
2           1.0      4.0     22.336001     42.016000
3           1.0      8.0     22.368001     42.176001
4           1.0     16.0     22.560000     42.431999
5           1.0     32.0     22.784000     42.656001
6           1.0     64.0     23.200000     43.744002
7           1.0    128.0     23.647999     43.232001
8           1.0    256.0     24.512000     45.024000
9           1.0    512.0     26.688000     49.568001
10          1.0   1024.0     31.456001     54.832000
11          1.0   2048.0     44.480000     63.135996
12          1.0   4096.0     84.959999     77.023998
13          1.0   8192.0    147.615999    100.128002
14          1.0  16384.0    274.975985    146.431997
15          1.0  32768.0    528.128028    241.439998
16          2.0      1.0     22.336001     41.951999
17          2.0      2.0     22.336001     41.983999
18          2.0      4.0     22.496000     42.176001
19          2.0      8.0     22.560000     42.208001
20          2.0     16.0     22.784000     42.592000
21          2.0     32.0     23.200000     43.552000
22          2.0     64.0     23.647999     43.168001
23          2.0    128.0     24.512000     45.120001
24          2.0    256.0     26.688000     49.632002
25          2.0    512.0     31.456001     54.880001
26          2.0   1024.0     44.447999     63.231997
27          2.0   2048.0     84.512003     76.959997
28          2.0   4096.0    148.192003    100.143999
29          2.0   8192.0    273.440003    146.400005
30          2.0  16384.0    528.576016    241.408005
31          2.0  32768.0   1030.447960    415.295988
32          4.0      1.0     22.336001     42.176001
33          4.0      2.0     22.399999     42.240001
34          4.0      4.0     22.592001     42.431999
35          4.0      8.0     22.624001     42.656001
36          4.0     16.0     23.232000     43.776002
37          4.0     32.0     23.680000     43.232001
38          4.0     64.0     24.512000     44.992000
39          4.0    128.0     26.688000     49.440000
40          4.0    256.0     31.520002     54.944001
41          4.0    512.0     44.351999     63.359998
42          4.0   1024.0     84.703997     77.151999
43          4.0   2048.0    147.551998    100.096002
44          4.0   4096.0    274.080008    146.272004
45          4.0   8192.0    527.872026    241.472006
46          4.0  16384.0   1029.456019    415.360004
47          4.0  32768.0   2033.727884    775.168002
48          8.0      1.0     22.336001     42.240001
49          8.0      2.0     22.592001     42.304002
50          8.0      4.0     22.655999     42.720001
51          8.0      8.0     23.232000     43.776002
52          8.0     16.0     23.680000     43.232001
53          8.0     32.0     24.544001     45.088001
54          8.0     64.0     26.720000     49.632002
55          8.0    128.0     31.488001     55.135999
56          8.0    256.0     44.383999     63.311994
57          8.0    512.0     84.959999     77.280000
58          8.0   1024.0    147.951990    100.224003
59          8.0   2048.0    273.920000    146.752000
60          8.0   4096.0    528.768003    241.536006
61          8.0   8192.0   1030.335903    415.199995
62          8.0  16384.0   2033.024073    775.103986
63          8.0  32768.0   4020.607948   1489.856005
64         16.0      1.0     22.560000     42.304002
65         16.0      2.0     22.624001     42.592000
66         16.0      4.0     23.232000     43.616001
67         16.0      8.0     23.680000     43.296002
68         16.0     16.0     24.544001     44.992000
69         16.0     32.0     26.720000     49.791999
70         16.0     64.0     31.488001     55.103999
71         16.0    128.0     44.512000     63.263997
72         16.0    256.0     84.703997     77.215999
73         16.0    512.0    147.648007    100.160003
74         16.0   1024.0    273.903996    146.367997
75         16.0   2048.0    527.935982    241.280004
76         16.0   4096.0   1028.575897    415.423989
77         16.0   8192.0   2036.128044    774.944007
78         16.0  16384.0   4026.175976   1489.375949
79         16.0  32768.0   8023.456573   2924.096107
80         32.0      1.0     22.592001     42.720001
81         32.0      2.0     23.232000     43.680001
82         32.0      4.0     23.680000     43.200001
83         32.0      8.0     24.544001     45.024000
84         32.0     16.0     26.720000     49.632002
85         32.0     32.0     31.504001     55.167999
86         32.0     64.0     44.351999     63.327998
87         32.0    128.0     84.735997     77.087998
88         32.0    256.0    147.712007    100.288004
89         32.0    512.0    273.535997    146.239996
90         32.0   1024.0    528.608024    241.503999
91         32.0   2048.0   1030.368090    415.360004
92         32.0   4096.0   2034.816027    774.176002
93         32.0   8192.0   4031.663895   1490.463972
94         32.0  16384.0   8035.440445   2955.904007
95         32.0  32768.0  16120.847702   6205.503941
96         64.0      1.0     23.152001     43.744002
97         64.0      2.0     23.680000     43.359999
98         64.0      4.0     24.544001     45.152001
99         64.0      8.0     26.752001     50.080001
100        64.0     16.0     31.552002     55.071998
101        64.0     32.0     44.512000     63.247994
102        64.0     64.0     84.687993     76.895997
103        64.0    128.0    147.376001    100.064002
104        64.0    256.0    274.015993    146.656007
105        64.0    512.0    528.447986    241.855994
106        64.0   1024.0   1027.680039    415.423989
107        64.0   2048.0   2033.216000    774.016023
108        64.0   4096.0   4032.144070   1491.392016
109        64.0   8192.0   8032.880783   2928.064108
110        64.0  16384.0  16019.264221   6206.111908
111        64.0  32768.0  31992.319107  12422.176361
112       128.0      1.0     23.680000     43.423999
113       128.0      2.0     24.351999     45.088001
114       128.0      4.0     26.736001     49.759999
115       128.0      8.0     31.520002     55.167999
116       128.0     16.0     44.399999     63.359998
117       128.0     32.0     84.480003     76.895997
118       128.0     64.0    147.648007    100.160003
119       128.0    128.0    274.031997    146.528006
120       128.0    256.0    528.864026    242.016003
121       128.0    512.0   1029.599905    415.455997
122       128.0   1024.0   2035.487890    773.631990
123       128.0   2048.0   4036.432266   1490.815997
124       128.0   4096.0   8039.327621   2928.960085
125       128.0   8192.0  16032.815933   6205.408096
126       128.0  16384.0  31988.735199  12427.103996
127       128.0  32768.0  64045.059204  24743.759155
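The 4096-token threshold can be read off the main-branch table: the timings track batch_size × seq_len, and the CUDA kernel falls behind Triton once that product reaches 4096. A small sketch that checks this on a handful of rows copied from the table above (the helper name is hypothetical; times are in whatever unit the benchmark script reports):

```python
# (batch_size, seq_len, cuda_time, triton_time), copied from the main-branch table.
MAIN_ROWS = [
    (1, 1024, 31.456001, 54.832000),
    (1, 2048, 44.480000, 63.135996),
    (1, 4096, 84.959999, 77.023998),
    (1, 8192, 147.615999, 100.128002),
    (2, 1024, 44.447999, 63.231997),
    (2, 2048, 84.512003, 76.959997),
    (4, 512, 44.351999, 63.359998),
    (4, 1024, 84.703997, 77.151999),
    (128, 16, 44.399999, 63.359998),
    (128, 32, 84.480003, 76.895997),
]


def first_slow_token_count(rows):
    """Smallest batch_size * seq_len at which CUDA is slower than Triton."""
    slow = [b * s for b, s, cuda, triton in rows if cuda > triton]
    return min(slow) if slow else None


print(first_slow_token_count(MAIN_ROWS))
```

This only covers the sampled rows; it is a quick sanity check, not a replacement for the full table.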

pr:

sgl-kernel python3 /mnt/co-research/home/yineng/bbuf/sglang/benchmark/kernels/fused_moe_triton/benchmark_deepseekv3_moe_align_blocks.py
✅ CUDA and Triton implementations match
moe-align-block-size-performance:
     batch_size  seq_len          CUDA        Triton
0           1.0      1.0     22.992000     41.855998
1           1.0      2.0     23.008000     41.951999
2           1.0      4.0     23.135999     42.016000
3           1.0      8.0     23.167999     42.112000
4           1.0     16.0     23.360001     42.335998
5           1.0     32.0     23.391999     42.656001
6           1.0     64.0     23.647999     43.520000
7           1.0    128.0     23.936000     43.264002
8           1.0    256.0     24.224000     45.024000
9           1.0    512.0     24.768000     49.536001
10          1.0   1024.0     26.240001     55.039998
11          1.0   2048.0     31.936001     63.295998
12          1.0   4096.0     45.791999     76.991998
13          1.0   8192.0     67.680001    100.128002
14          1.0  16384.0    112.287998    146.464005
15          1.0  32768.0    205.696002    241.952002
16          2.0      1.0     23.056000     42.112000
17          2.0      2.0     23.072001     41.935999
18          2.0      4.0     23.167999     42.144001
19          2.0      8.0     23.360001     42.367999
20          2.0     16.0     23.584001     42.784002
21          2.0     32.0     23.615999     43.744002
22          2.0     64.0     23.808001     43.296002
23          2.0    128.0     24.224000     45.056000
24          2.0    256.0     24.672000     49.440000
25          2.0    512.0     26.272001     54.976001
26          2.0   1024.0     32.000002     63.104004
27          2.0   2048.0     45.664001     77.215999
28          2.0   4096.0     67.648001    100.256003
29          2.0   8192.0    112.159997    146.528006
30          2.0  16384.0    205.504000    242.175996
31          2.0  32768.0    383.807987    415.919989
32          4.0      1.0     23.135999     41.983999
33          4.0      2.0     23.167999     42.272002
34          4.0      4.0     23.200000     42.208001
35          4.0      8.0     23.615999     42.752001
36          4.0     16.0     23.647999     43.648001
37          4.0     32.0     23.840001     43.327998
38          4.0     64.0     24.256000     45.088001
39          4.0    128.0     24.704000     49.679998
40          4.0    256.0     26.272001     55.008002
41          4.0    512.0     31.968001     63.295998
42          4.0   1024.0     45.791999     77.151999
43          4.0   2048.0     67.744002    100.192003
44          4.0   4096.0    112.240002    146.559998
45          4.0   8192.0    205.568001    242.336005
46          4.0  16384.0    383.904010    415.695995
47          4.0  32768.0    747.680008    775.200009
48          8.0      1.0     23.167999     42.272002
49          8.0      2.0     23.184000     42.240001
50          8.0      4.0     23.584001     42.592000
51          8.0      8.0     23.808001     43.744002
52          8.0     16.0     23.808001     43.296002
53          8.0     32.0     24.224000     45.056000
54          8.0     64.0     24.736000     49.408000
55          8.0    128.0     26.400000     55.071998
56          8.0    256.0     31.936001     63.295998
57          8.0    512.0     45.759998     77.055998
58          8.0   1024.0     67.616001     99.936001
59          8.0   2048.0    112.063996    146.592006
60          8.0   4096.0    205.472007    241.919994
61          8.0   8192.0    383.855999    415.327996
62          8.0  16384.0    747.712016    775.936007
63          8.0  32768.0   1449.759960   1489.984035
64         16.0      1.0     23.360001     42.431999
65         16.0      2.0     23.568001     42.720001
66         16.0      4.0     23.647999     43.584000
67         16.0      8.0     23.871999     43.264002
68         16.0     16.0     24.032000     45.120001
69         16.0     32.0     24.736000     49.663998
70         16.0     64.0     26.432000     55.039998
71         16.0    128.0     31.936001     63.231997
72         16.0    256.0     45.728002     77.263996
73         16.0    512.0     67.680001    100.192003
74         16.0   1024.0    112.287998    146.623999
75         16.0   2048.0    205.632001    242.336005
76         16.0   4096.0    384.207994    416.047990
77         16.0   8192.0    746.752024    774.944007
78         16.0  16384.0   1450.464010   1489.567995
79         16.0  32768.0   2905.888081   2934.848070
80         32.0      1.0     23.360001     42.688001
81         32.0      2.0     23.615999     43.680001
82         32.0      4.0     23.808001     43.359999
83         32.0      8.0     24.064001     45.088001
84         32.0     16.0     24.672000     49.504001
85         32.0     32.0     26.272001     55.039998
86         32.0     64.0     31.936001     63.199997
87         32.0    128.0     45.791999     77.119999
88         32.0    256.0     67.648001     99.936001
89         32.0    512.0    112.032004    146.559998
90         32.0   1024.0    205.504000    241.919994
91         32.0   2048.0    383.904010    415.423989
92         32.0   4096.0    746.944010    774.399996
93         32.0   8192.0   1449.695945   1490.944028
94         32.0  16384.0   2904.160023   2950.880051
95         32.0  32768.0   5746.111870   6202.943802
96         64.0      1.0     23.647999     43.968000
97         64.0      2.0     23.840001     43.359999
98         64.0      4.0     24.064001     45.120001
99         64.0      8.0     24.736000     49.727999
100        64.0     16.0     26.272001     55.135999
101        64.0     32.0     31.936001     63.359998
102        64.0     64.0     45.696001     76.863997
103        64.0    128.0     67.776002    100.224003
104        64.0    256.0    112.287998    146.607995
105        64.0    512.0    205.568001    241.919994
106        64.0   1024.0    383.967996    415.583998
107        64.0   2048.0    746.528029    773.984015
108        64.0   4096.0   1450.176001   1491.359949
109        64.0   8192.0   2902.623892   2924.191952
110        64.0  16384.0   5711.135864   6205.567837
111        64.0  32768.0  11378.128052  12420.703888
112       128.0      1.0     24.032000     43.391999
113       128.0      2.0     24.240000     45.088001
114       128.0      4.0     24.704000     49.695998
115       128.0      8.0     26.256001     55.008002
116       128.0     16.0     31.936001     63.455999
117       128.0     32.0     45.696001     77.215999
118       128.0     64.0     67.776002    100.128002
119       128.0    128.0    112.255998    146.656007
120       128.0    256.0    205.599993    241.983995
121       128.0    512.0    384.032011    415.776014
122       128.0   1024.0    747.039974    773.920000
123       128.0   2048.0   1450.335979   1490.592003
124       128.0   4096.0   2902.335882   2926.687956
125       128.0   8192.0   5711.808205   6211.552143
126       128.0  16384.0  11372.848511  12420.160294
127       128.0  32768.0  22776.657104  24721.887589

This PR resolves all of the bad-case performance issues, so we can now consider removing the Triton moe_align_block_size implementation.
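To put rough numbers on the improvement, the sketch below computes the CUDA speedup of this PR over main for a few rows copied from the two tables above, and checks whether the PR's CUDA kernel beats Triton on each. The helper name and the row selection are mine, chosen for illustration.

```python
# (batch_size, seq_len, main_cuda, pr_cuda, pr_triton), copied from the tables above.
ROWS = [
    (1, 4096, 84.959999, 45.791999, 76.991998),
    (1, 32768, 528.128028, 205.696002, 241.952002),
    (16, 32768, 8023.456573, 2905.888081, 2934.848070),
    (128, 32768, 64045.059204, 22776.657104, 24721.887589),
]


def summarize(rows):
    """Return (total_tokens, speedup_over_main, cuda_beats_triton) per row."""
    out = []
    for batch, seq, main_cuda, pr_cuda, pr_triton in rows:
        out.append((batch * seq, main_cuda / pr_cuda, pr_cuda < pr_triton))
    return out


for tokens, speedup, beats_triton in summarize(ROWS):
    print(f"tokens={tokens:>8}  speedup={speedup:.2f}x  faster_than_triton={beats_triton}")
```

On these sampled rows the speedup grows with the token count, and the PR's CUDA kernel is faster than Triton in each case.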

@zhyncs

zhyncs · 2025-02-06T15:47:35Z

@HaiShaw @saienduri I don't think AMD CI's failure has anything to do with this PR. Can you please take a look?

BBuf added 2 commits February 6, 2025 12:28

refine

51f8e2d

fused moe

ef5161c

BBuf requested review from zhyncs, ispobock, HandH1998, yizhang2077, merrymercy and HaiShaw as code owners February 6, 2025 14:53

BBuf added 2 commits February 6, 2025 14:55

upd

13aa1f3

upd

3151846

BBuf mentioned this pull request Feb 6, 2025

[Bug] how to solve illegal memory access in moe_align_block_size kernel optimization #3339

Closed

5 tasks

zhyncs added 2 commits February 6, 2025 23:47

Merge branch 'main' into temp_1

66a4188

Merge branch 'main' into temp_1

85bd2fe

zhyncs mentioned this pull request Feb 6, 2025

fix sgl-kernel build failure on AMD #3352

Merged

5 tasks

zhyncs approved these changes Feb 6, 2025

zhyncs merged commit cdae77b into main Feb 6, 2025
20 of 21 checks passed

zhyncs deleted the temp_1 branch February 6, 2025 16:53

yiakwy-xpu-ml-framework-team mentioned this pull request Feb 7, 2025

[Feature]: enable multi-blocks execution for moe align kernel ROCm/aiter#107

Open
