[Bug] how to solve illegal memory access in moe_align_block_size kernel optimization #3339
Comments
Have you memset the content of `token_cnts_buffer` to zero? Is it a tensor constructed by `zeros_like`? Also, I did not find the script `benchmark_deepseekv3_moe_align_blocks` in the repo; could you kindly provide it?
I have tried that, but it failed too. The script exists here.
In the benchmark script, both `cumsum` and `token_cnts_buffer` are allocated with `torch.empty` and are therefore uninitialized. Have you tried initializing both of them to zero? Because these tensors are uninitialized, the allocator may reuse memory from previous operations, leaving arbitrary initial values, so the index `rank_post_pad = token_cnt + cumsum[expert_id]` may be invalid. I'm not sure about the other buffers, but the same logic should apply.
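For example, something like this (a rough sketch; the shapes, names, and device here are illustrative and may differ from the actual benchmark script):

```python
import torch

num_experts = 256  # illustrative; DeepSeek-V3 uses 256 routed experts

# Allocate the scratch buffers as zeros instead of torch.empty, so the kernel
# never reads leftover values from previously freed allocations.
token_cnts_buffer = torch.zeros(
    (num_experts + 1) * num_experts, dtype=torch.int32, device="cuda"
)
cumsum_buffer = torch.zeros(num_experts + 1, dtype=torch.int32, device="cuda")

# If the buffers are reused across iterations, reset them in place instead:
token_cnts_buffer.zero_()
cumsum_buffer.zero_()
```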
I can give it a try, and I will submit a new temporary code branch for your review.
Thanks, you are right, I had missed initializing the tensors to zero; now the problem is solved.
Cool, thanks for the insight, and it works!
Solved in #3347.
@BBuf great work! I will rebase my multi-block partition on this version and make it as simple as possible. Once it is completed, I hope you can help review it.
Checklist
Describe the bug
As mentioned in these lines of code, when attempting to optimize the most expensive write operation of `sorted_token_ids` in the `moe_align_block_size` kernel of DeepSeek V3, using multiple Thread Blocks instead of a single Block triggers an `illegal global memory write access`. Even directly replacing these lines with the Triton Stage4 kernel here results in the same `illegal global memory write access` inside the Triton kernel. This issue has persisted for nearly a month and the cause has not yet been identified, so I am opening this issue to seek help.

Based on the kernel benchmark results on H200, the performance of the current CUDA kernel degrades to slower than Triton when the number of tokens is `>= 4096`. On H100, it degrades to slower than Triton when the number of tokens is `>= 32768`. The reason is the lines of code I pointed out earlier; if multiple Thread Blocks were used, the fastest execution speed could be achieved in all token scenarios.

Additionally, the original vLLM kernel implementation exhibits similar behavior, even though it uses shared memory for counting when the number of tokens is `<= 65536`. According to my benchmark results, its performance is still significantly slower than both the sgl-kernel CUDA operator and the Triton version, so I will not discuss the `moe_align_block_size` kernel in vLLM further.

I modified the performance-critical lines so that the write is performed element-wise by multiple Thread Blocks; a rough sketch of the idea is shown below.
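A simplified sketch of this kind of element-wise, multi-block scatter in Triton (the buffer names and the per-expert atomic counter are just for illustration, not the exact code I changed):

```python
import triton
import triton.language as tl


@triton.jit
def scatter_sorted_token_ids_kernel(
    topk_ids_ptr,          # [numel] expert id chosen for each (token, k) pair
    cumsum_ptr,            # [num_experts + 1] start offset of each expert's slice
    token_cnts_ptr,        # [num_experts] per-expert counters, MUST start at zero
    sorted_token_ids_ptr,  # output buffer padded to block_size boundaries
    numel,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < numel

    expert_id = tl.load(topk_ids_ptr + offs, mask=mask, other=0)
    # Each element claims the next free rank inside its expert's slice.
    # If token_cnts or cumsum contains uninitialized garbage, rank_post_pad can
    # point anywhere, which produces exactly this kind of illegal global write.
    rank = tl.atomic_add(token_cnts_ptr + expert_id, 1, mask=mask)
    base = tl.load(cumsum_ptr + expert_id, mask=mask, other=0)
    rank_post_pad = base + rank
    tl.store(sorted_token_ids_ptr + rank_post_pad, offs, mask=mask)


def scatter_sorted_token_ids(topk_ids, cumsum, token_cnts, sorted_token_ids):
    # Launch one program per BLOCK_SIZE elements instead of a single block.
    numel = topk_ids.numel()
    BLOCK_SIZE = 1024
    grid = (triton.cdiv(numel, BLOCK_SIZE),)
    scatter_sorted_token_ids_kernel[grid](
        topk_ids, cumsum, token_cnts, sorted_token_ids, numel, BLOCK_SIZE=BLOCK_SIZE
    )
```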
The error message obtained with compute-sanitizer is `Invalid global write of size 4 bytes`, and it happens at the line `sorted_token_ids[rank_post_pad] = i;`.

Below are the benchmark results for the sgl-kernel `moe_align_block_size` and the Triton version on H100 and H200:
H100
H200
Reproduction
Change the code above and run `python3 benchmark/kernels/fused_moe_triton/benchmark_deepseekv3_moe_align_blocks.py`.

Environment
No Limit.