[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers #9982
Conversation
Pull from head
Thanks @heheda12345 for pointing out the issue. To summarize: if we run python3 examples/test_mllama.py on H100, the output of the flash-attention backend starts diverging from that of the xFormers backend, and we want to debug the reason for this. I added some logs to mllama.py that print out the logits to debug this further. examples/test_mllama.py is also included in the PR; I will remove both once the debugging is over.
Output for FlashAttention run
Output for xFormers run
The mismatch starts at the last token here ("in" vs "swimming"). The logits of these two token ids are very close (in the FlashAttention run both are 18, while in the xFormers run they are 18.125 vs 18). I think this difference in the logits is what causes the outputs to differ at this token position and beyond. Overall the logits look close for the two backends.
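For reference, a small helper of the kind useful for this sort of debugging (illustrative only, not the code added in this PR; the token ids below are made up) that locates the first position where the two backends diverge:

```python
from typing import Optional, Sequence


def first_divergence(flash_token_ids: Sequence[int],
                     xformers_token_ids: Sequence[int]) -> Optional[int]:
    """Return the index of the first token where the two runs differ."""
    for i, (a, b) in enumerate(zip(flash_token_ids, xformers_token_ids)):
        if a != b:
            return i
    return None  # identical over the common prefix


# Hypothetical token ids standing in for the two runs above: they match until
# the final position, where one backend emits "in" and the other "swimming".
flash_ids = [791, 2217, 5039, 264, 8415, 304]
xformers_ids = [791, 2217, 5039, 264, 8415, 16587]
print(first_divergence(flash_ids, xformers_ids))  # -> 5
```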
Thanks for your great work and the deep investigation into the output difference. Given that the logprobs of the two tokens are similar, I believe the difference is caused by a precision issue rather than a bug, and we can continue with this PR. Added some small suggestions.
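To illustrate the precision argument with the numbers quoted above (dtype assumed to be bf16, where 0.125 is a single representable step at this magnitude): the two logit vectors compare as numerically equal under that tolerance, yet greedy decoding can still select different tokens.

```python
import torch

# Top-2 logits at the divergent position, as quoted above.
flash_logits = torch.tensor([18.0, 18.0])       # FlashAttention: exact tie
xformers_logits = torch.tensor([18.0, 18.125])  # xFormers: second token wins

# With a tolerance of one bf16-sized step the backends agree numerically...
print(torch.allclose(flash_logits, xformers_logits, atol=0.125))  # True

# ...but argmax can still differ: a tie on one side vs. a clear winner at
# index 1 on the other, which is where the generated outputs start to diverge.
print(int(flash_logits.argmax()), int(xformers_logits.argmax()))
```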
Thanks @heheda12345 for the review. Addressed your comments. PTAL.
LGTM. Thanks for the great work!
CC @ywang96
This pull request has merge conflicts that must be resolved before it can be merged.
@sroy745 LGTM! Could you please rebase this PR for the merge? Thanks
@heheda12345 / @ywang96 I resynced to head and the tests are passing. One thing to note: I had to update tests/test_config.py to skip test_is_encoder_decoder on ROCm. That test now fails when it runs for "meta-llama/Llama-3.2-11B-Vision" because importing mllama.py fails on the newly added imports of FlashAttentionMetadata and XFormersMetadata. Skipping it should be fine since encoder-decoder models are not supported on ROCm, right? PTAL
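For context, a hedged sketch of the kind of skip guard this refers to (the helper, parametrization, and reason string are illustrative, not necessarily the exact code in tests/test_config.py):

```python
import pytest
import torch


def is_rocm() -> bool:
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    return torch.version.hip is not None


@pytest.mark.skipif(
    is_rocm(),
    reason="Encoder-decoder models are not supported on ROCm; importing "
           "mllama.py fails on the FlashAttention/xFormers metadata imports.")
@pytest.mark.parametrize(("model_id", "expected"), [
    ("facebook/bart-base", True),
    ("meta-llama/Llama-3.2-11B-Vision", True),
])
def test_is_encoder_decoder(model_id: str, expected: bool):
    # Body elided: the real test checks the model config's
    # is_encoder_decoder flag against the expected value.
    ...
```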
In this PR we update Mllama to run with both the xFormers and FlashAttention backends. Currently it runs only with the xFormers backend. This PR makes the following changes:
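As a minimal usage sketch of what this enables (assuming the VLLM_ATTENTION_BACKEND environment variable, which vLLM reads to force a particular attention backend; the model arguments, prompt format, and image path are illustrative):

```python
import os

# Select the attention backend before vLLM initializes its layers;
# after this PR both values should work for Mllama.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # or "XFORMERS"

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision",
    max_model_len=4096,
    max_num_seqs=2,
    enforce_eager=True,
)

image = Image.open("sample.jpg")  # any local test image (placeholder path)
outputs = llm.generate(
    {
        "prompt": "<|image|><|begin_of_text|>The image shows",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```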