
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers #9982

Merged
79 commits merged on Nov 12, 2024

Conversation

sroy745 (Collaborator) commented Nov 4, 2024

In this PR we update Mllama to run with both the xFormers and FlashAttention backends. Currently it runs only with the xFormers backend. This PR makes the following changes:

  1. Updates mllama.py to run with both xFormers and FlashAttention. Two changes were needed for this: (a) update attention_with_mask to update the KV cache appropriately depending on the backend being used, and (b) update the shape of the query when computing the attention (see the sketch after this list). Currently it works with xFormers only because the xFormers backend does not enforce the query to be of shape [num_tokens, hidden_size].
  2. Updates enc_dec_model_runner.py to no longer force the backend to xFormers for Mllama.
  3. Updates the test test_mllama.py to run with both xFormers and FlashAttention.
  4. Updates test_e2e_correctness.py to clear the backend cache at the beginning of each test run. Without this, the cached backend value gets reused across tests.
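For change 1(b), here is a minimal, self-contained sketch of the query-shape point (plain PyTorch with hypothetical shapes, not the actual mllama.py diff): the FlashAttention backend enforces a flattened [num_tokens, hidden_size] query, while the per-head view used inside the masked cross-attention path is [num_tokens, num_heads, head_size].

```python
# Minimal sketch only (plain PyTorch, hypothetical shapes), illustrating
# change 1(b): FlashAttention enforces a flattened [num_tokens, hidden_size]
# query, which the xFormers backend happened to tolerate in other shapes.
import torch

num_tokens, num_heads, head_size = 8, 32, 128

# Per-head view used while building the masked cross-attention.
q = torch.randn(num_tokens, num_heads, head_size)

# Flatten back to [num_tokens, hidden_size] before handing the query to the
# backend-agnostic attention call, so both backends see the shape they expect.
q_flat = q.view(num_tokens, num_heads * head_size)
assert q_flat.shape == (num_tokens, num_heads * head_size)
```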

sroy745 added 30 commits May 28, 2024 20:39
sroy745 (Collaborator, Author) commented Nov 6, 2024

Thanks @heheda12345 for pointing out the issue.

To summarize: if we run python3 examples/test_mllama.py on an H100, the output of the FlashAttention backend starts diverging from that of the xFormers backend, and we want to debug the reason for this.

I added some logging in mllama.py to print out the logits and debug this further. The examples/test_mllama.py script is also included in the PR. I will remove both once the debugging is done.
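The temporary logging is roughly of this form (a sketch of the idea; the exact lines added to mllama.py may differ): sort the logits so the top next-token candidates are easy to compare between the two backends.

```python
# Sketch of the kind of temporary debug logging described above (the exact
# lines in mllama.py may differ): sort the logits descending and log the top
# candidates alongside the sampled tokens.
import torch
from vllm.logger import init_logger

logger = init_logger(__name__)


def log_sampler_debug(logits: torch.Tensor, next_tokens) -> None:
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    logger.info("sorted_logits %s", sorted_logits)
    logger.info("sorted_indices %s", sorted_indices)
    logger.info("next_tokens %s", next_tokens)
```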

Output for Flash Attention Run

INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[ 13.8125,  12.7500,  12.3750,  ...,  -9.3750, -10.1250, -10.3125]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[  578,  1115,  9062,  ..., 98323, 48046, 89920]], device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=578, logprobs={578: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[18.1250, 16.5000, 15.8750,  ..., -8.6250, -8.6875, -9.0000]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[  2217,   1176,   5448,  ...,  44326, 116655,  90609]],
INFO 11-06 07:03:40 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=2217, logprobs={2217: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[19.1250, 18.3750, 17.6250,  ..., -7.3125, -7.5000, -7.7812]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[  5039,  62991,    374,  ..., 108112, 111896, 123635]],
INFO 11-06 07:03:40 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=5039, logprobs={5039: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[20.0000, 16.6250, 16.0000,  ..., -7.1250, -7.1875, -8.9375]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[   264,    279,   1403,  ...,  88885, 108602,  64170]],
INFO 11-06 07:03:40 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=264, logprobs={264: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[18.8750, 16.2500, 15.9375,  ..., -9.0625, -9.1875, -9.8125]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[  8762,   3345,  40132,  ..., 124479,  82422,  83788]],
INFO 11-06 07:03:40 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=8762, logprobs={8762: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[ 18.2500,  17.8750,  16.8750,  ...,  -9.1875,  -9.2500, -10.0000]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[ 34353,  40132,  32498,  ...,   1714, 118633,  63345]],
INFO 11-06 07:03:40 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=34353, logprobs={34353: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[ 24.5000,  15.3750,  15.1250,  ..., -10.3750, -10.9375, -11.0625]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[   569,    329,   2402,  ...,  82107, 120381,  80088]],
INFO 11-06 07:03:40 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=569, logprobs={569: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[19.5000, 16.5000, 16.2500,  ..., -9.5000, -9.5625, -9.8125]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[37085,   304,    11,  ..., 19811, 20133, 41077]], device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=37085, logprobs={37085: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:03:40 mllama.py:1146] sorted_logits tensor([[18.0000, 18.0000, 17.0000,  ..., -8.8750, -9.0625, -9.6875]],
INFO 11-06 07:03:40 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:03:40 mllama.py:1147] sorted_indices tensor([[   304,  24269,     13,  ..., 108602,  62785,  74818]],
INFO 11-06 07:03:40 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:03:40 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=304, logprobs={304: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
Processed prompts: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.13it/s, est. speed input: 13.54 toks/s, output: 10.15 toks/s]
 The image shows a male mallard duck in

Output for xFormers Run

INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[ 13.8125,  12.6875,  12.4375,  ...,  -9.3125, -10.0625, -10.2500]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[  578,  1115,  9062,  ..., 98323, 48046, 89920]], device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=578, logprobs={578: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[18.1250, 16.5000, 15.8750,  ..., -8.6250, -8.6250, -9.0625]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[  2217,   1176,   5448,  ...,  44326, 116655,  90609]],
INFO 11-06 07:06:31 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=2217, logprobs={2217: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[19.1250, 18.3750, 17.6250,  ..., -7.3125, -7.5625, -7.7812]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[  5039,  62991,    374,  ..., 108112, 111896, 123635]],
INFO 11-06 07:06:31 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=5039, logprobs={5039: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[20.0000, 16.6250, 16.0000,  ..., -7.1250, -7.1875, -8.9375]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[   264,    279,   1403,  ...,  88885, 108602,  64170]],
INFO 11-06 07:06:31 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=264, logprobs={264: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[18.8750, 16.2500, 15.9375,  ..., -9.0625, -9.2500, -9.8125]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[  8762,   3345,  40132,  ..., 124479,  82422,  83788]],
INFO 11-06 07:06:31 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=8762, logprobs={8762: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[ 18.2500,  18.0000,  16.7500,  ...,  -9.1875,  -9.3125, -10.0625]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[ 34353,  40132,  32498,  ...,  64460, 118633,  63345]],
INFO 11-06 07:06:31 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=34353, logprobs={34353: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[ 24.5000,  15.3125,  15.0625,  ..., -10.3750, -11.0000, -11.0000]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[   569,    329,   2402,  ...,  82107,  80088, 120381]],
INFO 11-06 07:06:31 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=569, logprobs={569: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[19.5000, 16.3750, 16.2500,  ..., -9.5000, -9.6250, -9.7500]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[37085,   304,    11,  ..., 19811, 20133, 41077]], device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=37085, logprobs={37085: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
INFO 11-06 07:06:31 mllama.py:1146] sorted_logits tensor([[18.1250, 18.0000, 17.0000,  ..., -8.8750, -9.0625, -9.6875]],
INFO 11-06 07:06:31 mllama.py:1146]        device='cuda:0', dtype=torch.bfloat16)
INFO 11-06 07:06:31 mllama.py:1147] sorted_indices tensor([[ 24269,    304,     13,  ..., 108602,  62785,  74818]],
INFO 11-06 07:06:31 mllama.py:1147]        device='cuda:0')
INFO 11-06 07:06:31 mllama.py:1149] next_tokens SamplerOutput(outputs=[CompletionSequenceGroupOutput(samples=[SequenceOutput(parent_seq_id=0, output_token=24269, logprobs={24269: Logprob(logprob=inf, rank=None, decoded_token=None)})], prompt_logprobs=None)], sampled_token_probs=None, sampled_token_ids=None, spec_decode_worker_metrics=None)
Processed prompts: 100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.09it/s, est. speed input: 13.10 toks/s, output: 9.82 toks/s]
 The image shows a male mallard duck swimming

The mismatch starts at the last token here ("in" vs. "swimming"). The logits of these two token IDs are very close: in the FlashAttention run both are 18.0, while in the xFormers run they are 18.125 vs. 18.0. I think this small difference in the logits causes the outputs to diverge at this token position and beyond. Overall the logits seem close for the two backends.
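As a standalone illustration of the precision point (not code from this PR): near 18 the bfloat16 grid spacing is 0.125, so two logits that differ by less than one step can collapse onto the same grid point under one backend's accumulation order but not the other's, and greedy decoding then picks a different token.

```python
# Standalone illustration, not from this PR: bfloat16 spacing near 18 is
# 0.125, so logits closer together than one step round to the same value,
# and a tiny backend-level numeric difference decides whether the top-2
# logits tie or not.
import torch

x = torch.tensor([18.00, 18.05, 18.07, 18.125], dtype=torch.float32)
print(x.to(torch.bfloat16))
# -> tensor([18.0000, 18.0000, 18.1250, 18.1250], dtype=torch.bfloat16)
```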

heheda12345 (Collaborator) left a comment


Thanks for your great work and the deep investigation into the output difference. Given that the logprobs of the two tokens are similar, I believe the difference is caused by a precision issue rather than a bug, and we can continue with this PR. Added some small suggestions.

Review threads (outdated, resolved):
tests/encoder_decoder/test_e2e_correctness.py
vllm/model_executor/models/mllama.py
vllm/worker/enc_dec_model_runner.py
sroy745 (Collaborator, Author) commented Nov 8, 2024

Thanks @heheda12345 for the review. Addressed your comments. PTAL.

heheda12345 (Collaborator) left a comment


LGTM. Thanks for the great work!
CC @ywang96

mergify bot commented Nov 9, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sroy745.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 9, 2024
ywang96 (Member) left a comment


@sroy745 LGTM! Could you please rebase this PR for the merge? Thanks

@mergify mergify bot removed the needs-rebase label Nov 11, 2024
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 12, 2024
sroy745 (Collaborator, Author) commented Nov 12, 2024

@heheda12345 / @ywang96 I resynced to head and the tests are passing. One thing to note: I had to update tests/test_config.py to skip test_is_encoder_decoder on ROCm, because the test now fails when it runs for "meta-llama/Llama-3.2-11B-Vision". The import of mllama.py fails when it tries to import the newly added FlashAttentionMetadata and XFormersMetadata. Skipping this should be fine, right, since encoder-decoder models are not supported on ROCm? PTAL
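For context, the skip is along these lines (a sketch; the exact condition and test body in tests/test_config.py may differ):

```python
# Hedged sketch of the ROCm skip described above; the actual condition and
# test body in tests/test_config.py may differ.
import pytest
import torch

# torch.version.hip is a version string on ROCm builds and None otherwise.
on_rocm = torch.version.hip is not None


@pytest.mark.skipif(
    on_rocm, reason="Encoder-decoder models are not supported on ROCm")
def test_is_encoder_decoder():
    ...
```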

@ywang96 ywang96 merged commit b41fb9d into vllm-project:main Nov 12, 2024
48 of 49 checks passed
dsikka pushed a commit to neuralmagic/vllm that referenced this pull request Nov 13, 2024
rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 13, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
Labels
ready ONLY add when PR is ready to merge/full CI is needed
3 participants