Adding support for encoder-decoder models, like T5 or BART #187

Closed
shermansiu opened this issue Jun 21, 2023 · 33 comments
Labels
new model Requests to new models

Comments

@shermansiu

Will there be added support for encoder-decoder models, like T5 or BART? All of the currently supported models are decoder-only.

@zhuohan123
Member

Yes, this is in our plan. Adding these models requires modifying vLLM's cache block manager to also manage the attention cache of the encoder, which is a notable modification. Feel free to talk to us if you are interested in contributing and accelerating this process.

@zhuohan123 zhuohan123 added the new model Requests to new models label Jun 21, 2023
@shermansiu
Author

So... to contribute, we would need to re-implement the model in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/<MODEL_NAME>.py, except with paged attention (i.e. replace self.attn with a paged version and use a KVCache during computation)?

It also seems like most linear projections are replaced with either ColumnParallelLinear or RowParallelLinear, right? So nn.Linear(small, big) is replaced with ColumnParallelLinear(small, big) (thus parallelizing the large number of columns) and nn.Linear(big, small) is replaced by RowParallelLinear(big, small)?
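For concreteness, here is a minimal sketch of that substitution pattern. The class names match vLLM's tensor-parallel layers, but the import path and constructor keyword arguments vary across vLLM versions, so treat the vLLM-specific parts (shown as comments) as assumptions to verify against the target version:

```python
import torch.nn as nn

# Hypothetical Hugging Face-style MLP with the small->big / big->small pattern.
class MLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # small -> big
        self.down_proj = nn.Linear(intermediate_size, hidden_size)  # big -> small

# Rough vLLM-style equivalent (import path and kwargs are assumptions that
# depend on the vLLM version in use):
#   self.up_proj = ColumnParallelLinear(hidden_size, intermediate_size,
#                                       gather_output=False)
#   self.down_proj = RowParallelLinear(intermediate_size, hidden_size,
#                                      input_is_parallel=True)
```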

@shermansiu
Author

I see you've already answered this in the FAQ here: https://vllm.readthedocs.io/en/latest/models/adding_model.html

@js8544
Contributor

js8544 commented Oct 17, 2023

@zhuohan123 Hi, I'm interested in implementing support for encoder-decoder models. Does it require any changes other than what's listed in https://vllm.readthedocs.io/en/latest/models/adding_model.html?

@js8544
Contributor

js8544 commented Nov 24, 2023

@WoosukKwon @zhuohan123 Hi, my team plans to work on T5 support. We would like to ask a few questions before we start.

  1. Is the vLLM team currently working on this, or planning to? If so, there's no point in us duplicating the effort.
  2. @zhuohan123 said above that it requires the cache block manager to also manage the attention cache of the encoder. However, AFAIU the encoder doesn't need KV caches; instead, the manager should handle the decoder's cross-attention KV cache, right?
  3. Apart from managing the cross-attention KV cache in block_manager.py and implementing the model in t5.py, are there any other components that need to change? Could you briefly describe how to implement this with minimal changes?

Any help is appreciated. Thanks in advance!

@zhuohan123
Member

@WoosukKwon @zhuohan123 Hi, my team plans to work on T5 support. We would like to ask a few questions before we start.

  1. Is the vLLM team currently working on this, or planning to? If so, there's no point in us duplicating the effort.

We are not actively working on this. Please go ahead!

  2. @zhuohan123 said above that it requires the cache block manager to also manage the attention cache of the encoder. However, AFAIU the encoder doesn't need KV caches; instead, the manager should handle the decoder's cross-attention KV cache, right?

Yeah, I think the point is to maintain the cross-attention KV cache generated from the encoder output. I believe this cache should also be included in our block manager and managed in a blocked fashion, because its size depends on the input size, which can be highly variable.
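For illustration, a minimal plain-PyTorch sketch (not vLLM code) of why the cross-attention K/V is a natural target for block-managed caching: it is computed once from the encoder output, reused at every decode step, and its size scales with the (variable) encoder input length:

```python
import torch
import torch.nn as nn

d_model, src_len = 512, 37                      # src_len varies per request
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)

encoder_out = torch.randn(1, src_len, d_model)  # [batch, src_len, d_model]
k_cross = k_proj(encoder_out)                   # computed once per request...
v_cross = v_proj(encoder_out)                   # ...then cached for all decode steps

# Every decode step reuses the same cached k_cross / v_cross:
q = torch.randn(1, 1, d_model)                  # current decoder token
scores = q @ k_cross.transpose(-1, -2) / d_model ** 0.5
out = torch.softmax(scores, dim=-1) @ v_cross   # [1, 1, d_model]
```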

  3. Apart from managing the cross-attention KV cache in block_manager.py and implementing the model in t5.py, are there any other components that need to change? Could you briefly describe how to implement this with minimal changes?

Some points I can think of:

  • You might need to change the memory profiling logic in profile_num_available_blocks(). This function profiles the maximum memory usage of the model, and it may need to be changed for the encoder-decoder structure.
  • The blocks for encoders and the blocks for decoders may need to be stored separately, since the encoder's cross-attention cache is shared, and the decoder's cache is per layer.
  • For the model in t5.py, you might need to look at the input and check whether the current call is a prompt run or a generation run. If it's a prompt run, you call the encoder, feed <sos> to the decoder, and run the first decoder step. If it's a generation run, you only call the decoder (see the sketch below).

I believe there may be other places in our code where we assume the model is decoder-only.
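For illustration, hedged pseudocode of the prompt-run vs. generation-run branching described in the last bullet above. All names here (EncoderDecoderModel, is_prompt, start_token_id, ...) are hypothetical and not vLLM's actual interfaces:

```python
# Hedged pseudocode only; not vLLM's actual interfaces.
class EncoderDecoderModel:
    def __init__(self, encoder, decoder, start_token_id: int):
        self.encoder = encoder
        self.decoder = decoder
        self.start_token_id = start_token_id  # e.g. <sos> / decoder_start_token_id

    def forward(self, input_ids, kv_caches, is_prompt: bool):
        if is_prompt:
            # Prompt run: encode the full input, then run the first decoder
            # step on the start token, populating the self-/cross-attention caches.
            encoder_out = self.encoder(input_ids)
            return self.decoder([self.start_token_id], encoder_out, kv_caches)
        # Generation run: only the decoder runs, reading the cached
        # cross-attention state from the block-managed KV cache.
        return self.decoder(input_ids, None, kv_caches)
```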

Any help is appreciated. Thanks in advance!

Thanks for taking this and please let us know if there's any issue! We are also happy to chat online if you need more detailed suggestions. Feel free to shoot me an email at zhuohan[at]berkeley.edu.

@shermansiu
Author

Also, I suppose the encoder cache eviction would be different.

i.e., the encoder's cross-attention values would need to be kept as long as decoding is active for a prompt, but they can be evicted the moment generation completes.

@shermansiu
Author

(Never mind, for the sake of simplicity, LRU should work just fine)

@simon-mo
Collaborator

cc @rib-2

@js8544
Contributor

js8544 commented Jan 31, 2024

Update: I'm very close to finishing this. I've run T5 with vLLM successfully on my local machine. I think I will be able to submit a PR in the coming weeks.

@junior-zsy

@js8544 Hello, is there any progress on this now? I would like to use it. Thank you

@Elsayed91

would this include BART?

@afeldman-nm
Contributor

Hello @js8544 thank you so much for this work. My team is very interested in encoder/decoder.

I would like to offer to help with landing this PR. How can I assist?

Once the encoder/decoder feature is landed, our team plans to integrate Whisper (automatic speech recognition) support on top of it, which motivates our interest in the encoder/decoder work. @zhuohan123 FYI this relates to

#180

@js8544
Contributor

js8544 commented Feb 29, 2024

I just submitted a draft PR: #3117. There are still some problems to solve. I would really appreciate any comments or advice.

@Elsayed91

I tried the pull request; T5 worked but BART did not.

@afeldman-nm
Contributor

afeldman-nm commented Apr 30, 2024

@Elsayed91 did you write your own BART implementation? What was the nature of the issue?

Status update on encoder/decoder models & T5:

It has become clear that the aforementioned work rightfully belongs in at least two medium-small sized PRs, rather than a single large PR:

PR 1: vLLM infrastructure to support encoder/decoder, along with unit tests

PR 2: Support for T5

  • Draft PR: TBD

My experience working on T5 integration suggests that T5's relative positional encoding relies on a "custom attention bias", which is (1) not supported by vLLM's flash_attn backend, (2) difficult to integrate efficiently into the existing vLLM workflow, and (3) really an entirely different task from encoder/decoder support. Thus T5 support belongs in its own PR.

More on the impact that custom attention bias has on working with models like T5 can be found in the comments on this post: https://twitter.com/birchlabs/status/1782791645961859142?s=46
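To make the term concrete, here is a generic attention-with-additive-bias sketch (plain PyTorch). The bias tensor stands in for T5's bucketed relative-position embedding; shapes are illustrative, and T5's exact formulation differs in details (e.g. it drops the 1/sqrt(d) scaling). The point is that paged/flash attention kernels computing softmax(QK^T/sqrt(d))V have no slot for this extra per-(query, key) term:

```python
import torch

heads, q_len, k_len, head_dim = 8, 16, 16, 64
q = torch.randn(heads, q_len, head_dim)
k = torch.randn(heads, k_len, head_dim)
v = torch.randn(heads, k_len, head_dim)

# Learned, position-dependent additive bias ("custom attention bias");
# in T5 this comes from a bucketed relative-position embedding.
rel_bias = torch.randn(heads, q_len, k_len)

scores = q @ k.transpose(-1, -2) / head_dim ** 0.5 + rel_bias
out = torch.softmax(scores, dim=-1) @ v        # [heads, q_len, head_dim]
```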

Note that Whisper support (#180) takes a dependency on encoder/decoder as well, and will also be in a separate PR.

@js8544
Contributor

js8544 commented Apr 30, 2024

(quoting @afeldman-nm's status update above)

I totally agree. The relative attention bias of T5 was very painful to implement, and it is not needed for other enc-dec models like Whisper. I can add T5 support after your enc-dec infra PR is merged.

@js8544
Contributor

js8544 commented Apr 30, 2024

BTW, BART would be simpler than T5 because it uses the original Transformer structure. Maybe we can do BART first.

@afeldman-nm
Contributor

Quick update: the PR to support cross-attention caching (#4837) has been landed. Now I am working on landing the PR to correctly invoke the attention kernel for cross-attention (#4888).

@afeldman-nm
Contributor

afeldman-nm commented Jun 21, 2024

Update:

@anonymousz97

Is there any documentation for running inference with BART-type models? Thanks.

@afeldman-nm
Contributor

afeldman-nm commented Jul 3, 2024

Hello @anonymousz97 , this PR

#4942

will include BART support & example code for invoking BART. This PR is WIP but should be ready for review soon.

@Sapessii

Sapessii commented Jul 5, 2024

Hi @afeldman-nm! Is this PR also going to support https://huggingface.co/facebook/bart-large-mnli?

thank you

@anonymousz97

Thanks, I will try it @afeldman-nm

@afeldman-nm
Contributor

afeldman-nm commented Jul 8, 2024

Update: #4888 has landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and decoder cross-attention. #4837 and #4888 (both of which have landed) were prerequisites for #4942, which completes end-to-end support for encoder/decoder models and also introduces the BART model into vLLM. #4942 is still WIP.

@anonymousz97

Does it support the MBartForConditionalGeneration model, @afeldman-nm? Thanks

@thanhlt998

Hello @anonymousz97 , this PR

#4942

will include BART support & example code for invoking BART. This PR is WIP but should be ready for review soon.

@afeldman-nm Will that PR include T5 support?

@afeldman-nm
Contributor

FYI encoder/decoder support has landed (#4942); there is an example in examples/offline_inference_encoder_decoder.py. BART has been integrated into vLLM (T5 and Whisper have not, to answer a previous question).

Currently, vLLM's encoder/decoder support is constrained in which features it is compatible with (e.g. no CUDAGraph, no pipeline parallelism, ...), so it is now a goal to make more features compatible with vLLM's encoder/decoder processing pipeline.

To that end, RFC #7366 gives an overview of the vLLM features which are currently not compatible with encoder/decoder, with an eye toward bringing vLLM's encoder/decoder support to parity with its decoder-only support.

Additionally, #7366 proposes adding custom attention bias support as well as the Whisper and T5 models.

The RFC feedback period is 1 week (until August 16th).
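For reference, a minimal offline-inference sketch in the spirit of the examples/offline_inference_encoder_decoder.py example mentioned above; the model choice, prompt, and sampling settings here are illustrative, and the example script in the repo is the authoritative reference:

```python
from vllm import LLM, SamplingParams

# BART is the first encoder/decoder model integrated into vLLM (per #4942).
llm = LLM(model="facebook/bart-large-cnn", dtype="float16")
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# A plain string prompt is fed to the encoder; the decoder starts from the
# model's decoder start token.
outputs = llm.generate(["vLLM is a high-throughput, memory-efficient "
                        "inference and serving engine for LLMs."],
                       sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```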

@DarkLight1337
Member

Closing this issue in favor of #7366

@yugaljain1999

yugaljain1999 commented Sep 12, 2024

@DarkLight1337 @afeldman-nm How long might it take to add support for T5-based models in vLLM?

Your response would be appreciated.
Thanks

@saisurbehera

Any updates on this?

@DarkLight1337
Member

DarkLight1337 commented Jan 14, 2025

For the T5 model, see #11334 + #11901, or #11470.
