
Add evolla rebase main #36232

Open
zhoubay wants to merge 82 commits into main from add-evolla-rebase-main

Conversation

@zhoubay

zhoubay commented Feb 17, 2025

What does this PR do?

Fix issue #36231

Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts @qubvel @ArthurZucker

@zucchini-nlp
Member

@Cyrilvallez hey! Yeah, we have InstructBlip with 2 tokenizers, one for the inputs and another for the Q-Former module. We added custom from_pretrained/save_pretrained methods to handle the tokenizers. On the Hub, the other tokenizer is saved separately under qformer_tokenizer.
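
For reference, a minimal sketch of what this looks like from the user side (the checkpoint name is just an example, and the exact loading details are from memory rather than this thread, so treat it as indicative):

from transformers import AutoTokenizer, InstructBlipProcessor

# The processor's custom from_pretrained gathers both tokenizers; the Q-Former
# tokenizer is stored separately under "qformer_tokenizer" in the repo.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xl")

# Loading the second tokenizer directly from that location, for illustration only:
qformer_tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/instructblip-flan-t5-xl", subfolder="qformer_tokenizer"
)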

@zhoubay
Author

zhoubay commented Feb 19, 2025

Thank you @zucchini-nlp and @Cyrilvallez !

I've solved the problem by referring to the code from InstructBlip! However, even though I pass test_processor, I still fail tests_generate because of tests/models/video_llava/test_modeling_video_llava.py::VideoLlavaForConditionalGenerationModelTest::test_assisted_decoding_with_logits_to_keep, which passes in my environment.

What should I do to move this forward?

As for the modular issue @Cyrilvallez mentioned, is it compulsory to convert to that format? I'm reading the docs, and I'd be glad to apply it to reduce my duplicated code without hurting performance!

@zucchini-nlp
Member

Feel free to ignore tests from other models. The test failing for you was fixed yesterday, so rebasing might help. If there are other flaky tests with vision LLMs, let me know, I am working on reducing flakiness to a minimum :)

@zhoubay
Author

zhoubay commented Feb 19, 2025

@zucchini-nlp Thanks! As for InstructBlip, I've noticed a small fragility in its implementation.

In the referenced lines, they remove qformer_tokenizer and then append it back as the last item of self.attributes. However, this only works because qformer_tokenizer is originally the last entry of self.attributes, see here.

# We modify the attributes so that only the tokenizer and image processor are saved in the main folder
qformer_present = "qformer_tokenizer" in self.attributes
if qformer_present:
    self.attributes.remove("qformer_tokenizer")

outputs = super().save_pretrained(save_directory, **kwargs)

if qformer_present:
    self.attributes += ["qformer_tokenizer"]

This can cause an unexpected error if you change the order of self.attributes to, say, attributes = ["image_processor", "qformer_tokenizer", "tokenizer"] and at the same time change the signature of __init__ to def __init__(self, image_processor, qformer_tokenizer, tokenizer, num_query_tokens=None, **kwargs): (since the attributes should follow the same order as the __init__ arguments).
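
To make the failure mode concrete, here is a tiny sketch of what happens with that reordering (names follow the example above):

# With the reordered attributes, remove + append-at-end changes the relative order
attributes = ["image_processor", "qformer_tokenizer", "tokenizer"]
attributes.remove("qformer_tokenizer")
# ... super().save_pretrained(...) runs here ...
attributes += ["qformer_tokenizer"]
print(attributes)  # ['image_processor', 'tokenizer', 'qformer_tokenizer']
# no longer matches __init__(self, image_processor, qformer_tokenizer, tokenizer, ...)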

My proposal, as in my recent commit, is to add a variable that tracks where the removed attribute was, as shown below:

# we modify the attributes so that only the text tokenizer is saved in the main folder
protein_tokenizer_present = "protein_tokenizer" in self.attributes
# find the correct position of it in the attributes list
protein_tokenizer_index = self.attributes.index("protein_tokenizer") if protein_tokenizer_present else None
if protein_tokenizer_present and protein_tokenizer_index is not None:
    self.attributes.remove("protein_tokenizer")

outputs = super().save_pretrained(save_directory, **kwargs)

if protein_tokenizer_present and protein_tokenizer_index is not None:
    self.attributes.insert(protein_tokenizer_index, "protein_tokenizer")
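
A quick check that the original order is preserved with this approach (the attribute order here is just for illustration):

attributes = ["protein_tokenizer", "tokenizer"]
protein_tokenizer_index = attributes.index("protein_tokenizer")
attributes.remove("protein_tokenizer")
# ... super().save_pretrained(...) runs here ...
attributes.insert(protein_tokenizer_index, "protein_tokenizer")
print(attributes)  # ['protein_tokenizer', 'tokenizer'] -- original order preserved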

Hope this helps make vision LLMs more stable!

And I'll rebase onto the main branch and see how it goes. Thank you again!

@zucchini-nlp
Member

@zhoubay yeah, but we don't expect users to change the order of attributes unless they are doing some advanced experiments and adding new attributes. In that case we'll expect them to also change the custom saving and loading. Usually we just have a tokenizer + image processor, one for each attribute, so not much opportunity to experiment 😄

For Evolla feel free to do what you think works, I am not familiar with the model. You will get a review from Cyril on that later

@zhoubay
Author

zhoubay commented Feb 19, 2025

@zucchini-nlp Thank you for your information and your collaboration! I'm looking forward to it!

@Cyrilvallez
Member

Hey @zhoubay! Yes, we do enforce the use of modular now, especially as it is very much applicable in your case (a lot of code seems directly copy-pasted from other models)! In the end, it is a huge win-win scenario, as it greatly facilitates our review and maintenance process, and in turn ensures your model will get all potential future features/improvements we may roll out (which is otherwise not the case for all models, as models are supposed to stay more or less static over time) 🤗
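
To give a rough idea, a modular file is more or less of this shape (a minimal sketch with illustrative Evolla class names; only what actually differs from Llama needs to be written out):

# modular_evolla.py (illustrative)
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaDecoderLayer, LlamaModel


class EvollaConfig(LlamaConfig):
    model_type = "evolla"


class EvollaDecoderLayer(LlamaDecoderLayer):
    # inherited as-is: the converter copies the Llama implementation into the
    # auto-generated modeling file
    pass


class EvollaModel(LlamaModel):
    # override only the methods that actually differ; everything else is inherited
    pass

The auto-generated modeling/configuration files are then produced from this file by utils/modular_model_converter.py.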

Also, not sure what you did, but you now have more than 150 changed files! 🙃 You should revert to the previous state!

@zhoubay
Author

zhoubay commented Feb 19, 2025

Hi @Cyrilvallez! Thank you for the information! I'll convert my code to the modular format!

As for the file changes, I think most of them showed up because I just rebased my branch onto the current main branch. Could you give me some hints on how to undo it?

zhoubay force-pushed the add-evolla-rebase-main branch from 06ff185 to 9bf9d35 on February 19, 2025 at 10:28
@zhoubay
Author

zhoubay commented Feb 19, 2025

Hi @Cyrilvallez, I hope this helps! Sorry I messed up the commit history! I'll never, never, never rebase blindly again 😭

@Cyrilvallez
Member

No worries, we've all been there 😆 Let me know when you've integrated the modular files and are ready for a review!

@zhoubay
Author

zhoubay commented Feb 19, 2025

Sure! Thank you! I'll let you know when I finish integrating the modular files!

@Cyrilvallez
Member

Just a heads-up: if you create 2 modular files, it will likely not work right now, but #36287 will fix it! I will merge it ASAP.

@zhoubay
Author

zhoubay commented Feb 20, 2025

Hey @Cyrilvallez! I've got a small question about modular! Since a modular file relies on code from other models, what happens when that code is later changed in a way that alters the behavior?

For example, in my case I inherited the Llama code a long time ago, around transformers v4.30.2. At that time, the signature of LlamaDecoderLayer.forward had no position_embeddings, see here.

def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = False,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:

And at that time, decoder_layer did not accept position_embeddings as input (see here), because the position embeddings were computed inside LlamaAttention (see here).

  layer_outputs = decoder_layer(
      hidden_states,
      attention_mask=attention_mask,
      position_ids=position_ids,
      past_key_value=past_key_value,
      output_attentions=output_attentions,
      use_cache=use_cache,
  )

Nowadays, however, decoder_layer must accept position_embeddings as input (see here).

  layer_outputs = decoder_layer(
      hidden_states,
      attention_mask=causal_mask,
      position_ids=position_ids,
      past_key_value=past_key_values,
      output_attentions=output_attentions,
      use_cache=use_cache,
      cache_position=cache_position,
      position_embeddings=position_embeddings,
      **flash_attn_kwargs,
  )

My point is: in my case, I only need to implement EvollaModel.forward in a fine-grained way, so I copied all of the code of LlamaModel.forward as it was at that point in time, and I don't change anything in LlamaDecoderLayer, so I can inherit it.

So my code calling decoder_layer looks like the earlier version. Since these lines are implemented by me, they don't change when the Llama team changes implementation details.

  layer_outputs = decoder_layer(
      hidden_states,
      attention_mask=attention_mask,
      position_ids=position_ids,
      past_key_value=past_key_value,
      output_attentions=output_attentions,
      use_cache=use_cache,
  )

But over time, the Llama team changed the signature of decoder_layer so that position_embeddings became required (see here); because my code inherits their LlamaDecoderLayer, it would break and no longer work.

def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Cache] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = False,
    cache_position: Optional[torch.LongTensor] = None,
    position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
    **kwargs: Unpack[FlashAttentionKwargs],
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
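
If I follow the current convention, my call site would presumably need to become something like the following (a sketch based on the current LlamaModel code, where the (cos, sin) position embeddings are built once by the model's rotary embedding module and passed down to each layer):

  position_embeddings = self.rotary_emb(hidden_states, position_ids)

  layer_outputs = decoder_layer(
      hidden_states,
      attention_mask=causal_mask,
      position_ids=position_ids,
      past_key_value=past_key_values,
      output_attentions=output_attentions,
      use_cache=use_cache,
      cache_position=cache_position,
      position_embeddings=position_embeddings,
  )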

I don't know if I made myself clear; if you have any questions about what I said, please let me know!

Looking forward to your ideas on how to solve this issue! Thank you in advance!

@zhoubay
Author

zhoubay commented Feb 24, 2025

Hi @Cyrilvallez, I converted the Llama parts to modular and found a typo in the docs, as issue #36362 points out.

I fixed the typo in the docs; I hope this doesn't bother you!

Let me know if I should do more things!
