tokenizer.apply_chat_template fails with continue_final_message=True when the input has trailing spaces #35433

chuyishang opened this issue Dec 27, 2024 · 6 comments · May be fixed by #36404

chuyishang commented Dec 27, 2024

System Info

As the title says, tokenizer.apply_chat_template fails when the input has trailing spaces, observed with Llama-3.1-Instruct.

If the last assistant message has a trailing space, such as
{'role': 'assistant', 'content': 'some text '}

and continue_final_message is True, apply_chat_template throws "ValueError: substring not found".

This is because the apply_chat_template function contains the line
rendered_chat = rendered_chat[: rendered_chat.rindex(final_message) + len(final_message)].rstrip()

but rendered_chat ends with "some text<|eot_id|>" (the template has stripped the trailing space), while final_message still contains the trailing space: "some text ", so the rindex call fails.
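
A minimal illustration of the failure using plain strings (the rendered text below only approximates the Llama 3.1 template output, which strips the message content):

final_message = "some text "  # last assistant message, trailing space intact
rendered_chat = "<|start_header_id|>assistant<|end_header_id|>\n\nsome text<|eot_id|>"

# The truncation step apply_chat_template performs when continue_final_message=True:
rendered_chat = rendered_chat[: rendered_chat.rindex(final_message) + len(final_message)].rstrip()
# ValueError: substring not found -- "some text " (with the space) never occurs
# in the rendered string.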

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See above
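
A minimal reproduction sketch (assuming the meta-llama/Llama-3.1-8B-Instruct checkpoint; any Llama 3.1 Instruct tokenizer should behave the same):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "Write me a haiku."},
    {"role": "assistant", "content": "some text "},  # note the trailing space
]

# Raises ValueError: substring not found on affected versions, because the
# template strips the trailing space from the rendered chat while the final
# message passed to rindex() still contains it.
tokenizer.apply_chat_template(messages, tokenize=False, continue_final_message=True)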

Expected behavior

I expect apply_chat_template to continue the final message even when it ends with a trailing space.

@LysandreJik (Member)

cc @Rocketknight1

@Rocketknight1 (Member)

Hi @chuyishang, this should be fixed now! Please try updating to the latest version, or installing from main if that doesn't work.

TK-21st commented Feb 22, 2025

@Rocketknight1
This fix introduced a new issue.

When using a multi-round few-shot prompt such as

[
    {"role": "user", "content": "hi 0"},
    {"role": "assistant", "content": "bye: 0"},
    {"role": "user", "content": "hi 1"},
    {"role": "assistant", "content": "bye: "}
]

the rendered chat from tokenizer.apply_chat_template(messages, tokenize=False, continue_final_message=True, add_generation_prompt=False) will completely remove the last round if the chat template strips the text, giving you

<bos><start_of_turn>user
hi 0<end_of_turn>
<start_of_turn>model
bye: 

This happens because rindex only finds the final message "bye: " inside the penultimate round ("bye: 0"), since the last round has been rendered as just "bye:" (no trailing space).

I'm not sure what the best way to remedy this would be. Maybe max(rindex(final_message), rindex(final_message.strip()))?
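
A rough sketch of that idea, purely to illustrate the suggestion (not the fix eventually adopted in #36404): truncate at whichever match ends later, the final message as written or its stripped form.

# Assumes at least one of the two forms occurs in rendered_chat.
candidates = [final_message, final_message.strip()]
end = max(
    rendered_chat.rindex(m) + len(m)
    for m in candidates
    if m in rendered_chat
)
rendered_chat = rendered_chat[:end]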

@Rocketknight1 (Member)

@TK-21st yes, you're completely correct and I hadn't considered that! Working on a high-priority fix now

@Rocketknight1 Rocketknight1 reopened this Feb 25, 2025
@Rocketknight1 Rocketknight1 linked a pull request Feb 25, 2025 that will close this issue
@Rocketknight1 (Member)

Hi @TK-21st, I've opened a PR to fix this at #36404. After investigation, the error you've found is very template-specific. Can you tell me what template you were using? I'll add a test using it to the transformers codebase so we don't regress on this

TK-21st commented Feb 25, 2025


Great, thanks!
I was using google/gemma-2-2b-it
