tokenizer.apply_chat_template fails with continue_final_message=True when the input has trailing spaces #35433

chuyishang opened this issue Dec 27, 2024 · 6 comments · May be fixed by #36404

chuyishang commented Dec 27, 2024

System Info

As the title says, tokenizer.apply_chat_template fails when the input has trailing spaces, observed with Llama-3.1-Instruct.

If the last assistant message has a trailing space, such as
{'role': 'assistant', 'content': 'some text '}

and continue_final_message is True, apply_chat_template throws "ValueError: substring not found".

This is because the apply_chat_template function contains the line
rendered_chat = rendered_chat[: rendered_chat.rindex(final_message) + len(final_message)].rstrip()

but rendered_chat ends with "some text<|eot_id|>" (the template has stripped the trailing space), while final_message still contains the trailing space: "some text ", so the rindex call fails.
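
A minimal illustration of the failure using plain strings (the rendered text below only approximates the Llama 3.1 template output, which strips the message content):

final_message = "some text "  # last assistant message, trailing space intact
rendered_chat = "<|start_header_id|>assistant<|end_header_id|>\n\nsome text<|eot_id|>"

# The truncation step apply_chat_template performs when continue_final_message=True:
rendered_chat = rendered_chat[: rendered_chat.rindex(final_message) + len(final_message)].rstrip()
# ValueError: substring not found -- "some text " (with the space) never occurs
# in the rendered string.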

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See above
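
A minimal reproduction sketch (assuming the meta-llama/Llama-3.1-8B-Instruct checkpoint; any Llama 3.1 Instruct tokenizer should behave the same):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "Write me a haiku."},
    {"role": "assistant", "content": "some text "},  # note the trailing space
]

# Raises ValueError: substring not found on affected versions, because the
# template strips the trailing space from the rendered chat while the final
# message passed to rindex() still contains it.
tokenizer.apply_chat_template(messages, tokenize=False, continue_final_message=True)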

Expected behavior

I expect apply_chat_template to continue the final message even when it ends with a trailing space.

@LysandreJik (Member)

cc @Rocketknight1

@Rocketknight1 (Member)

Hi @chuyishang, this should be fixed now! Please try updating to the latest version, or installing from main if that doesn't work.

TK-21st commented Feb 22, 2025

@Rocketknight1
This fix introduced a new issue.

When using a multi-round few-shot prompt such as

[
    {"role": "user", "content": "hi 0"},
    {"role": "assistant", "content": "bye: 0"},
    {"role": "user", "content": "hi 1"},
    {"role": "assistant", "content": "bye: "}
]

the rendered chat from tokenizer.apply_chat_template(messages, tokenize=False, continue_final_message=True, add_generation_prompt=False) will completely remove the last round if the chat template strips the text, giving you

<bos><start_of_turn>user
hi 0<end_of_turn>
<start_of_turn>model
bye: 

This happens because rindex only finds the final message "bye: " inside the penultimate round ("bye: 0"), since the last round has been rendered as just "bye:" (no trailing space).

I'm not sure what the best way to remedy this would be. Maybe max(rindex(final_message), rindex(final_message.strip()))?
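
A rough sketch of that idea, purely to illustrate the suggestion (not the fix eventually adopted in #36404): truncate at whichever match ends later, the final message as written or its stripped form.

# Assumes at least one of the two forms occurs in rendered_chat.
candidates = [final_message, final_message.strip()]
end = max(
    rendered_chat.rindex(m) + len(m)
    for m in candidates
    if m in rendered_chat
)
rendered_chat = rendered_chat[:end]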

@Rocketknight1 (Member)

@TK-21st yes, you're completely correct and I hadn't considered that! Working on a high-priority fix now

@Rocketknight1 Rocketknight1 reopened this Feb 25, 2025
@Rocketknight1 Rocketknight1 linked a pull request Feb 25, 2025 that will close this issue
@Rocketknight1 (Member)

Hi @TK-21st, I've opened a PR to fix this at #36404. After investigation, the error you've found is very template-specific. Can you tell me what template you were using? I'll add a test using it to the transformers codebase so we don't regress on this

TK-21st commented Feb 25, 2025


Great, thanks!
I was using google/gemma-2-2b-it
