Feature: Allow for overriding pad_token_id #15
Description
Currently, the library selects `eos` when no pad token is defined in the tokenizer. This negatively impacts finetuning some Llama models such as Mistral-Instruct, since it uses `eos` to end its answer in a chat setting. I believe this is because when `eos` is the pad token, it gets ignored (attention mask set to 0), and the model may forget to emit `eos` as training progresses. To fix this, we can now override the pad token with anything we wish. In my case, I have something like this:
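Sketching the idea with the plain Hugging Face transformers tokenizer API (the model name and the choice of `unk` as the replacement pad token are only illustrative; the actual value is whatever id you pass through the new override option):

```python
from transformers import AutoTokenizer

# Illustrative only: Mistral-Instruct ships without a pad token, so libraries
# often fall back to reusing eos for padding.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
print(tokenizer.pad_token)     # None
print(tokenizer.eos_token)     # "</s>", which also marks the end of an answer

# Override the pad token with something that never terminates an answer,
# e.g. the unk token; padded positions then no longer collide with eos.
tokenizer.pad_token = tokenizer.unk_token
print(tokenizer.pad_token_id)  # 0 for Mistral's "<unk>"
```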
This does not update the tokenizer of the final model, and I noticed positive results during inference compared to training without it. I hope this PR finds you well, and thank you for such an awesome library; it's now my daily driver!
Type of Change
Checklist
`make codestyle`.