-
Notifications
You must be signed in to change notification settings - Fork 27.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix PretrainedTokenizerFast check #35835
base: main
Are you sure you want to change the base?
Fix PretrainedTokenizerFast check #35835
Conversation
@ArthurZucker @itazap Pretty sure the PR fixes a class string compare bug vs correct The existing check for class is instance of The only problem is we had to inject a code level import due to circular import. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I am not sure it's worth "fixing" as it does not cause any issues appart from "consistency" but deserializing works perfectly even if you dont'have tokenizers, while otherwise I am not sure this is guarenteed.
@@ -2456,8 +2456,10 @@ def save_pretrained( | |||
|
|||
# Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained | |||
tokenizer_class = self.__class__.__name__ | |||
# import here to prevent circular import error | |||
from .tokenization_utils_fast import PreTrainedTokenizerFast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would love to avoid this, it's pretty weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would love to avoid this, it's pretty weird
Another option is to remove this block of Fast
strip code completedly. Is this legacy code still needed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are now using different, clean, method to perform the same fix.
@ArthurZucker Please re-review. We are now using different, cleaner method to fix this bug and we do consider this a bug, and a rather severe bug as well: as the loading of re-saved tokenizer has fundamentally changed vs original. # Remove the Fast at the end if we can save the slow tokenizer
if tokenizer_class.endswith("Fast") and (
hasattr(self, "can_save_slow_tokenizer") and self.can_save_slow_tokenizer
): Some models like recently released Full Context: We found this bug after using GPTQModel to quantize various DeepSeek models and using EvalPlus to run benchmarks on it. EvalPlus always load slow tokenizer and pass Please execute the following simple reproducer to re-create the tokenizer load with import tempfile
from transformers import AutoTokenizer
# this model has no vocab file, use LlamaTokenizerFast
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
with tempfile.TemporaryDirectory() as temp_dir:
# 'tokenizer_class' in tokenizer_config.json is LlamaTokenizerFast now
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) # <-- works
# print(type(tokenizer)) => LlamaTokenizerFast
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False) # <-- works
# print(type(tokenizer)) => LlamaTokenizerFast
tokenizer.save_pretrained(temp_dir)
# 'tokenizer_class' in tokenizer_config.json is LlamaTokenizer now, and no vocab file saved
tokenizer = AutoTokenizer.from_pretrained(temp_dir, use_fast=False) # <-- crashes
# can't load the tokenizer because saved tokenizer path has no vocab file
Lastly, as we are not experts at tokenizer loading, this PR may also be fixing the correct bug at the wrong place. This possibility has not escaped me and I have a tingling feeling there is more to this bug than just this PR. |
@ArthurZucker After re-looking at the latest code changes, can we just remove these check and strip altogether? Downgrading the tokenizer class type does break things as we have detailed above. It just makes more sense to remove this block of code completely rather than patch-fixing? What do you think. |
@ArthurZucker Sorry to to flood but we checked again both tokenizer
Ref: elif config_tokenizer_class is not None:
tokenizer_class = None
if use_fast and not config_tokenizer_class.endswith("Fast"):
tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
if tokenizer_class is None:
tokenizer_class_candidate = config_tokenizer_class
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
if tokenizer_class is None:
raise ValueError(
f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
)
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) |
if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast": | ||
# Remove the Fast at the end if we can save the slow tokenizer | ||
if tokenizer_class.endswith("Fast") and ( | ||
hasattr(self, "can_save_slow_tokenizer") and self.can_save_slow_tokenizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have this check shorter with getattr
:
hasattr(self, "can_save_slow_tokenizer") and self.can_save_slow_tokenizer | |
getattr(self, "can_save_slow_tokenizer", False) |
FIX #35832
@Qubitium