System Info

transformers version: 4.49.0
Platform: Linux-5.14.0-427.42.1.el9_4.x86_64-x86_64-with-glibc2.34
Python version: 3.10.16
Huggingface_hub version: 0.29.1
Safetensors version: 0.5.3
Accelerate version: 1.4.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (GPU?): 2.6.0+cu124 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: No
tiktoken==0.9.0
Who can help?

@ArthurZucker @itazap
Reproduction

On the tiktoken integration page, the "Create tiktoken tokenizer" section raises an error inside the `convert_tiktoken_to_fast` function.
Full example:
```python
from pathlib import Path

from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import encoding_name_for_model

encoding = encoding_name_for_model("o1")
outdir = Path("tmp") / "tiktoken" / encoding
outdir.mkdir(parents=True, exist_ok=True)

convert_tiktoken_to_fast(encoding, outdir)
```
Raises this error:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 10
      7 outdir = Path("tmp") / "tiktoken" / encoding
      8 outdir.mkdir(parents=True, exist_ok=True)
---> 10 convert_tiktoken_to_fast(encoding, outdir)

File ~/miniforge3/envs/tok/lib/python3.10/site-packages/transformers/integrations/tiktoken.py:42, in convert_tiktoken_to_fast(encoding, output_dir)
     37 except ImportError:
     38     raise ValueError("`tiktoken` is required to save a `tiktoken` file. Install it with `pip install tiktoken`.")
     40 tokenizer = TikTokenConverter(
     41     vocab_file=save_file_absolute, pattern=encoding._pat_str, additional_special_tokens=encoding._special_tokens
---> 42 ).converted()
     43 tokenizer.save(output_file_absolute)

File ~/miniforge3/envs/tok/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:1630, in TikTokenConverter.converted(self)
   1623 tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
   1624     [
   1625         pre_tokenizers.Split(Regex(self.pattern), behavior="isolated", invert=False),
   1626         pre_tokenizers.ByteLevel(add_prefix_space=self.add_prefix_space, use_regex=False),
   1627     ]
   1628 )
   1629 tokenizer.decoder = decoders.ByteLevel()
-> 1630 tokenizer.add_special_tokens(self.additional_special_tokens)
   1632 tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
   1634 return tokenizer

TypeError: argument 'tokens': 'dict' object cannot be converted to 'PyList'
```
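For context on the failure mode: tiktoken's `Encoding._special_tokens` is a dict mapping each special-token string to its id, while `tokenizers`' `Tokenizer.add_special_tokens` expects a list of tokens, hence the `'dict' object cannot be converted to 'PyList'` error above. A minimal pure-Python sketch of the shape mismatch (the token ids below are illustrative, not taken from a real encoding):

```python
# What tiktoken exposes: a dict of {special_token_string: token_id}.
# (Token ids here are illustrative placeholders.)
special_tokens = {"<|endoftext|>": 199999, "<|endofprompt|>": 200018}

# What tokenizers' Tokenizer.add_special_tokens expects: a list of token
# strings (or AddedToken objects). Iterating the dict yields only its keys,
# which is the list-shaped input the converter would need to pass instead.
tokens_as_list = list(special_tokens)
print(tokens_as_list)  # ['<|endoftext|>', '<|endofprompt|>']
```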
Expected behavior

No error. A `tokenizer.json` is written to the directory and can be loaded into a `PreTrainedTokenizerFast` object as in the example.