Error in tiktoken integration example #36438

kndtran · 2025-02-26T21:12:01Z

System Info

transformers version: 4.49.0
Platform: Linux-5.14.0-427.42.1.el9_4.x86_64-x86_64-with-glibc2.34
Python version: 3.10.16
Huggingface_hub version: 0.29.1
Safetensors version: 0.5.3
Accelerate version: 1.4.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (GPU?): 2.6.0+cu124 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: No
tiktoken==0.9.0

Who can help?

@ArthurZucker @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The example in the "Create tiktoken tokenizer" section of the tiktoken integration docs page raises a TypeError inside the convert_tiktoken_to_fast function.

Full example:

from pathlib import Path

from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import encoding_name_for_model

encoding = encoding_name_for_model("o1")
outdir = Path("tmp") / "tiktoken" / encoding
outdir.mkdir(parents=True, exist_ok=True)

convert_tiktoken_to_fast(encoding, outdir)

Raises this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 10
      7 outdir = Path("tmp") / "tiktoken" / encoding
      8 outdir.mkdir(parents=True, exist_ok=True)
---> 10 convert_tiktoken_to_fast(encoding, outdir)

File ~/miniforge3/envs/tok/lib/python3.10/site-packages/transformers/integrations/tiktoken.py:42, in convert_tiktoken_to_fast(encoding, output_dir)
     37 except ImportError:
     38     raise ValueError("`tiktoken` is required to save a `tiktoken` file. Install it with `pip install tiktoken`.")
     40 tokenizer = TikTokenConverter(
     41     vocab_file=save_file_absolute, pattern=encoding._pat_str, additional_special_tokens=encoding._special_tokens
---> 42 ).converted()
     43 tokenizer.save(output_file_absolute)

File ~/miniforge3/envs/tok/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:1630, in TikTokenConverter.converted(self)
   1623 tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
   1624     [
   1625         pre_tokenizers.Split(Regex(self.pattern), behavior="isolated", invert=False),
   1626         pre_tokenizers.ByteLevel(add_prefix_space=self.add_prefix_space, use_regex=False),
   1627     ]
   1628 )
   1629 tokenizer.decoder = decoders.ByteLevel()
-> 1630 tokenizer.add_special_tokens(self.additional_special_tokens)
   1632 tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
   1634 return tokenizer

TypeError: argument 'tokens': 'dict' object cannot be converted to 'PyList'
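
Judging from the traceback, `encoding._special_tokens` appears to be a dict (token string mapped to token id), while `tokenizer.add_special_tokens` expects a list of tokens. Below is a minimal sketch of the conversion that would avoid the type mismatch. The helper name `special_tokens_as_list` is hypothetical (it is not part of transformers or tiktoken), and the ids in the example mapping are illustrative only.

```python
def special_tokens_as_list(special_tokens):
    """Return special-token strings as a list.

    Hypothetical helper for illustration; not part of transformers or tiktoken.
    Accepts either a dict of {token: id} or any iterable of token strings.
    """
    if isinstance(special_tokens, dict):
        # Order by token id so the resulting list is deterministic.
        return [tok for tok, _ in sorted(special_tokens.items(), key=lambda kv: kv[1])]
    return list(special_tokens)


# Example mapping shaped like a special-token dict (ids are illustrative).
example = {"<|endofprompt|>": 200018, "<|endoftext|>": 199999}
tokens = special_tokens_as_list(example)
print(tokens)
```

If this reading is correct, passing a list like `tokens` instead of the raw dict to `add_special_tokens` should get past the `'dict' object cannot be converted to 'PyList'` error, though I have not verified that the resulting tokenizer assigns the same ids as tiktoken.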

Expected behavior

No error. A tokenizer.json is written to the directory and can be loaded into a PreTrainedTokenizerFast object as in the example.

kndtran added the bug label Feb 26, 2025
