Error in tiktoken integration example #36438

Open
1 of 4 tasks
kndtran opened this issue Feb 26, 2025 · 0 comments

kndtran commented Feb 26, 2025

System Info

  • transformers version: 4.49.0

  • Platform: Linux-5.14.0-427.42.1.el9_4.x86_64-x86_64-with-glibc2.34

  • Python version: 3.10.16

  • Huggingface_hub version: 0.29.1

  • Safetensors version: 0.5.3

  • Accelerate version: 1.4.0

  • Accelerate config: not found

  • DeepSpeed version: not installed

  • PyTorch version (GPU?): 2.6.0+cu124 (False)

  • Tensorflow version (GPU?): not installed (NA)

  • Flax version (CPU?/GPU?/TPU?): not installed (NA)

  • Jax version: not installed

  • JaxLib version: not installed

  • Using distributed or parallel set-up in script?: No

  • tiktoken==0.9.0

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

On the tiktoken integration docs page, running the example from the "Create tiktoken tokenizer" section raises a TypeError inside the convert_tiktoken_to_fast function.

Full example:

from pathlib import Path

from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import encoding_name_for_model

encoding = encoding_name_for_model("o1")
outdir = Path("tmp") / "tiktoken" / encoding
outdir.mkdir(parents=True, exist_ok=True)

convert_tiktoken_to_fast(encoding, outdir)

Raises this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 10
      7 outdir = Path("tmp") / "tiktoken" / encoding
      8 outdir.mkdir(parents=True, exist_ok=True)
---> 10 convert_tiktoken_to_fast(encoding, outdir)

File ~/miniforge3/envs/tok/lib/python3.10/site-packages/transformers/integrations/tiktoken.py:42, in convert_tiktoken_to_fast(encoding, output_dir)
     37 except ImportError:
     38     raise ValueError("`tiktoken` is required to save a `tiktoken` file. Install it with `pip install tiktoken`.")
     40 tokenizer = TikTokenConverter(
     41     vocab_file=save_file_absolute, pattern=encoding._pat_str, additional_special_tokens=encoding._special_tokens
---> 42 ).converted()
     43 tokenizer.save(output_file_absolute)

File ~/miniforge3/envs/tok/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:1630, in TikTokenConverter.converted(self)
   1623 tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
   1624     [
   1625         pre_tokenizers.Split(Regex(self.pattern), behavior="isolated", invert=False),
   1626         pre_tokenizers.ByteLevel(add_prefix_space=self.add_prefix_space, use_regex=False),
   1627     ]
   1628 )
   1629 tokenizer.decoder = decoders.ByteLevel()
-> 1630 tokenizer.add_special_tokens(self.additional_special_tokens)
   1632 tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
   1634 return tokenizer

TypeError: argument 'tokens': 'dict' object cannot be converted to 'PyList'
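
For what it's worth, the last frame points at the likely cause: convert_tiktoken_to_fast passes encoding._special_tokens, which tiktoken stores as a dict mapping token strings to ids, while the tokenizers add_special_tokens call expects a list. Below is a minimal Python sketch of the conversion that seems to be missing. It is not the maintainers' fix; the helper name special_tokens_as_list is hypothetical, and the sample dict is illustrative data standing in for encoding._special_tokens.

```python
def special_tokens_as_list(special_tokens: dict[str, int]) -> list[str]:
    """Return the special-token strings ordered by their token id."""
    # Sorting by id keeps the result stable regardless of dict insertion order.
    return [token for token, _ in sorted(special_tokens.items(), key=lambda item: item[1])]


# Illustrative stand-in for encoding._special_tokens (token string -> id).
sample_special_tokens = {"<|endofprompt|>": 200018, "<|endoftext|>": 199999}

token_list = special_tokens_as_list(sample_special_tokens)
print(token_list)
```

A list of strings like this is the shape add_special_tokens accepts. Whether ordering by id matters to the converter is an assumption I have not verified.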

Expected behavior

No error. A tokenizer.json is written to the directory and can be loaded into a PreTrainedTokenizerFast object as in the example.

@kndtran kndtran added the bug label Feb 26, 2025