Anthropic models #27

Closed
enricoros opened this issue Apr 8, 2023 · 4 comments

Comments

@enricoros
Anthropic has released the models for research, and has opened their code on GitHub:
https://github.com/anthropics/anthropic-sdk-python/blob/main/anthropic/tokenizer.py

In this repo, there's a link to a file:
CLAUDE_TOKENIZER_REMOTE_FILE = "https://public-json-tokenization-0d8763e8-0d7e-441b-a1e2-1c73b8e79dc3.storage.googleapis.com/claude-v1-tokenization.json"

Can this help in extending Tiktoken to support 'claude-v1' models?

@dqbd
Owner

dqbd commented Apr 8, 2023

Hi @enricoros!

Looking at the tokenizer.py file, it seems they are using huggingface/tokenizers, which already has Node.js bindings: https://www.npmjs.com/package/tokenizers.

Here is some example code using huggingface/tokenizers:

const util = require("util");
const { Tokenizer } = require("tokenizers");

// Load the tokenizer definition from a local copy of the JSON file.
const tokenizer = Tokenizer.fromFile("claude-v1-tokenization.json");

// encode/decode take Node-style callbacks here, so wrap them in promises.
const encode = util.promisify(tokenizer.encode.bind(tokenizer));
const decode = util.promisify(tokenizer.decode.bind(tokenizer));

async function main() {
  const encoded = await encode("Hello from Anthropic!");
  console.log({ encoded: encoded.getIds() });

  const decoded = await decode(
    encoded.getIds(),
    true // skipSpecialTokens: true
  );

  console.log({ decoded });
}

// Surface load or encode failures instead of leaving the rejection unhandled.
main().catch(console.error);
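
The example above expects `claude-v1-tokenization.json` to already exist on disk. Below is a minimal sketch of one way to fetch it first, assuming Node.js 18+ (for the global `fetch`) and assuming the URL quoted in the issue description is still reachable, which has not been verified. `downloadTokenizerFile` is a hypothetical helper name, not part of any library.

```javascript
const fs = require("fs/promises");

// URL copied from the issue description; it may no longer be reachable.
const CLAUDE_TOKENIZER_REMOTE_FILE =
  "https://public-json-tokenization-0d8763e8-0d7e-441b-a1e2-1c73b8e79dc3.storage.googleapis.com/claude-v1-tokenization.json";

// Hypothetical helper: download a tokenizer JSON file and save it to destPath.
// fetchImpl defaults to the global fetch (Node.js 18+) and can be swapped out.
async function downloadTokenizerFile(url, destPath, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Download failed with HTTP status ${response.status}`);
  }
  const text = await response.text();
  JSON.parse(text); // throws early if the payload is not valid JSON
  await fs.writeFile(destPath, text, "utf8");
  return destPath;
}

// Usage (performs a network request):
// downloadTokenizerFile(CLAUDE_TOKENIZER_REMOTE_FILE, "claude-v1-tokenization.json");
```

The `fetchImpl` parameter is only there so the helper can be exercised without network access; in normal use it can be omitted.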

However, huggingface/tokenizers does seem to have some issues supporting newer Node.js versions and/or arm64. I will look into whether there is some overlap between tiktoken and the default tokenizer.

huggingface/tokenizers#911
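
One way to start judging that overlap is to look at what the JSON file itself declares. The sketch below assumes the file follows the usual `tokenizer.json` layout written by huggingface/tokenizers, with the model under `model` (`type`, `vocab`, `merges`) and the pre-tokenizer under `pre_tokenizer`. Whether the Claude file matches this layout has not been checked here. `summarizeTokenizerConfig` and the sample object are made up for illustration.

```javascript
// Hypothetical helper: summarize a parsed tokenizer.json-style object.
function summarizeTokenizerConfig(config) {
  const model = config.model ?? {};
  return {
    modelType: model.type ?? null,
    vocabSize: model.vocab ? Object.keys(model.vocab).length : 0,
    mergeCount: Array.isArray(model.merges) ? model.merges.length : 0,
    preTokenizerType: config.pre_tokenizer?.type ?? null,
  };
}

// Tiny made-up config for illustration; not taken from the Claude file.
const sample = {
  model: { type: "BPE", vocab: { a: 0, b: 1, ab: 2 }, merges: ["a b"] },
  pre_tokenizer: { type: "ByteLevel" },
};

console.log(summarizeTokenizerConfig(sample));
```

For the real file, the same function could be called on `JSON.parse(fs.readFileSync("claude-v1-tokenization.json", "utf8"))`. A `modelType` of `"BPE"` would suggest the vocabulary and merges could in principle be mapped onto a tiktoken-style rank table, though that mapping is not shown here.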

@darknoon
It's cool that, afaict, the core is also written in Rust.
I could see this following a similar pattern to what we have with tiktoken :D

@dqbd
Owner

dqbd commented Apr 12, 2023

@enricoros Some progress (with experimental JSON configs for @dqbd/tiktoken) can be seen here: dqbd/tiktokenizer#5

Demo of Tiktokenizer playground: https://tiktokenizer-git-custom-bpe-models-dqbd.vercel.app/

@enricoros
Author

Very interesting approach, and I love the playground too. Thanks for the update!
I believe this bug can be closed now, as you got it to work!

dqbd closed this as completed on Aug 9, 2023