Replace gpt3-tokenizer with js-tiktoken for Enhanced Tokenization Performance #112

Merged on Sep 14, 2024 (5 commits)

@Anajrim01 (Contributor) commented Sep 12, 2024

This PR replaces the existing gpt3-tokenizer with js-tiktoken, which was faster in my benchmarks. I also compared js-tiktoken's stock BPE merging against a heap-sorting variant; the stock BPE implementation performed better.

Benchmarks:
Screenshots of benchmark results are attached, providing a comprehensive comparison between the methods tested.
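
For readers who want to reproduce a comparison like this, here is a minimal timing harness sketch. It is not the benchmark code used for the screenshots, which the PR does not include. `TokenCounter`, `benchmark`, and `countByWhitespace` are hypothetical names, and the whitespace counter is only a dependency-free stand-in for a real tokenizer.

```typescript
// Hypothetical harness: these names are illustrative and are not part of
// this PR, gpt3-tokenizer, or js-tiktoken.
type TokenCounter = (text: string) => number;

interface BenchResult {
  name: string;
  runs: number;
  avgMs: number;
  tokens: number;
}

function benchmark(
  name: string,
  count: TokenCounter,
  text: string,
  runs = 100
): BenchResult {
  // Warm up once so lazy initialisation is not included in the timing.
  let tokens = count(text);
  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    tokens = count(text);
  }
  const avgMs = (performance.now() - start) / runs;
  return { name, runs, avgMs, tokens };
}

// Stand-in counter so the sketch runs without dependencies.
// With js-tiktoken installed, a real counter could look roughly like:
//   import { getEncoding } from "js-tiktoken";
//   const enc = getEncoding("cl100k_base");
//   const countWithTiktoken: TokenCounter = (t) => enc.encode(t).length;
const countByWhitespace: TokenCounter = (text) =>
  text.split(/\s+/).filter(Boolean).length;

const sample = "The quick brown fox jumps over the lazy dog. ".repeat(200);
const result = benchmark("whitespace", countByWhitespace, sample, 50);
console.log(
  `${result.name}: ${result.avgMs.toFixed(4)} ms/run, ${result.tokens} tokens`
);
```

Passing two different `TokenCounter` implementations over the same input text gives directly comparable averages; absolute numbers will vary by machine and runtime.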

Key Changes:

  • Refactor codebase to utilize js-tiktoken for tokenization.
  • Update dependencies to their latest versions to ensure compatibility and stability.
  • Implement and test faster tokenization methods.
  • Benchmark heap sorting versus BPE sorting through js-tiktoken as described in this PR.
  • Based on benchmarks, BPE sorting is retained due to its superior performance over heap sorting.
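
As background on the comparison in the list above, the following toy sketch shows the generic BPE merge loop: repeatedly find the adjacent pair with the lowest merge rank and join it. It is an assumption-laden illustration, not js-tiktoken's code and not the heap variant benchmarked in this PR; `bpeMerge` and `toyRanks` are hypothetical names, and the ranks are invented.

```typescript
// Toy merge table: a lower rank means the pair is merged earlier.
// These ranks are invented for illustration and do NOT come from any real
// encoding such as cl100k_base.
const toyRanks = new Map<string, number>([
  ["lo", 0],
  ["low", 1],
  ["er", 2],
  ["st", 3],
]);

// Generic BPE merge using a linear rescan on every iteration.
function bpeMerge(word: string, ranks: Map<string, number>): string[] {
  const parts = Array.from(word); // start from single characters
  while (parts.length > 1) {
    let bestIdx = -1;
    let bestRank = Infinity;
    // Scan every adjacent pair and remember the lowest-ranked one.
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(`${parts[i]}${parts[i + 1]}`);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIdx = i;
      }
    }
    if (bestIdx === -1) break; // no mergeable pair left
    // Replace the two pieces with their concatenation.
    parts.splice(bestIdx, 2, `${parts[bestIdx]}${parts[bestIdx + 1]}`);
  }
  return parts;
}

console.log(bpeMerge("lower", toyRanks)); // [ 'low', 'er' ]
console.log(bpeMerge("lowest", toyRanks)); // [ 'low', 'e', 'st' ]
```

A heap-based variant would keep candidate pairs in a priority queue instead of rescanning. Whether that pays off depends on piece length and bookkeeping overhead, which is the kind of question the benchmarks here were meant to answer.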

To-Do:

  • Refactor code to integrate js-tiktoken.
  • Update project dependencies to the latest versions.
  • Implement faster tokenization using js-tiktoken.
  • Conduct benchmarks comparing heap sorting with BPE sorting.
  • Fine-tune and ensure optimal performance of BPE sorting (no further action required on heap sorting based on current benchmarks).

This PR should also fix #8, addressing current performance concerns and paving the way for future optimizations.

Attachments:

  • Benchmark screenshots comparing the old tokenizer (gpt3-tokenizer) with js-tiktoken, along with a comparison of BPE versus heap sorting that shows BPE performing better.

[Screenshot: Benchmark Results - gpt3-tokenizer vs js-tiktoken]

  • Conclusion

[Screenshot: conclusion_avg]

Please review the attached documentation and benchmark results to provide feedback or suggest further enhancements.

vercel bot commented Sep 12, 2024

@Anajrim01 is attempting to deploy a commit to the ShipBit Team on Vercel.

A member of the Team first needs to authorize it.

vercel bot commented Sep 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name | Status | Preview | Comments | Updated (UTC)
slickgpt | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Sep 14, 2024 9:24am

@Anajrim01 changed the title from "Use faster more correct tokenizer to calculate tokens" to "Replace gpt3-tokenizer with js-tiktoken for Enhanced Tokenization Performance" on Sep 12, 2024
@Anajrim01 marked this pull request as ready for review on September 12, 2024 22:12
@Shackless merged commit ea77a7b into ShipBit:main on Sep 14, 2024
1 check passed
@Shackless (Contributor) commented

Awesome! Back in the day when I built this, there weren't many client-side libs out there to calculate tokens. The larger ones required a full node environment but it's nice to see that there are better options now.

Successfully merging this pull request may close these issues.

Improve token cost calculation (+performance)