Diffenator word lists #11

Draft
wants to merge 3 commits into base: main

RickyDaMa (Collaborator) commented Jan 24, 2025

Related to #3

Uses include-flate to minimise binary bloat, but the size increase is gonna hurt regardless (and it probably means we can't publish on crates.io if we want the data baked in)

@RickyDaMa RickyDaMa requested a review from Hoolean January 24, 2025 17:37
@RickyDaMa RickyDaMa self-assigned this Jan 24, 2025
m4rc1e (Collaborator) commented Jan 24, 2025

I'm toying with the idea of including these wordlists in a separate repo since they're proving to be pretty valuable. I'd also like to include a small lib so we can generate sample texts using different params or count letter pairs etc.
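
As a rough illustration of the "count letter pairs" idea, here is a minimal Rust sketch. The function name `count_letter_pairs` and its signature are hypothetical and not part of any existing library. It assumes pairs are counted within words only and works on Unicode scalar values (`char`), so combining marks and grapheme clusters would need extra handling.

```rust
use std::collections::HashMap;

/// Counts adjacent letter pairs within each word of a word list.
/// Pairs never span a word boundary, and any non-alphabetic
/// character (hyphen, apostrophe, digit) breaks the current pair.
fn count_letter_pairs<'a>(
    words: impl IntoIterator<Item = &'a str>,
) -> HashMap<(char, char), usize> {
    let mut counts = HashMap::new();
    for word in words {
        let mut prev: Option<char> = None;
        for ch in word.chars() {
            if ch.is_alphabetic() {
                if let Some(p) = prev {
                    *counts.entry((p, ch)).or_insert(0) += 1;
                }
                prev = Some(ch);
            } else {
                // Reset so "a-b" does not produce the pair (a, b).
                prev = None;
            }
        }
    }
    counts
}
```

Calling `count_letter_pairs(["hello", "help"])` would record the pair `('h', 'e')` twice, which is the sort of frequency data a sample-text generator could use to pick words.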

RickyDaMa (Collaborator, Author) commented Jan 27, 2025

> I'm toying with the idea of including these wordlists in a separate repo since they're proving to be pretty valuable. I'd also like to include a small lib so we can generate sample texts using different params or count letter pairs etc.

That would be super valuable for Python & Rust. The biggest problem I think you'd come across is how to distribute the data with the lib: neither PyPI nor crates.io wants to be a CDN for big arbitrary data. youseedee solves this at runtime (but has suffered from numerous data race issues), and particularly in Rust I'd be interested in a compile-time solution (which should be doable with a build script). Would happily contribute to that 😄
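
A compile-time approach along these lines could look roughly like the sketch below. It is only an illustration: the helper `render_word_table`, the file `data/latin.txt`, and the static name `LATIN_WORDS` are all assumptions, and how the word list reaches the disk in the first place (vendored, downloaded, git submodule) is deliberately left open. The helper turns a newline-separated word list into Rust source that a `build.rs` could write into Cargo's `OUT_DIR`.

```rust
// Sketch of the core of a build.rs: turn a plain-text word list
// into Rust source at compile time.
use std::fmt::Write as _;

/// Renders `pub static NAME: &[&str] = &[...];` from newline-separated words.
/// Blank lines are skipped; `{:?}` escapes quotes and backslashes so each
/// entry is a valid Rust string literal.
fn render_word_table(name: &str, raw: &str) -> String {
    let mut out = format!("pub static {name}: &[&str] = &[\n");
    for word in raw.lines().map(str::trim).filter(|w| !w.is_empty()) {
        writeln!(out, "    {word:?},").expect("writing to a String cannot fail");
    }
    out.push_str("];\n");
    out
}

// In a real build.rs, `main` would do roughly this (paths are assumptions):
//
// fn main() {
//     let raw = std::fs::read_to_string("data/latin.txt").unwrap();
//     let out_dir = std::env::var("OUT_DIR").unwrap();
//     let dest = std::path::Path::new(&out_dir).join("words.rs");
//     std::fs::write(dest, render_word_table("LATIN_WORDS", &raw)).unwrap();
//     println!("cargo:rerun-if-changed=data/latin.txt");
// }
//
// The library would then pull the generated table in with:
//
// include!(concat!(env!("OUT_DIR"), "/words.rs"));
```

This variant bakes the words in uncompressed; a build script could equally emit a compressed blob, trading binary size against decompression work at runtime.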

Also in Rust you have trade-offs to consider in terms of binary size, e.g. storing compressed strings and lazily decompressing them at runtime (as is being done here and in diffenator3). With a dedicated crate you could put more effort into using the 'best' compression scheme available and/or offering an uncompressed option
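
The "store compressed, decompress lazily" pattern can be sketched with the standard library alone. To keep the example dependency-free, it swaps in front coding (each entry stores how many leading bytes it shares with the previous word, plus the remaining suffix) as a stand-in for the DEFLATE compression that include-flate uses. The data, `unpack`, and `words` are hypothetical names for illustration, and the sketch assumes every shared-prefix length falls on a UTF-8 character boundary.

```rust
use std::sync::OnceLock;

/// Front-coded entries: (bytes shared with the previous word, remaining suffix).
/// Hypothetical sample data; a real crate would bake in a compressed blob.
static PACKED: &[(usize, &str)] = &[
    (0, "hamburg"),
    (7, "er"),
    (3, "let"),
    (0, "type"),
    (4, "face"),
];

/// Expands front-coded entries back into full words.
/// Panics if `shared` exceeds the previous word's length or splits a character.
fn unpack(packed: &[(usize, &str)]) -> Vec<String> {
    let mut words: Vec<String> = Vec::with_capacity(packed.len());
    for &(shared, suffix) in packed {
        let mut word = match words.last() {
            Some(prev) => prev[..shared].to_owned(),
            None => String::new(),
        };
        word.push_str(suffix);
        words.push(word);
    }
    words
}

/// Decompresses on the first call, then returns the cached list.
/// `OnceLock` makes the one-time initialisation thread-safe.
fn words() -> &'static [String] {
    static CACHE: OnceLock<Vec<String>> = OnceLock::new();
    CACHE.get_or_init(|| unpack(PACKED))
}
```

Callers that never touch `words()` pay no decompression cost, and later calls reuse the cached allocation. An "uncompressed option" would simply expose a plain `&[&str]` static instead, at the price of a larger binary.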

2 participants