Define scope of tooling and work to improve for that scope #47

Open
MerlijnWajer opened this issue May 7, 2022 · 0 comments

Comments

@MerlijnWajer
Collaborator

Right now the naming of the tools is a bit confusing. The main tool is called "recode_pdf", but it doesn't really recode PDFs: it creates PDFs, inserts text layers, and performs MRC compression.

Since I am working on adding a tool that actually recodes existing PDFs (MRC compressing them, and not doing anything else for starters), it might make sense to think about renaming the tools, and also to define what each tool ought to do.

I think there are a few scenarios:

  • Given a set of images (and hOCR results), create a (compressed) PDF - like what ocrmypdf does.
  • Given an input PDF with just one image per page, do what the above step does.
  • Given an uncompressed PDF, compress (recode) the PDF. Optional features here are to (1) insert a text layer and (2) make the PDF PDF/A compatible.
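
To make the scope discussion concrete, here is a rough sketch of how the three scenarios above might map onto one command-line entry point with subcommands. Everything in it is an assumption for illustration: the program name `pdftool-sketch`, the subcommand names, and the `--hocr`, `--pdfa`, and `--output` options are hypothetical and are not the current interface of `recode_pdf`. It only parses arguments; no PDF work is done.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI with one subcommand per scenario."""
    parser = argparse.ArgumentParser(
        prog="pdftool-sketch",  # hypothetical name, not an existing tool
        description="Sketch of a possible tool scope (illustration only).",
    )
    sub = parser.add_subparsers(dest="scenario", required=True)

    # Scenario 1: a set of images (and hOCR results) -> compressed PDF.
    create = sub.add_parser("create-from-images")
    create.add_argument("images", nargs="+", help="input page images")
    create.add_argument("--hocr", help="hOCR file used for the text layer")
    create.add_argument("-o", "--output", required=True)

    # Scenario 2: a PDF with one image per page -> same result as scenario 1.
    rebuild = sub.add_parser("create-from-image-pdf")
    rebuild.add_argument("input_pdf", help="PDF with one image per page")
    rebuild.add_argument("--hocr", help="hOCR file used for the text layer")
    rebuild.add_argument("-o", "--output", required=True)

    # Scenario 3: recode (compress) an existing PDF; extras are opt-in.
    recode = sub.add_parser("recode")
    recode.add_argument("input_pdf", help="uncompressed input PDF")
    recode.add_argument("--hocr", help="optional: insert a text layer")
    recode.add_argument("--pdfa", action="store_true",
                        help="optional: aim for PDF/A compatibility")
    recode.add_argument("-o", "--output", required=True)

    return parser


if __name__ == "__main__":
    # Print the parsed arguments; a real tool would dispatch on args.scenario.
    print(build_parser().parse_args())
```

With this layout, the optional features of the third scenario stay off unless requested, e.g. `pdftool-sketch recode in.pdf --pdfa -o out.pdf`. Whether this should be one tool with subcommands or several separate tools is exactly the open question here.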

Can others think of other scenarios?

I guess there could also be a tool that incorporates calling Tesseract, but I think that should probably be out of scope for this particular project. (I am interested in building public tooling for this, just not within the scope of this repo.)
