Define scope of tooling and work to improve for that scope #47

MerlijnWajer · 2022-05-07T12:10:11Z

Right now the naming of the tools is a bit confusing. The main tool is called "recode_pdf", but it doesn't really recode PDFs: it creates PDFs, inserts text layers, and performs MRC compression.

Since I am working on adding a tool that actually recodes existing PDFs (MRC compressing them, and not doing anything else for starters), it might make sense to think about renaming the tools, and also to define what each tool ought to do.

I think there are a few scenarios:

1. Given a set of images (and hOCR results), create a (compressed) PDF - like what ocrmypdf does.
2. Given an input PDF with just one image per page, do what the above step does.
3. Given an uncompressed PDF, compress (recode) the PDF. Optional features here are to (1) insert a text layer and (2) make the PDF PDF/A compatible.
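
To make the split a bit more concrete, here is a rough sketch of how the three scenarios above could map onto separate entry points. Everything in it is hypothetical: the program name `pdftool`, the subcommand names, and the flags are placeholders for discussion, not existing tools or options in this repo, and nothing here does any actual PDF work.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI layout with one subcommand per scenario.

    All names and flags are placeholders for discussion only.
    """
    parser = argparse.ArgumentParser(prog="pdftool")  # hypothetical name
    sub = parser.add_subparsers(dest="command", required=True)

    # First scenario: images (and optional hOCR) -> compressed PDF.
    create = sub.add_parser("create")
    create.add_argument("--images", nargs="+", required=True)
    create.add_argument("--hocr")
    create.add_argument("--out", required=True)

    # Second scenario: PDF with one image per page -> same output as "create".
    from_pdf = sub.add_parser("from-image-pdf")
    from_pdf.add_argument("--in-pdf", required=True)
    from_pdf.add_argument("--hocr")
    from_pdf.add_argument("--out", required=True)

    # Third scenario: recode (MRC compress) an existing PDF.
    recode = sub.add_parser("recode")
    recode.add_argument("--in-pdf", required=True)
    recode.add_argument("--out", required=True)
    recode.add_argument("--hocr")  # optional: insert a text layer
    recode.add_argument("--pdfa", action="store_true")  # optional: PDF/A output
    return parser
```

With a layout like this, `build_parser().parse_args(["recode", "--in-pdf", "in.pdf", "--out", "out.pdf"])` would select the plain recoding path, and the text layer and PDF/A options stay opt-in.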

Can others think of other scenarios?

I guess there could be a tool that also incorporates calling Tesseract, but I think that should probably be out of scope for this particular project (I am interested in building public tooling for this, just not in the scope of this repo).
