HOCR rendering compares unfavorably with tesseract PDF text layer #63
Comments
Thanks for the report, it will take me a bit of time to figure out what is up here. One thing that comes to mind is that it is possible both with Tesseract and archive-pdf-tools to generate a PDF that doesn't have images in it, that might make comparing them easier. (And you can decompress some of the compressed text layers potentially to make it easier to diff them) I checked the Tesseract git history just now, of |
Thanks for the reply! All my source materials are linked to from this ticket, if anyone wants to try to experiment with them further, such as generating text-only PDFs and then diffing text layers etc, or uploading to internet archive. It would take me a while to learn to use tools enough to do things like try to decompress and diff text layers, so it may take me a bit of time to get to that (or not) too!
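For anyone who wants to try the "decompress and diff" idea without learning a PDF toolkit first, here is a rough sketch using only the Python standard library. It is an assumption-laden shortcut, not how either tool works: it scans the raw file for `stream ... endstream` spans, inflates whatever zlib accepts, and diffs the result. The function names `inflate_streams` and `diff_text_layers` are made up for this example. It will miss object streams, non-Flate filters, and encrypted files.

```python
import difflib
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated text of every stream that zlib can decompress."""
    texts = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # image data or a different filter: skip it
        texts.append(data.decode("latin-1"))
    return texts


def diff_text_layers(pdf_a, pdf_b):
    """Unified diff of the inflated streams of two PDFs, as a list of lines."""
    a = "\n".join(inflate_streams(pdf_a)).splitlines()
    b = "\n".join(inflate_streams(pdf_b)).splitlines()
    return list(difflib.unified_diff(a, b, "a", "b", lineterm=""))
```

Usage would be along the lines of `diff_text_layers(open("identifier.tesseract.pdf", "rb").read(), open("identifier.recode_pdf.pdf", "rb").read())`. The output is most readable on text-only PDFs, where the image streams do not get in the way.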
Hm, now this makes me curious -- is running |
There are a few differences, mainly relating to how DPI is provided and handled. I don't have a Mac, so I would not be able to test the MacOS Preview-specific things, but I can test with evince, Firefox's pdf.js, and mupdf. |
Cool, yeah, I'd guess the underlying thing is just trying to figure out why the text layer differs at all, and if it can be made the same or more the same -- clearly the differences are leading MacOS Preview's heuristics to make different determinations of what constitutes a column of text, but even without access to MacOS Preview, the curious thing would be why/how the text element positioning differs in the first place! |
Oh and I will say, in none of my tests did I supply a "dpi" argument to any tool. I don't know if tesseract and recode_pdf will extract the dpi from the TIFF source, or what. I had enough variables going on in my testing that I decided to just ignore providing dpi in an argument and let the tools do what they will. It would be interesting if that would make a difference, if supplying a If you think that is plausible, I could try it? Just supply the known dpi of the original TIFF? To both tesseract and recode_pdf I guess? |
In the case of the simple image, it already has a DPI embedded, so supplying |
FYI: Tesseract also has a text only pdf option, pass |
Awesome, thanks! All three source TIFFs should have dpi metadata embedded. The second image is 600dpi, the first and third are 400dpi. |
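As background on why DPI could plausibly matter: hOCR boxes are in image pixels, while default PDF user space is in points (1/72 inch), so any renderer has to convert between the two using some DPI value. The sketch below only shows that arithmetic. It is not the code of tesseract or recode_pdf, and the 100-pixel line height is a made-up number for illustration.

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


def pixels_to_points(pixels, dpi):
    """Convert a length in image pixels to PDF points at the given DPI."""
    return pixels * POINTS_PER_INCH / dpi


line_height_px = 100  # hypothetical line height taken from an hOCR bbox
for dpi in (400, 600):  # the DPI values of the sample TIFFs in this thread
    print(dpi, pixels_to_points(line_height_px, dpi))
```

The same pixel height gives 18 pt at 400 dpi and 12 pt at 600 dpi, so a tool that assumes the wrong DPI would scale every text box by the ratio between the assumed and the real value.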
(Also, while it's a different issue, the third example, "More complex graphical page" has pretty significant visible artifacts as a result of the MRC compression applied by recode_pdf. the other two do not) |
Note: I think our TIFFs should not have an alpha channel. And then the jp2's we make from them should not have an alpha layer. And it should not be a problem. But it has been a problem with some of our test data -- for instance, for whatever reason MacOS Preview seems to add an alpha channel to everything it saves. If you try to use this on something with an alpha channel, you'll get an error from img2pdf, a Maybe I'll wrap that in a better error. Note: This is slow at present. MUCH slower than previous PDF generation. We may look at speeding it up. We may look at caching individual PDF pages. |
It looks like pdfrenderer.py calculates a font size that is too large in some cases. There is some logic to deal with line slope and other factors that are beyond me, but I think that most calculations fall back on the default font size, and this is where the rendering is consistent with Tesseract (I wonder if Tesseract also falls back on a default font size in most cases). I am using a simple pdfrenderer.patch for our major papers project. It will still lean toward mis-sized fonts without specifying DPI, and our pages tend to be produced with: |
Using recode_pdf (internetarchivepdf 1.5.2) and tesseract (5.3.0).
I have three single-page examples. For each one, I ran:
tesseract identifier.tiff identifier.tesseract -l eng pdf
tesseract identifier.tiff identifier.tesseract -l eng hocr
recode_pdf --bg-downsample 3 --from-imagestack identifier.tiff --hocr-file identifier.tesseract.hocr -o identifier.recode_pdf.pdf
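To repeat the three commands above for each sample page, something like the following could be used. This is just a convenience sketch: the flags are exactly the ones shown above, while the helper names `build_commands` and `run_all` are invented here, and it assumes `tesseract` and `recode_pdf` are on the PATH.

```python
import subprocess


def build_commands(identifier):
    """Return the three argv lists from this report for one page identifier."""
    tiff = f"{identifier}.tiff"
    base = f"{identifier}.tesseract"
    return [
        # 1. PDF with the text layer written by tesseract itself
        ["tesseract", tiff, base, "-l", "eng", "pdf"],
        # 2. hOCR output from the same OCR run settings
        ["tesseract", tiff, base, "-l", "eng", "hocr"],
        # 3. PDF built by recode_pdf from the TIFF plus that hOCR file
        ["recode_pdf", "--bg-downsample", "3", "--from-imagestack", tiff,
         "--hocr-file", f"{base}.hocr", "-o", f"{identifier}.recode_pdf.pdf"],
    ]


def run_all(identifier):
    """Run the three steps in order, stopping at the first failure."""
    for argv in build_commands(identifier):
        subprocess.run(argv, check=True)
```

Calling `run_all("identifier")` in a directory containing `identifier.tiff` should leave `identifier.tesseract.pdf`, `identifier.tesseract.hocr`, and `identifier.recode_pdf.pdf` side by side for comparison.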
I am finding that the text layout produced by the second process, the one involving recode_pdf, is not identical to, and is inferior to, the text layout tesseract produces itself. I have put all my sample files on an S3 bucket for investigation, although I don't know if they will stay there forever.
Simple page
A simple, clear textual book page
If you select text in the PDF, the selection in the recode_pdf one has a much smaller height than in the tesseract one: the selection bar does not go all the way to the top of the ascenders like it does with the tesseract one.
In this case, the recode_pdf version is perfectly usable, but it demonstrates that the two outputs are not identical.
Somewhat more complicated page
This one is also a book page, but has a figure in the middle of the page interrupting the text, some background coloration, and the photography was not perfectly squared, so the text is somewhat diagonal.
In this one, the recode_pdf-generated textual data all seems to have double height, making it very confusing to select text, and making highlights on search-within-the-PDF results also very confusing, a definite usability issue. I only have one line of text selected in these screenshots.
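One way to check whether the double height is already present in the input is to look at the line boxes in the hOCR file itself. The sketch below pulls the `bbox x0 y0 x1 y1` values out of the `title` attribute of `ocr_line` elements and reports their heights in pixels. The names `LineBoxes` and `line_heights` are invented for this example, and it ignores other line-like classes (for instance captions or headers), so treat it as a starting point only.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 36 92 580 122; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxes(HTMLParser):
    """Collect (x0, y0, x1, y1) for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocr_line" not in (attrs.get("class") or "").split():
            return
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:
            self.boxes.append(tuple(int(v) for v in match.groups()))


def line_heights(hocr_text):
    """Return the pixel height of each ocr_line bounding box, in file order."""
    parser = LineBoxes()
    parser.feed(hocr_text)
    return [y1 - y0 for (_x0, y0, _x1, y1) in parser.boxes]
```

On a skewed page like this one, the bounding box of a line also spans the vertical rise of its baseline, so box height overstates glyph height. Whether that accounts for the double height seen here is not established; comparing these heights with the visible text height in the scan would be a first check.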
More complex graphical page
This is a graphical advertisement that only has a little bit of text on it, at various places and in various fonts.
This one is harder to explain and demonstrate, and for me the problem only reproduces in MacOS Preview.
If I open the recode_pdf PDF in MacOS Preview (with "Live Text" disabled, yup), and try to drag to select the line at "effective residual deposit", I can't select the whole line -- the layout of the text is leading the PDF reader to think there is a column there or something.
This one does not reproduce in the Chrome PDF viewer; selection works okay there. But it does reproduce in MacOS Preview, and these screen recordings are from there. (I disable "Live Text" in my MacOS settings to ensure that what I'm seeing in Preview is embedded text data from the PDF only, not OCR that MacOS Preview does itself on-the-fly under the branding "Live Text"!) I realize text order in PDFs is a heuristic applied by the viewer, but this demonstrates that something in the layout was different -- the layout from tesseract led to a successful heuristic in MacOS Preview, and the one from here did not.
What's going on?
I know you originally ported the HOCR rendering from tesseract. Brainstorming....