Of two inverted top right texts one gets scanned double, the upper one disappears #3871
Comments
I tried |
AFAIK, fast was trained on inverted text and non-inverted text and on upright pages and upside down pages. |
I'm not convinced it's inversion-related. I think it already originates somewhere segments get merged into each other, probably while searching for underlines. If I run this command, 'wis - clear' is still doubled and 'print' is still missing:
tesseract --dpi 300 -l Latin 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg outputwithoutinvert
So textline inversion might be removed as a label. |
By the way, this one is compiled without legacy, so the problem is in the new (non-legacy) parts. |
|
I also no longer think that it is related to textline inversion as the issue also occurs in old versions like 4.0.0. My previous
The layout detection is mostly still old code. |
I've now pinpointed the disappearing upper bounding box from |
This might be involved:
B:28 R:1 -- Can't do isolated row stats.
tesseract -c textord_restore_underlines=1 --dpi 300 -l Latin -c textord_noise_rejrows=0 -c textord_debug_block=28 -c textord_noise_debug=1 -c textord_debug_tabfind=1 -c textord_debug_bugs=1 -c textord_show_final_rows=1 -c tosp_debug_level=6 /home/rmast/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg outputdebug85online |
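A long list of -c switches like the one above is easy to mistype. As a minimal sketch, the same invocation can be assembled from a dictionary; the helper name build_debug_command and the shortened file names are hypothetical, while the variable names are the ones used in the command above.

```python
import shlex


def build_debug_command(image, output_base, lang="Latin", dpi=300, config=None):
    """Return a tesseract argv list with one `-c key=value` pair per variable."""
    argv = ["tesseract", "--dpi", str(dpi), "-l", lang]
    for key, value in (config or {}).items():
        argv += ["-c", f"{key}={value}"]
    argv += [image, output_base]
    return argv


# Debug variables taken from the command in this thread.
debug_vars = {
    "textord_restore_underlines": 1,
    "textord_noise_rejrows": 0,
    "textord_debug_block": 28,
    "textord_noise_debug": 1,
    "textord_debug_tabfind": 1,
    "textord_debug_bugs": 1,
    "textord_show_final_rows": 1,
    "tosp_debug_level": 6,
}

# "page.jpg" and "outputdebug" are placeholder names, not the real files.
cmd = build_debug_command("page.jpg", "outputdebug", config=debug_vars)
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Keeping the variables in one dictionary makes it simple to switch a single debug flag on or off between runs.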
We know that some parts are skipped in images with a complex (table-like) layout. Tesseract has only basic document layout analysis. For complicated document layouts, do your own layout segmentation and either store it in a UZN file or OCR each segment individually. Here is an example with amitdo's test image.
tesseract inverted.png - --psm 4
UZN file inverted.uzn loaded.
> print
> wis - clear |
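For anyone who wants to try the UZN route, here is a minimal sketch of writing such a file next to the image. It assumes the usual UZN line layout of "left top width height label" and that Tesseract picks up a .uzn with the same base name as the image when run with --psm 4, as in the example above. The helper names and the zone coordinates are made up for illustration.

```python
import tempfile
from pathlib import Path


def uzn_lines(zones):
    """Format (left, top, width, height, label) tuples as UZN text lines."""
    return [f"{left} {top} {width} {height} {label}"
            for left, top, width, height, label in zones]


def write_uzn(image_path, zones):
    """Write <image base name>.uzn next to the image and return its path."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text("\n".join(uzn_lines(zones)) + "\n")
    return uzn_path


# Hypothetical zones for the two inverted lines; labels must not contain spaces.
zones = [
    (10, 5, 120, 30, "Text"),
    (10, 40, 160, 30, "Text"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_uzn(Path(tmp) / "inverted.png", zones)
    print(path.name)
    print(path.read_text(), end="")
```

The zones could come from any external segmenter, so each region is recognized on its own instead of relying on Tesseract's layout analysis.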
Please let us know if you find an open-source automatic segmenter that, in general and without supervision, does a better job than Tesseract itself. I guess that would be a hit.
Or could you say that x4 upscaling in general does a better job?
|
That's interesting! That makes focusing on the issue easier. I've run it with and it gave when the cleanup_blocks was commented out. With the cleanup blocks this was the debug result (and no output).
Vertical skew vector=(0,1)
-c invert_threshold=0.5 does not help in recognizing the block. |
./migneuzn ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg > ~/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.uzn
So that's also a possibility to focus on the issue! Thanks for these hints! |
Unfortunately, cutting out the picture with Paint re-encoded the JPEG, so it isn't representative. |
Convert the whole image to PNG first, and then do image processing. |
I saw some attempts to solve this problem with prepared templates (e.g. for invoices) based on a known document source. With this approach you can skip some parts like the logo, header, footer etc. to speed up OCR, or use custom OCR/postprocessing of amounts. I heard there are some attempts to do image/document segmentation by machine learning, but I did not see any open-source (working) solution.
In the docs, there is a link to a test for the optimal letter size. So scaling could help, but you need to know the original letter size in advance to calculate the scaling. In a complicated layout with different fonts and sizes, you of course first need to split the image into uniform blocks... |
I tried EasyOCR as a segmenter. Using its segments as UZN on the image or on the inverted image doesn't make a difference. I'm still inclined to dig into the error(s) themselves, despite the lack of test effort once the error is solved.
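The scaling calculation itself is simple once the letter size is known. A minimal sketch follows; the target capital height of 30 pixels is an assumption used only for illustration (check the linked test for the value that suits your data), and both function names are hypothetical.

```python
def scale_factor(measured_cap_height_px, target_cap_height_px=30.0):
    """Factor by which to resize so capitals reach the target pixel height."""
    if measured_cap_height_px <= 0:
        raise ValueError("measured capital height must be positive")
    return target_cap_height_px / measured_cap_height_px


def scaled_size(width, height, factor):
    """New image dimensions after applying the factor, rounded to whole pixels."""
    return round(width * factor), round(height * factor)


# Capitals measured at 10 px in a block that is 800 x 600 px.
factor = scale_factor(10)
print(factor)                         # 3.0
print(scaled_size(800, 600, factor))  # (2400, 1800)
```

With mixed font sizes the factor differs per block, which is why the image has to be split into uniform blocks first.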
|
I'm now on track to find the cause of the doubled 'wis - clear'. The second row of block 28 gives 4 words: a blob of the full row, "wis", "-" and "clear". The blob containing ">" is skipped when the space after it appears to end before the end of the full row in the first blob. The full-row blob is not inverted in CheckInverseFlagAndDirection within stepblob.cpp:222. The other outlines of this row are inverted. |
-c edges_use_new_outline_complexity=1 doesn't solve these issues. |
There appears to be something wrong with the decision-making around good and bad (rejected) blobs:
The parent of the lower row appears to be kept alive, while the children of the upper row are all rejected as well. Keeping the parent of the lower row alive makes the > rejected. |
Just killing the non-inverted parents in stepblob.cpp solves the issue for both lines:
However, the question is whether there are examples of parents that must not be killed. Under what conditions should (parts of) parents be preserved, and should those parents be inverted if their children are inverted as well? Are there other paths leading to this !blob_is_good that I miss when I just kill everything as if they are parents of maintained children? |
I tried the effects of killing the parents on 5.1.0 with the full page using ocrmypdf.
ocrmypdf --image-dpi 300 --pdfa-image-compression lossless -O0 ../rmast/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg formulierhocrjpgmetpatch5.1.0.pdf
For some reason the resulting selection in Adobe Acrobat Reader improves with this patch: the second column 'Waarom dit formulier?' can be selected separately with my patched version, while selecting it on the original 5.1.0 version tries to select the second column in parallel and pastes the lines intermixed. |
With 5.2.0 default settings the inverted Toelichting 2.1 is read correctly; however, in none of the versions is the bottom line with the ® sign complete. |
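To make the question concrete, here is a toy model of the rule being tried. It is not Tesseract code: the Outline class, the promote function and the example rows are all hypothetical, and the only rule modelled is "drop a non-inverted parent that has inverted children and keep the children as separate blobs".

```python
from dataclasses import dataclass, field


@dataclass
class Outline:
    name: str
    inverse: bool  # True if the outline was flagged as inverted
    children: list = field(default_factory=list)


def promote(outline):
    """Return the names of the blobs kept under the toy rule.

    A non-inverted parent with at least one inverted child is dropped and its
    children are kept instead; any other outline is kept as one blob together
    with its children (e.g. the hole in an 'o').
    """
    if not outline.inverse and any(c.inverse for c in outline.children):
        kept = []
        for child in outline.children:
            kept.extend(promote(child))
        return kept
    return [outline.name]


# The full-row box is not inverted, while the glyph outlines inside it are.
row = Outline("row-box", False, [
    Outline(">", True), Outline("wis", True),
    Outline("-", True), Outline("clear", True),
])
# An ordinary letter whose hole is not inverted should survive untouched.
letter = Outline("o", False, [Outline("hole", False)])

print(promote(row))     # ['>', 'wis', '-', 'clear']
print(promote(letter))  # ['o']
```

The second example is the kind of parent that must not be killed, which is exactly where the real condition in stepblob.cpp needs care.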
Don't do your tests with PDF as output. Different PDF viewers can present the same file differently. |
Yes, Zathura makes a mess of the selection, not clearly showing what lines are selected or not.
|
Also, I suggest the following docs<https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md> (black text on white background).
i3871_inverted.zip<https://github.com/tesseract-ocr/tesseract/files/9160801/i3871_inverted.zip>
|
An error should not be papered over by manipulating the source image until someone looking at it approves the result. Errors should be examined and solved, aiming at a Tesseract that operates unattended, at least for the purpose of the image compression Merlijn Wajer wants to achieve at the Internet Archive.
After upscaling (4x)
[3871-ROI-x4]<https://user-images.githubusercontent.com/13571208/180238807-43dcbfcc-ab3b-4779-9ac1-d5ca23ad1d47.png>
output:
print
Wis - clear
|
I'm further investigating an issue I spotted earlier in #3141.
In this picture, above the text 'wis-clear' on the right, there is the text 'print'. The text 'print' disappears completely and the text 'wis-clear' is read in twice.
Environment
Current Behavior:
Some inverted text on the top right disappears, other text gets scanned in twice.
There are two similar bounding boxes involved:
Expected Behavior:
Clearly readable text should be recognized without failure.
Suggested Fix: