OCR Produces Non-Existent Text from Bleed-Through Artifacts #596

Closed
iansmirlis opened this issue Mar 4, 2025 · 1 comment
@iansmirlis

Not sure if this is a true issue but I thought it's worth investigating.

When processing scanned documents, the OCR model sometimes produces text that does not exist in the original image. Initially, this seemed like a hallucination issue, but after careful inspection, I noticed that the generated text (mostly numbers and random characters) corresponds to very faint artifacts from text appearing on the reverse side of the scanned page.

This issue seems to be caused by bleed-through, where faintly visible text from the reverse side of the page is mistakenly recognized as foreground text. The problem gets worse as the scan DPI increases.

I think the model is too sensitive to bleed-through. This is no big deal, since the images can be preprocessed in this case, or I could change the confidence threshold. Still, maybe the pipeline could do some of this preprocessing itself, or the OCR model could be trained to be less sensitive to such effects.
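For illustration, the kind of preprocessing meant here could be as simple as pushing faint pixels to white before running OCR. This is only a minimal sketch, not anything the pipeline does: the function name `suppress_bleed_through` and the `cutoff` value are hypothetical, and it assumes a grayscale page where bleed-through is clearly lighter than the real foreground text.

```python
def suppress_bleed_through(gray, cutoff=170):
    """Return a copy of a grayscale image with faint pixels pushed to white.

    gray:   list of rows, each a list of ints in 0..255 (0 = black).
    cutoff: hypothetical threshold (an assumption, tune per scan); any pixel
            lighter than this is treated as bleed-through or paper and set
            to 255. Darker pixels are kept unchanged.
    """
    return [[255 if px > cutoff else px for px in row] for row in gray]


# Tiny synthetic "scan": dark strokes (30), faint bleed-through (200), paper (240).
page = [
    [240, 30, 240, 200],
    [240, 30, 200, 200],
]
cleaned = suppress_bleed_through(page)
```

A fixed cutoff like this would also erase genuinely faint foreground text, so on a real scan it would probably need tuning per document or replacing with an adaptive threshold.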

Thanks for the great project!


This is an example excerpt from a page that produces nonsense:

Image

Produces:

ανδρείκελο «ομοίωμα ανθρώπου»

< αρχ. άνδρείκελον (ήδη τον 5ο αι. π.Χ. σε Πλάτωνα και Ξενοφώντα] < άνδρ(ο)- + -είκελον < επίθ. είκελος «όμοιος», για το οποίο βλ.λ. εικόνα.

ανδρείος -> ανδρας

ανδριαντας

< apx. άνδριάς, - άντος (ήδη μυκ. a-di-ri-ja-pi: *άνδριαφι, οργανική πληθ.) < άνδρίον, υποκορ. τού ανήρ, άνδρος: 1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

11 16 21 11 -

大型 大发电影群 中国体育

MM #582KO X THE HATE

191:1 1818.

The state the complex of the results of the states

Color Colline

ανδρικός -> άνδρας

  • ανδρισμός -> άνδρας -> « » » » » « « « « «
  • ανδρώνω → άνδρας

ανε- στερητικό | | | | | | | | | | | | |

< μεσν. άνε-, που προέρχεται από επίθ. με άν- στερητ. όταν ακολουθούσε -ε- (π.χ. άν-έκδοτος, άνέλπιστος, άν-επίδεκτος), από όπου στη συνέχεια αυτονομήθηκε ως στερητ. μόρφημα (π.χ. μεσν. άνέγνωρος).

@tarun-menta
Collaborator

Thanks for pointing this out! This is an issue with the new text detection model from the latest surya release. The model seems to be a little too sensitive to bleed-through text. We'll patch this in the next release!
