OCR Produces Non-Existent Text from Bleed-Through Artifacts #596

iansmirlis · 2025-03-04T21:37:44Z

I'm not sure whether this is a real issue, but I thought it was worth investigating.

When processing scanned documents, the OCR model sometimes produces text that does not exist in the original image. Initially, this seemed like a hallucination issue, but after careful inspection, I noticed that the generated text (mostly numbers and random characters) corresponds to very faint artifacts from text appearing on the reverse side of the scanned page.

This issue seems to be caused by bleed-through, where faintly visible text from the reverse side of the page is mistakenly recognized as foreground text. The problem gets worse as the scan DPI increases.

I think the model is too sensitive to bleed-through. That is no big deal, since I can preprocess the images in this case or change the confidence threshold, but maybe the pipeline could also do some preprocessing, or the OCR model could be trained to be less sensitive to such effects.
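For anyone hitting the same problem, here is a minimal sketch of the two workarounds mentioned above: whitening faint pixels before OCR, and dropping low-confidence lines afterwards. It is not part of surya. The function names, the `Line` class, the cutoff of 160, the minimum confidence of 0.6, and the sample confidence values are all assumptions made for illustration, and the image is a plain list of grayscale rows so the example has no dependencies.

```python
from dataclasses import dataclass


def suppress_bleed_through(gray, cutoff=160):
    """Set every pixel brighter than `cutoff` to white (255).

    `gray` is a list of rows of 0-255 grayscale values. Dark foreground
    strokes are kept unchanged; lighter pixels, such as faint show-through
    from the reverse side, are erased. The cutoff is an assumed value and
    would need tuning per scan.
    """
    return [[255 if px > cutoff else px for px in row] for row in gray]


@dataclass
class Line:
    """Hypothetical stand-in for one recognized text line."""
    text: str
    confidence: float


def drop_low_confidence(lines, min_confidence=0.6):
    """Keep only lines whose confidence is at least `min_confidence`."""
    return [ln for ln in lines if ln.confidence >= min_confidence]


# Toy 2x4 "scan": 25-30 is dark ink, 185-200 is faint bleed-through.
page = [
    [250, 30, 250, 190],
    [245, 25, 200, 185],
]
cleaned = suppress_bleed_through(page)

# Confidence values here are made up for the example.
kept = drop_low_confidence([Line("ανδρικός", 0.97), Line("191:1 1818.", 0.31)])
```

A fixed global cutoff is the simplest option; it can also erase genuinely light print, so an adaptive threshold may work better on unevenly lit scans.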

Thanks for the great project!

This is an example part of the page that produces nonsense:

Produces:

ανδρείκελο «ομοίωμα ανθρώπου»

< αρχ. άνδρείκελον (ήδη τον 5ο αι. π.Χ. σε Πλάτωνα και Ξενοφώντα] < άνδρ(ο)- + -είκελον < επίθ. είκελος «όμοιος», για το οποίο βλ.λ. εικόνα.

ανδρείος -> ανδρας

ανδριαντας

< apx. άνδριάς, - άντος (ήδη μυκ. a-di-ri-ja-pi: *άνδριαφι, οργανική πληθ.) < άνδρίον, υποκορ. τού ανήρ, άνδρος: 1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

11 16 21 11 -

大型大发电影群中国体育

MM #582KO X THE HATE

191:1 1818.

The state the complex of the results of the states

Color Colline

ανδρικός -> άνδρας

ανδρισμός -> άνδρας -> « » » » » « « « « «
ανδρώνω → άνδρας

ανε- στερητικό | | | | | | | | | | | | |

< μεσν. άνε-, που προέρχεται από επίθ. με άν- στερητ. όταν ακολουθούσε -ε- (π.χ. άν-έκδοτος, άνέλπιστος, άν-επίδεκτος), από όπου στη συνέχεια αυτονομήθηκε ως στερητ. μόρφημα (π.χ. μεσν. άνέγνωρος).

tarun-menta · 2025-03-05T20:59:25Z

Thanks for pointing this out! This is an issue with the new text detection model from the latest surya release. The model seems to be a little too sensitive to bleed-through text. We'll patch this in the next release!

tarun-menta closed this as completed Mar 5, 2025
