I'm not sure whether this is a true issue, but I thought it was worth investigating.
When processing scanned documents, the OCR model sometimes produces text that does not exist in the original image. Initially, this seemed like a hallucination issue, but after careful inspection, I noticed that the generated text (mostly numbers and random characters) corresponds to very faint artifacts from text appearing on the reverse side of the scanned page.
This issue seems to be caused by bleed-through, where slightly visible text from the back of the page is mistakenly recognized as actual foreground text. The issue gets worse as the scan DPI increases.
I think the model is too sensitive to bleed-through. This is no big deal, since I can preprocess the images in this case or change the confidence threshold, but maybe the pipeline could also do some preprocessing, or the OCR model could be trained to be less sensitive to such effects.
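To make the two workarounds concrete, here is a minimal sketch of what I mean. It is not surya's API: `suppress_bleed_through`, `drop_low_confidence`, the `OcrLine` type, and the default thresholds are all hypothetical names and values chosen for illustration, and the image is represented as plain rows of 0-255 grayscale values so the example has no dependencies.

```python
from dataclasses import dataclass


def suppress_bleed_through(gray, white_point=200):
    """Push faint marks to pure white and rescale the darker pixels.

    gray: rows of grayscale ints, 0 = black ink, 255 = white paper.
    white_point: assumed cutoff; pixels at or above it are treated as
    paper or bleed-through and become 255. Needs tuning per scan.
    """
    if not 0 < white_point <= 255:
        raise ValueError("white_point must be in 1..255")
    return [
        [255 if p >= white_point else round(p * 255 / white_point) for p in row]
        for row in gray
    ]


@dataclass
class OcrLine:
    # Hypothetical stand-in for one recognized line and its confidence.
    text: str
    confidence: float


def drop_low_confidence(lines, min_confidence=0.8):
    """Keep only lines whose confidence is at least min_confidence."""
    return [line for line in lines if line.confidence >= min_confidence]


if __name__ == "__main__":
    # Two dark ink pixels followed by two faint bleed-through pixels.
    print(suppress_bleed_through([[30, 120, 210, 250]]))

    # Confidence values here are made up for the example.
    lines = [OcrLine("ανδρικός -> άνδρας", 0.97), OcrLine("Color Colline", 0.41)]
    print([line.text for line in drop_low_confidence(lines)])
```

The level clipping only helps when the bleed-through is clearly lighter than the real ink, and the confidence filter only helps if the spurious lines actually score lower, which I have not verified for every page.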
Thanks for the great project!
This is an example part of the page that produces nonsense:
Produces:
ανδρείκελο «ομοίωμα ανθρώπου»
< αρχ. άνδρείκελον (ήδη τον 5ο αι. π.Χ. σε Πλάτωνα και Ξενοφώντα] < άνδρ(ο)- + -είκελον < επίθ. είκελος «όμοιος», για το οποίο βλ.λ. εικόνα.
The state the complex of the results of the states
Color Colline
ανδρικός -> άνδρας
ανδρισμός -> άνδρας -> « » » » » « « « « «
ανδρώνω → άνδρας
ανε- στερητικό | | | | | | | | | | | | |
< μεσν. άνε-, που προέρχεται από επίθ. με άν- στερητ. όταν ακολουθούσε -ε- (π.χ. άν-έκδοτος, άνέλπιστος, άν-επίδεκτος), από όπου στη συνέχεια αυτονομήθηκε ως στερητ. μόρφημα (π.χ. μεσν. άνέγνωρος).
Thanks for pointing this out! This is an issue with the new text detection model from the latest surya release. The model seems to be a little too sensitive to bleed-through text. We'll patch this in the next release!
This is an example part of the page that produces nonsense:
Produces:
ανδρείκελο «ομοίωμα ανθρώπου»
< αρχ. άνδρείκελον (ήδη τον 5ο αι. π.Χ. σε Πλάτωνα και Ξενοφώντα] < άνδρ(ο)- + -είκελον < επίθ. είκελος «όμοιος», για το οποίο βλ.λ. εικόνα.
ανδρείος -> ανδρας
ανδριαντας
< apx. άνδριάς, - άντος (ήδη μυκ. a-di-ri-ja-pi: *άνδριαφι, οργανική πληθ.) < άνδρίον, υποκορ. τού ανήρ, άνδρος: 1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
11 16 21 11 -
大型 大发电影群 中国体育
MM #582KO X THE HATE
191:1 1818.
The state the complex of the results of the states
Color Colline
ανδρικός -> άνδρας
ανε- στερητικό | | | | | | | | | | | | |
< μεσν. άνε-, που προέρχεται από επίθ. με άν- στερητ. όταν ακολουθούσε -ε- (π.χ. άν-έκδοτος, άνέλπιστος, άν-επίδεκτος), από όπου στη συνέχεια αυτονομήθηκε ως στερητ. μόρφημα (π.χ. μεσν. άνέγνωρος).