tesseract fails to read simple numbers #4285

Open
embeh opened this issue Jul 14, 2024 · 15 comments

@embeh

embeh commented Jul 14, 2024

Current Behavior

I am using pytesseract (which calls /usr/bin/tesseract) to read the digits on a gas meter.
Unfortunately, it very often misreads most of the digits and is very unreliable.

The actual command to get the number string from the image is
pytesseract.image_to_string(img, lang='eng', config='--dpi 70 --psm 8 -c tessedit_char_whitelist=,0123456789')

Here is an example image (after some image processing):
[image: 20240714-162351_08_ocr]

When running this through tesseract (as described above), I just get "2734"... :-(

Any ideas how to improve this, given that there never will be anything but numbers from 0-9 in the image...?

Expected Behavior

Correctly read the numbers. For the image example, this should be "4428734"

Suggested Fix

No response

tesseract -v

tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Operating System

No response

Other Operating System

Ubuntu 20

uname -a

Linux myhost 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

Ubuntu in WSL2

Other Information

No response

@embeh
Author

embeh commented Jul 14, 2024

OK, the page segmentation mode seems to be the issue here.

Replacing --psm 8 with --psm 7 produces much better results (so does --psm 11 but none of the others) - but I have no idea why.
PSM 8 is advertised as "single word...", isn't that what we have here?
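
To make the comparison between segmentation modes repeatable, a small helper along these lines could be used. This is only a sketch: `build_config` and `compare_psm_modes` are hypothetical names, and the defaults simply mirror the `--dpi 70` and whitelist values from the command in the report. `pytesseract.image_to_string` is the same call used above.

```python
def build_config(psm, whitelist=",0123456789", dpi=70):
    """Assemble a Tesseract config string like the one in the report."""
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def compare_psm_modes(img, modes=(7, 8)):
    """Run the same image through several page segmentation modes.

    Returns a dict mapping each PSM value to the recognized text.
    """
    # Imported lazily so build_config can be used without pytesseract installed.
    import pytesseract

    return {
        psm: pytesseract.image_to_string(
            img, lang="eng", config=build_config(psm)
        ).strip()
        for psm in modes
    }
```

Calling `compare_psm_modes(img)` on the same preprocessed image gives the outputs for `--psm 7` and `--psm 8` side by side, which makes differences like the one described here easy to attach to a report.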

@DominicMukilan

Why not close the issue if it's resolved?

@embeh
Author

embeh commented Jul 16, 2024

Well, I think psm 8 should be able to handle this, too, no?

@v3ss0n

v3ss0n commented Jul 18, 2024

It is still an issue. The Tesseract LSTM engine has a very hard time recognizing very simple numbers, while PaddlePaddleOCR recognizes them well.

OCRCut

Here is the result:

7% 7% 23
6 6 8

psm 8 doesn't help.

The legacy engine does better on numbers, but it fails badly on alphabetic text.

@uttaran-das

Hi @embeh, what kind of image processing techniques did you use?

@embeh
Author

embeh commented Aug 6, 2024

> Hi @embeh, what kind of image processing techniques did you use?

A few simple OpenCV filters to crop, rotate, and deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

@uttaran-das

> > Hi @embeh, what kind of image processing techniques did you use?
>
> A few simple OpenCV filters to crop, rotate, and deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps?

Try increasing the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands; I was just interested in the processing you had already done.

@embeh
Author

embeh commented Aug 6, 2024

> Try increasing the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands; I was just interested in the processing you had already done.

I don't really understand the motivation. If, for the same pixels, psm 7 works fine but psm 8 does not - why would a change in the image processing make a difference?

In addition, the contrast is as big as it can be: the background is pure white, the text is fully black, i.e. it is a binary image. Any grey you might see is only due to how github renders the image.

@amitdo
Collaborator

amitdo commented Aug 23, 2024

> PSM 8 is advertised as "single word...", isn't that what we have here?

What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

If it was 'H e l l o' then you could call it a word, but inside a text line, Tesseract will still consider any big enough horizontal white space as a word separator.
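
The role of horizontal white space can be illustrated by measuring the blank column runs between glyphs in a binary image. The sketch below is not Tesseract's actual word segmentation; `column_gaps` and `wide_gaps` are hypothetical helpers, the threshold is an arbitrary assumption, and the image is a plain list of rows with 1 for ink and 0 for background.

```python
def column_gaps(image):
    """Return the widths of blank column runs that lie between inked columns.

    Leading and trailing margins are ignored.
    """
    width = len(image[0]) if image else 0
    # A column counts as inked if any row has ink at that x position.
    inked = [any(row[x] for row in image) for x in range(width)]
    gaps, run, seen_ink = [], 0, False
    for col in inked:
        if col:
            if seen_ink and run:
                gaps.append(run)
            run, seen_ink = 0, True
        else:
            run += 1
    return gaps


def wide_gaps(image, threshold):
    """Return the gaps at least `threshold` columns wide (assumed cutoff)."""
    return [g for g in column_gaps(image) if g >= threshold]
```

On evenly spaced counter digits every gap has roughly the same width, so once that width is large relative to the glyphs, each digit can look like a separate word.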

@amitdo
Collaborator

amitdo commented Aug 23, 2024

tesseract 4.1.1 is too old and we don't support it.

You said you get a better result with psm 7, but you didn't provide the output with this psm.

@embeh
Author

embeh commented Aug 23, 2024

> tesseract 4.1.1 is too old and we don't support it.

OK. Unfortunately that seems to be the latest offered by the default Ubuntu repository (and pytesseract?).

> You said you get a better result with psm 7, but you didn't provide the output with this psm.

--psm 7 produces the output "4428734"
--psm 8 produces the output "4L2B734"

Both were run on the identical image file.
You should be able to reproduce this by downloading the image above and running it through tesseract.

So the result is not completely wrong, and psm 8 does not seem to split the result into multiple words or the like. It just misreads the second "4" and the "8".

> What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

OK. These numbers come from an analog counter (think of an old car's mileage counter), so they are more or less "monospaced".
I could certainly use image processing to squeeze them closer together, but what makes me wonder is that psm 7 simply does the job without such hacks.

Don't get me wrong: I found a solution that works for me. All I am trying to do now is provide feedback to help make this an even better piece of software...

@embeh
Author

embeh commented Aug 23, 2024

> > PSM 8 is advertised as "single word...", isn't that what we have here?
>
> What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word.

I just did a test and manually moved the individual digits closer to each other (without changing any of the black pixels):
[image]

...and you are correct! Now I get this:

--psm 7: "4428734"
--psm 8: "4428734"

So both modes report the same correct digits once the spacing is reduced. Interesting!
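
The manual experiment above (moving the digits closer without changing any black pixels) could be automated roughly as follows. This is a minimal pure-Python sketch, not the method actually used in the thread: `squeeze_gaps` is a hypothetical helper, the image is assumed to be a list of rows with 1 for ink and 0 for background, and `max_gap` is an arbitrary choice that would need tuning on real images.

```python
def squeeze_gaps(image, max_gap=1):
    """Shrink every run of blank columns to at most `max_gap` columns.

    Inked columns are copied unchanged, so the glyph pixels themselves are
    not modified. Leading and trailing margins are limited the same way.
    """
    width = len(image[0]) if image else 0
    inked = [any(row[x] for row in image) for x in range(width)]
    keep, run = [], 0
    for x, col in enumerate(inked):
        run = 0 if col else run + 1
        # Keep all inked columns and only the first `max_gap` blanks of a run.
        if col or run <= max_gap:
            keep.append(x)
    return [[row[x] for x in keep] for row in image]
```

With a real OpenCV image the same idea applies to the array's columns. Whether the squeezed image then reads correctly under `--psm 8` still has to be checked per image.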

@amitdo
Collaborator

amitdo commented Aug 23, 2024

For psm 8 with the first image, let's say there is room for improvement...

Tesseract is very popular open source software. We get a lot of questions, bug reports and suggestions, but the team is tiny (4 people currently) and we're all volunteers.

amitdo added the digits label Aug 23, 2024
@AlexNemets

Same problem... too complex a library for simple tasks.

@EvilUbi

EvilUbi commented Jan 12, 2025

There is also a problem with numbers such as 5 008/6 002:
every other time it cannot read the number after a space, so I added special handling.

import re

def extract_numbers(text):
    # Delete all non-numeric characters except "/"
    cleaned_text = re.sub(r'[^0-9/]', '', text)
    # Search for numbers in the form XXX/YYY
    match = re.search(r'(\d+)/(\d+)', cleaned_text)
    if match:
        current = match.group(1)
        max_val = match.group(2)
        return current, max_val
    return None, None

config='--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789/'
What could be the cause of the problem?

7 participants