tesseract fails to read simple numbers #4285
Comments
OK, the page segmentation mode seems to be the issue here. Replacing |
Why not close the issue if it's resolved? |
Well, I think psm 8 should be able to handle this, too, no? |
Hi @embeh, what kind of image processing techniques did you use? |
A few simple OpenCV filters to crop, rotate, and deskew the images and to erode some small pixel islands. I can dig up the exact commands if this helps? |
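For readers wondering what "removing small pixel islands" can look like in practice, here is a minimal sketch. It is not the poster's actual filter (they mention OpenCV erosion, and the exact commands were never posted). It swaps in a different technique, connected-component size filtering, written in pure Python on a 0/1 grid so that it runs without any dependencies. The function name `remove_small_islands` and the `min_size` parameter are made up for illustration; on real images, OpenCV's `connectedComponentsWithStats` or a morphological opening would be the practical choice.

```python
from collections import deque


def remove_small_islands(grid, min_size):
    """Return a copy of a binary grid (1 = ink, 0 = background) in which every
    4-connected ink component with fewer than min_size pixels is cleared.

    Hypothetical helper for illustration; not taken from the original thread.
    """
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    out = [row[:] for row in grid]          # work on a copy, leave input intact
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component of ink.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Components below the size threshold are treated as noise.
            if len(component) < min_size:
                for y, x in component:
                    out[y][x] = 0
    return out
```

Unlike erosion, this leaves the strokes of the digits untouched and only drops specks below the size threshold, which may matter when the digits themselves are thin.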
Try increasing the contrast between the numbers and the background to make them more distinct. This might help. No need for the commands; I was just interested in the processing you had already done. |
I don't really understand the motivation. If, for the same pixels, psm 7 works fine but psm 8 does not, why would a change in the image processing make a difference? In addition, the contrast is as high as it can be: the background is pure white and the text is fully black, i.e. it is a binary image. Any grey you might see is only due to how GitHub renders the image. |
What we have here is a line with several digits separated by spaces. IMO, there is no good reason to consider this line as one word. If it was 'H e l l o' then you could call it a word, but inside a text line, Tesseract will still consider any big enough horizontal white space as a word separator. |
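To make the distinction concrete: `--psm 7` tells Tesseract to treat the image as a single text line, while `--psm 8` tells it to treat the image as a single word. A minimal sketch of switching between the two from pytesseract follows, assuming the same whitelist idea as in the original report. The helper names `build_tesseract_config` and `read_counter` are made up for illustration, and `read_counter` needs both pytesseract and a `tesseract` binary installed.

```python
def build_tesseract_config(psm=7, whitelist="0123456789", dpi=None):
    """Assemble the option string passed to pytesseract's `config` argument.

    Hypothetical helper; the flags themselves (--dpi, --psm,
    -c tessedit_char_whitelist=...) are the ones used in the issue report.
    """
    parts = []
    if dpi is not None:
        parts.append(f"--dpi {dpi}")
    parts.append(f"--psm {psm}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def read_counter(img, psm=7):
    """Run OCR on an already preprocessed image and return the stripped text."""
    import pytesseract  # imported lazily so the config helper works without it

    config = build_tesseract_config(psm=psm)
    return pytesseract.image_to_string(img, lang="eng", config=config).strip()
```

Calling `read_counter(img, psm=7)` and `read_counter(img, psm=8)` on the same file is a quick way to reproduce the comparison discussed in this thread.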
Tesseract 4.1.1 is too old and we don't support it. You said you get a better result with psm 7, but you didn't provide the output with this psm. |
OK. Unfortunately that seems to be the latest offered by the default Ubuntu repository (and pytesseract?).
--psm 7 produces the output "4428734". Both were run on the identical image file. So the result is not completely wrong, and it does not seem to force the result into multiple words or the like. It just messes up the "4" and the "8".
OK. These numbers come from an analog counter (think of an old car's mileage counter), so they are rather "monospaced". Don't get me wrong: I found a solution that works for me; now all I am trying to do is provide feedback to help make this an even better piece of software... |
For psm 8 with the first image, let's say there is room for improvement... Tesseract is very popular open source software. We get a lot of questions, bug reports and suggestions, but the team is tiny (4 people currently) and we're all volunteers. |
Same problem... too complex a library for simple tasks |
There is also a problem with numbers, for example 5 008/6 002:

def extract_numbers(text):
    # Delete all non-numeric characters except "/"
    cleaned_text = re.sub(r'[^0-9/]', '', text)
    # Search for numbers in XXX/YYY format
    match = re.search(r'(\d+)/(\d+)', cleaned_text)

config='--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789/'
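The snippet in the comment above stops before the function returns anything. A minimal completed sketch follows, assuming the intent is to return the two numbers around the slash as integers, or `None` when no `XXX/YYY` pattern is found. That return convention is an assumption, not something the commenter stated.

```python
import re


def extract_numbers(text):
    """Return (left, right) as ints from OCR text such as '5 008/6 002'.

    Returns None if no 'digits/digits' pattern is found. The return values
    are an assumption; the original snippet did not show them.
    """
    # Delete all non-numeric characters except "/". This also joins digit
    # groups that the OCR split with spaces, e.g. "5 008" becomes "5008".
    cleaned_text = re.sub(r'[^0-9/]', '', text)
    # Search for numbers in XXX/YYY format.
    match = re.search(r'(\d+)/(\d+)', cleaned_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Note that stripping spaces before matching is what makes a reading like "5 008/6 002" come out as one pair of numbers. The trade-off is that two separate numbers on the same line would be merged as well.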
Current Behavior
I am using pytesseract (which calls /usr/bin/tesseract) to recognize the numbers on a gas meter. Unfortunately, this very often fails to read most of the digits and is very unreliable.
The actual command to get the number string from the image is
pytesseract.image_to_string(img, lang='eng', config='--dpi 70 --psm 8 -c tessedit_char_whitelist=,0123456789')
Here is an example image (after some image processing):
When running this through tesseract (as described above), I just get "2734"... :-(
Any ideas how to improve this, given that there never will be anything but numbers from 0-9 in the image...?
Expected Behavior
Correctly read the numbers. For the image example, this should be "4428734"
Suggested Fix
No response
tesseract -v
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Operating System
No response
Other Operating System
Ubuntu 20
uname -a
Linux myhost 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux
Compiler
No response
CPU
No response
Virtualization / Containers
Ubuntu in WSL2
Other Information
No response