tesseract Optical Character Recognition and GIMP
Posted by CPUFreak91 on April 9th, 2008. Filed in Project Difficulty: Easy, Time Spent: Less than 30 minutes
I discovered tesseract while looking for an alternative Optical Character Recognition program to use in SANE. Unfortunately, tesseract can’t be used from SANE unless there is a way to set a number of specific TIFF save settings. Instead, I decided to try my luck with images from the ‘net that had text in them. I took screenshots of a few images with clean-ish white backgrounds, saved them as uncompressed TIFFs and ran tesseract on them… nothing. So I searched the net. A Linux Journal article showed me the steps needed, using GIMP, to get a TIFF image to work in tesseract.
First, go to Tools→Color Tools→Threshold and adjust the threshold until the text looks as clear as you can make it.
Second, convert the image to Indexed mode with Image→Mode→Indexed and select the black and white (1-bit) palette.
Third, remove the alpha channel by going to Layer→Transparency→Remove Alpha Channel.
Fourth, save the image as a TIFF without any compression.
Then you can run tesseract on the .tiff image and, if the image isn’t too cluttered, you should see 90%+ accuracy in the recognized text!
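If you have more than a handful of images, the same preparation can probably be scripted instead of clicked through in GIMP. Below is a rough sketch, not the method from the Linux Journal article: it assumes Python with the Pillow library installed and the tesseract binary on your PATH. The function names, the file names ocr_input.tif and ocr_output, and the fixed threshold of 128 are all made up for illustration; in GIMP you pick the threshold by eye, so expect to tune that number per image.

```python
import subprocess
from pathlib import Path


def tesseract_command(tiff_path, output_base):
    """Build the command line: tesseract <image> <output base>.

    tesseract adds the ".txt" extension to the output base itself.
    """
    return ["tesseract", str(tiff_path), str(output_base)]


def prepare_for_ocr(src_path, tiff_path, threshold=128):
    """Threshold an image to 1-bit black and white, save as uncompressed TIFF.

    threshold=128 is an arbitrary starting point (an assumption), not a
    value taken from the original steps.
    """
    # Imported here so the rest of the module loads without Pillow installed.
    from PIL import Image

    with Image.open(src_path) as img:
        # Converting to greyscale ("L") also discards any alpha channel.
        grey = img.convert("L")
    # Pixels brighter than the threshold become white, the rest black.
    bilevel = grey.point(lambda p: 255 if p > threshold else 0, mode="1")
    # compression="raw" asks Pillow to write the TIFF without compression.
    bilevel.save(tiff_path, format="TIFF", compression="raw")
    return Path(tiff_path)


def ocr_image(src_path, workdir=".", threshold=128):
    """Prepare the image, run tesseract on it, and return the recognized text."""
    workdir = Path(workdir)
    tiff = prepare_for_ocr(src_path, workdir / "ocr_input.tif", threshold)
    base = workdir / "ocr_output"
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(tesseract_command(tiff, base), check=True)
    return base.with_suffix(".txt").read_text(encoding="utf-8")
```

Calling ocr_image("screenshot.png") would write ocr_input.tif and ocr_output.txt in the current directory and return the text. If the result is garbage, try a different threshold before blaming tesseract.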