tesseract Optical Character Recognition and GIMP

Posted by CPUFreak91 on April 9th, 2008 filed in Project Difficulty: Easy, Time Spent: Less than 30 minutes

I discovered tesseract while looking for an alternative Optical Character Recognition (OCR) program to use with SANE. Unfortunately, tesseract can’t be used by SANE unless there’s a way to set a bunch of different TIFF save settings. So I decided to try my luck with images off the ‘net that had text in them. I took screenshots of a few images with clean-ish white backgrounds, saved them as uncompressed TIFF, and ran tesseract on them… nothing. So I searched the net. A Linux Journal article showed me all the steps one has to take to get a TIFF image to work in tesseract.

First, open the image in GIMP, go to Tools→Color Tools→Threshold, and adjust the threshold until the text looks as clear as you can make it.
Second, convert the image to Indexed mode with Image→Mode→Indexed and select the black and white (1-bit) palette.
Third, remove the alpha channel by going to Layer→Transparency→Remove Alpha Channel.
Fourth, save the image as a .TIFF without any compression.
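
If you're curious what the first two steps do to the pixels, here is a rough Python sketch. It is not GIMP's actual code, just an illustration under simple assumptions: the image is a list of rows of grayscale values from 0 to 255, and the function name and the default cutoff of 128 are made up for this example.

```python
def threshold_to_bilevel(gray_rows, cutoff=128):
    """Turn grayscale rows (0-255) into 1-bit rows: 1 = white, 0 = black.

    Hypothetical helper for illustration only; GIMP's Threshold tool
    lets you pick the cutoff interactively instead of hard-coding it.
    """
    return [[1 if value >= cutoff else 0 for value in row] for row in gray_rows]


# A tiny 2x3 "image": light background pixels with two dark text pixels.
sample = [
    [250, 240, 30],
    [245, 20, 235],
]
bilevel = threshold_to_bilevel(sample)
```

Every pixel ends up pure black or pure white, which is why a cleaner threshold gives tesseract cleaner letter shapes to work with.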

Then you can run tesseract on the .tiff image and, if the image isn’t too cluttered, you’ll see 90%+ accuracy in the recognized text!
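
For anyone who wants to script that last step, here is a minimal Python sketch that calls the tesseract command line. It assumes tesseract is installed and on your PATH, and it relies on the basic `tesseract imagename outputbase` form, where the recognized text lands in `outputbase.txt`. The function names and the file names in the usage comment are placeholders of mine, not anything from the article.

```python
import subprocess


def build_tesseract_command(image_path, output_base):
    # tesseract appends ".txt" to output_base when it writes the result.
    return ["tesseract", image_path, output_base]


def run_ocr(image_path, output_base):
    # Raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


# Example (hypothetical file names):
#   text = run_ocr("screenshot.tif", "screenshot")
```

Keeping the command in its own small function makes it easy to add options later, such as a language flag, without touching the code that reads the result.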
