Optical character recognition (OCR) is a technology that recognizes text in images, such as scanned documents and photos. Perhaps you’ve photographed some text because it was faster than taking notes or typing it out. Fortunately, thanks to today’s smartphones, we can apply OCR to copy text out of a photo we took earlier without having to retype it.
What Is Python Optical Character Recognition (OCR)?
We can do this in Python using a few lines of code. One of the most commonly used OCR tools is Tesseract, an optical character recognition engine that runs on a variety of operating systems.
Python OCR Installation
Tesseract runs on Windows, macOS and Linux platforms. It supports Unicode (UTF-8) and more than 100 languages. In this article, we will start with the Tesseract OCR installation process, and test the extraction of text in images.
The first step is to install Tesseract. In order to use the Tesseract library, we need to install the engine on our system. If you’re using Ubuntu, you can simply use apt-get
to install Tesseract OCR:
sudo apt-get install tesseract-ocr
For macOS users, we’ll be using Homebrew to install Tesseract:
brew install tesseract
For Windows, please see the Tesseract documentation.
Let’s begin by getting pytesseract installed.
$ pip install pytesseract
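Before moving on, it can be worth confirming that the tesseract engine itself is reachable: pytesseract is only a wrapper and will fail at runtime if the binary installed above is not on the PATH. A minimal sketch of such a check (the helper name is my own, not part of pytesseract):

```python
import shutil

def tesseract_available():
    """Return True if the tesseract binary is on the PATH.

    pytesseract only wraps the command-line tool; the engine must be
    installed separately (apt-get, brew, or the Windows installer).
    """
    return shutil.which("tesseract") is not None

print("tesseract installed:", tesseract_available())
```

If this prints False, revisit the installation step for your platform before continuing.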
Python OCR Implementation
After installation is completed, let’s move forward by applying Tesseract with Python. First, we import the dependencies.
from PIL import Image
import pytesseract
import numpy as np
I will use a simple image to test Tesseract.
Let’s load this image and convert it to text.
filename = 'image_01.png'
img1 = np.array(Image.open(filename))
text = pytesseract.image_to_string(img1)
Now, let’s see the result.
print(text)
And this is the result.
The results obtained from Tesseract are good enough for simple images. However, real-world images are rarely this clean, so I will add noise to the image to test Tesseract’s performance.
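The article doesn’t show how the noisy test image was produced. If you want to create one yourself, a common approach is salt-and-pepper noise, sketched here in NumPy (the function name and the noise amount are my own choices, not from the original):

```python
import numpy as np

def add_salt_pepper_noise(img, amount=0.05, seed=None):
    """Flip a random fraction of pixels to pure black or pure white.

    `amount` is the fraction of pixels to corrupt; half become white
    (salt), half become black (pepper).
    """
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    mask = rng.random(img.shape[:2]) < amount   # which pixels to corrupt
    salt = rng.random(img.shape[:2]) < 0.5      # salt vs. pepper split
    noisy[mask & salt] = 255
    noisy[mask & ~salt] = 0
    return noisy
```

Saving the output of this function over a clean text image gives a noisy test case similar to the one used below.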
We’ll do the same process as before.
filename = 'image_02.png'
img2 = np.array(Image.open(filename))
text = pytesseract.image_to_string(img2)
print(text)
This is the result.
The result is nothing: Tesseract cannot read the words in an image with this much noise.
Next, we’ll use a little image processing to eliminate the noise in the image. Here I will use the OpenCV library; in this experiment, I’m applying normalization, thresholding and a Gaussian blur.
import cv2
import numpy as np

img = img2  # the noisy image we loaded earlier
# Stretch pixel values to the full 0-255 range
norm_img = np.zeros((img.shape[0], img.shape[1]))
img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)
# Binarize: pixels above 100 become white, the rest black
img = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY)[1]
# Light Gaussian blur to smooth out remaining speckle
img = cv2.GaussianBlur(img, (1, 1), 0)
The result will be like this:
Now that the image is clean enough, we will try again with the same process as before. And this is the result.
As you can see, the results are in accordance with what we expect.
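For readers who want to understand what the normalization and thresholding steps actually do, here is a rough plain-NumPy approximation of those two OpenCV calls. This is a sketch for illustration only (it omits the blur and is not a drop-in replacement for cv2):

```python
import numpy as np

def preprocess(img):
    """Approximate cv2.normalize(..., NORM_MINMAX) followed by
    cv2.threshold(..., 100, 255, THRESH_BINARY)."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    # Min-max normalization: stretch pixel values to the full 0-255 range
    norm = ((img - lo) * 255.0 / max(hi - lo, 1)).astype(np.uint8)
    # Binary threshold at 100, matching the OpenCV version above
    return np.where(norm > 100, 255, 0).astype(np.uint8)
```

The binarization is what does most of the work here: light noise that stays below the threshold is pushed to solid black or white, leaving cleaner glyph shapes for Tesseract.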
Text Localization and Detection in Python OCR
With Tesseract, we can also do text localization and detection in images. First, we import the dependencies we need.
from pytesseract import Output
import pytesseract
import cv2
I will use a simple image, like the example above, to test Tesseract.
Now, let’s load this image and extract the data.
filename = 'image_01.png'
image = cv2.imread(filename)
This is different from what we did in the previous example. In the previous example we immediately changed the image into a string. In this example, we’ll convert the image into a dictionary.
results = pytesseract.image_to_data(image,
output_type=Output.DICT)
The following results are the contents of the dictionary.
{
'level': [1, 2, 3, 4, 5, 5, 5],
'page_num': [1, 1, 1, 1, 1, 1, 1],
'block_num': [0, 1, 1, 1, 1, 1, 1],
'par_num': [0, 0, 1, 1, 1, 1, 1],
'line_num': [0, 0, 0, 1, 1, 1, 1],
'word_num': [0, 0, 0, 0, 1, 2, 3],
'left': [0, 26, 26, 26, 26, 110, 216],
'top': [0, 63, 63, 63, 63, 63, 63],
'width': [300, 249, 249, 249, 77, 100, 59],
'height': [150, 25, 25, 25, 25, 19, 19],
'conf': ['-1', '-1', '-1', '-1', 97, 96, 96],
'text': ['', '', '', '', 'Testing', 'Tesseract', 'OCR']
}
I will not explain the purpose of each value in the dictionary. Instead, we will use the left, top, width and height values to draw a bounding box around each piece of text, along with the text itself. In addition, we will use the conf
key to filter out low-confidence detections.
Now, we will extract the bounding box coordinates of the text regions from the result, keeping only detections whose confidence exceeds the threshold we specify. Here, I’ll use a threshold of 70. The code will look like this:
for i in range(0, len(results["text"])):
    x = results["left"][i]
    y = results["top"][i]
    w = results["width"][i]
    h = results["height"][i]
    text = results["text"][i]
    conf = int(results["conf"][i])
    if conf > 70:
        # Strip non-ASCII characters so cv2.putText can render the text
        text = "".join([c if ord(c) < 128 else "" for c in text]).strip()
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image, text, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 200), 2)
Now that everything is set, we can display the results using this code.
cv2.imshow("Result", image)
cv2.waitKey(0)
And this is the result.
Ultimately, Tesseract is best suited to document-processing pipelines where images are scanned and processed. It works best on high-resolution input where the foreground text is neatly segmented from the background.
For text localization and detection, there are several parameters you can change, such as the confidence threshold. And if you find the output unattractive, you can change the thickness or color of the bounding boxes or the text.
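If you experiment with the threshold often, the confidence filtering from the loop above can be factored into a small helper. This is a sketch of my own (words_above is not part of pytesseract); it works directly on the dictionary returned by image_to_data:

```python
def words_above(results, min_conf=70):
    """Collect (text, (x, y, w, h)) pairs above a confidence threshold.

    `results` is the dictionary returned by pytesseract.image_to_data
    with output_type=Output.DICT.
    """
    kept = []
    for i, text in enumerate(results["text"]):
        conf = float(results["conf"][i])  # conf entries may be strings like '-1'
        if conf > min_conf and text.strip():
            box = (results["left"][i], results["top"][i],
                   results["width"][i], results["height"][i])
            kept.append((text, box))
    return kept
```

Applied to the dictionary shown earlier, this keeps only the three word-level entries ('Testing', 'Tesseract', 'OCR') and drops the page/block/paragraph rows, whose confidence is -1.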