How to Build Optical Character Recognition (OCR) in Python

Optical character recognition (OCR) is a tool that can recognize text in images. Here’s how to build an OCR engine in Python.

Written by Fahmi Nurfikri
Published on Nov. 02, 2022
A person using their phone's camera to recognize the characters in a book.
Image: Shutterstock / Built In

Optical character recognition (OCR) is a technology that recognizes text in images, such as scanned documents and photos. Perhaps you’ve taken a photo of text rather than taking notes, simply because photographing it is faster than typing it out. Fortunately, thanks to smartphones, we can apply OCR to that photo and copy the text without having to retype it.

What Is Python Optical Character Recognition (OCR)?

Python OCR is a technology that recognizes and pulls out text in images like scanned documents and photos using Python. It can be completed using the open-source OCR engine Tesseract.

We can do this in Python with just a few lines of code. One of the most commonly used OCR tools is Tesseract, an optical character recognition engine that runs on a variety of operating systems.


Python OCR Installation

Tesseract runs on Windows, macOS and Linux platforms. It supports Unicode (UTF-8) and more than 100 languages. In this article, we will start with the Tesseract OCR installation process, and test the extraction of text in images.

The first step is to install Tesseract. In order to use the Tesseract library, we need the engine installed on our system. If you’re using Ubuntu, you can simply use apt-get to install Tesseract OCR:

sudo apt-get install tesseract-ocr

For macOS users, we’ll be using Homebrew to install Tesseract.

brew install tesseract

For Windows, please see the Tesseract documentation.

Finally, install the pytesseract Python wrapper.

$ pip install pytesseract



Python OCR Implementation

After installation is complete, let’s move forward by applying Tesseract with Python. First, we import the dependencies.

from PIL import Image
import pytesseract
import numpy as np

I will use a simple image to test Tesseract.

The words "Tesseract sample" in large characters.
A sample image for Tesseract to convert into text. | Image: Fahmi Nufikri

Let’s load this image and convert it to text.

filename = 'image_01.png'
img1 = np.array(Image.open(filename))
text = pytesseract.image_to_string(img1)

Now, let’s see the result.

print(text)

And this is the result.

Result after running the OCR in Python.
Result after running the OCR in Python. | Screenshot: Fahmi Nufikri

The results obtained from Tesseract are good enough for simple images. However, real-world images are rarely this clean, so I will add noise to test Tesseract’s performance.

The words "Tesseract sample" in large characters and with a distorted background.
Sample image with noise. | Image: Fahmi Nufikri

We’ll do the same process as before.

filename = 'image_02.png'
img2 = np.array(Image.open(filename))
text = pytesseract.image_to_string(img2)
print(text)

This is the result.

No result after trying to pull text from an image with noise.
No result after trying to pull text from an image with noise. | Screenshot: Fahmi Nufikri

The result is nothing. This means that Tesseract could not read any words from the noisy image.

Next, we’ll use a little image processing to eliminate the noise in the image. Here I will use the OpenCV library, applying normalization, thresholding and image blurring.

from PIL import Image
import numpy as np
import cv2

# Load the noisy image as grayscale before cleaning it up.
img = np.array(Image.open('image_02.png').convert('L'))

norm_img = np.zeros((img.shape[0], img.shape[1]))
# Stretch pixel values to the full 0-255 range.
img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)
# Binarize: pixels above 100 become white, the rest black.
img = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY)[1]
# A light Gaussian blur smooths out any remaining speckles.
img = cv2.GaussianBlur(img, (1, 1), 0)

The result will be like this:

The sample image with noise cleaned to reveal the text.
The sample image with noise cleaned to reveal the text. | Image: Fahmi Nufikri
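To make the effect of these steps concrete, here is a minimal sketch of the same normalize-then-threshold idea in plain NumPy, run on a tiny hypothetical pixel grid (a real image would come from Image.open or cv2.imread):

```python
import numpy as np

# A tiny hypothetical grayscale patch standing in for a real image.
img = np.array([[30, 120, 200],
                [90, 140, 250]], dtype=np.float64)

# Min-max normalization to 0-255, the same idea as cv2.NORM_MINMAX.
norm = (img - img.min()) / (img.max() - img.min()) * 255

# Binary thresholding at 100, the same idea as cv2.THRESH_BINARY:
# pixels above the threshold become 255 (white), the rest 0 (black).
binary = np.where(norm > 100, 255, 0)

print(binary)
```

After thresholding, every pixel is either pure black or pure white, which is exactly the clean, high-contrast input Tesseract handles best.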

Now that the image is clean enough, we will try again with the same process as before. And this is the result.

Result revealing that the OCR picked up the text.
Result revealing that the OCR picked up the text. | Screenshot: Fahmi Nufikri

As you can see, the results match what we expected.

Video introducing the basics of how to use PyTesseract to extract text from images. | Video: ProgrammingKnowledge



Text Localization and Detection in Python OCR

With Tesseract, we can also do text localization and detection on images. First, we’ll import the dependencies we need.

from pytesseract import Output
import pytesseract
import cv2

I will use a simple image like the one in the example above to test Tesseract.

Sample image to run in the OCR.
Sample image to run in the OCR. | Image: Fahmi Nufikri

Now, let’s load this image and extract the data.

filename = 'image_01.png'
image = cv2.imread(filename)

This is different from what we did in the previous example, where we immediately converted the image into a string. This time, we’ll convert the image into a dictionary.

results = pytesseract.image_to_data(image, output_type=Output.DICT)

The following results are the contents of the dictionary.

{
'level': [1, 2, 3, 4, 5, 5, 5],
'page_num': [1, 1, 1, 1, 1, 1, 1],
'block_num': [0, 1, 1, 1, 1, 1, 1],
'par_num': [0, 0, 1, 1, 1, 1, 1],
'line_num': [0, 0, 0, 1, 1, 1, 1],
'word_num': [0, 0, 0, 0, 1, 2, 3],
'left': [0, 26, 26, 26, 26, 110, 216],
'top': [0, 63, 63, 63, 63, 63, 63],
'width': [300, 249, 249, 249, 77, 100, 59],
'height': [150, 25, 25, 25, 25, 19, 19],
'conf': ['-1', '-1', '-1', '-1', 97, 96, 96],
'text': ['', '', '', '', 'Testing', 'Tesseract', 'OCR']
}

I will not explain the purpose of every value in the dictionary. Instead, we will use the left, top, width and height values to draw a bounding box around each detected word, along with the text itself. In addition, we will use the conf value (the recognition confidence) to filter out low-confidence detections.
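As a side note, the same dictionary can also rebuild the recognized string without drawing anything. The snippet below is a small sketch using the example output shown above (the results dict is abbreviated to the two keys it needs), keeping only entries that carry a real word-level confidence:

```python
# The example image_to_data output from above, abbreviated to the
# two keys this sketch needs. Structural rows (page, block, paragraph,
# line) carry conf == -1; only rows with a non-negative confidence
# are actual words.
results = {
    'conf': ['-1', '-1', '-1', '-1', 97, 96, 96],
    'text': ['', '', '', '', 'Testing', 'Tesseract', 'OCR'],
}

words = [t for t, c in zip(results['text'], results['conf'])
         if int(c) >= 0 and t.strip()]

print(' '.join(words))  # → Testing Tesseract OCR
```

This is handy when you want both the plain text and the word geometry from a single Tesseract pass.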

Now, we will extract the bounding box coordinates of each text region from the results, keeping only detections above a confidence threshold. Here, I’ll use a threshold of 70. The code will look like this:

for i in range(0, len(results["text"])):
    # Bounding box of the current word.
    x = results["left"][i]
    y = results["top"][i]
    w = results["width"][i]
    h = results["height"][i]

    text = results["text"][i]
    conf = int(results["conf"][i])

    # Keep only confident detections.
    if conf > 70:
        # Strip non-ASCII characters so OpenCV can draw the text.
        text = "".join([c if ord(c) < 128 else "" for c in text]).strip()
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(image, text, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 200), 2)

Now that everything is set, we can display the result using this code.

cv2.imshow("Result", image)
cv2.waitKey(0)

And this is the result.

Results with box coordinates around the text.
Results with box coordinates around the text. | Image: Fahmi Nufikri

Ultimately, Tesseract is most suitable when building a document processing pipeline where images are scanned and processed. It works best on high-resolution input where the foreground text is neatly segmented from the background.

For text localization and detection, there are several parameters you can change, such as the confidence threshold. And if you find the output unattractive, you can change the thickness or color of the bounding box or text.
