June 15th, 2018 (back)

Using an FFT to align text for OCR purposes

This is a repost from my previous blog.

At some point I had a hacker-level understanding of the Fourier Transform--that is, I was able to put together programs that exploit properties of the FFT that I had learned through casual reading and the occasional "ah-hah!" moments.

In this post I am going to provide a very crude example using the FFT to expose geometric properties of an image by simply looking at its spatial domain. More so, we are going to utilize these properties to rotate text that is nearly suitable for OCR into a reasonable orientation. We need to first simplify the problem a bit by specifying our domain of imagery. Let's assume that the text that we are dealing with has already been filtered and is fairly noiseless binarized data.

This image is oriented correctly, so we will use it to estimate the error of the result from our program. We need to introduce some error in the form of orientation. To create this error, I am going to add a +45 deg rotation to the original image. There is one note that I want to make about this image. If we were to strictly rotate the original image about its origin, then there would be missing pixels due to the artifacts of having missing data. In this case the external program (Gimp) just interpolated the pixels creating a nice and clean looking rotation.
First, let's look at magnitude of the FFT.

import sys
import numpy as np
from PIL import Image

binarized_text = ...

# Binarized (1-bit image)
data = np.array(Image.open(binarized_text))
fft = np.fft.fft2(data)

Lucky for us the dynamic range of the Fourier coefficients with a dominantly black image provides a means of displaying data without a logarithmic transform. However, we will still do this for sake of completeness and fun in a few steps.

What immediate characteristic jumps out from the image? You can see that the main values lie on a diagonal, indicating that the text lines in the input image are also on a diagonal. We can get a better idea of the diagonal by filtering the resulting FFT by clamping down the lower 25% of the signal to 0. Since we are treating 25% as a thresholding value, then it might need some finer tuning depending on the image quality and types of text that you may want to process. However, we really only care about those major peaks along the axis of rotation. An educated guess would most likely point to the corners of the diagonal to be the location of the peaks since we have the source image for reference in its spatial orientation. We mentioned in the previous paragraph that the dynamic range already creates a presentable image, but this is not often the case. Let's log-scale the image so we're dealing with something in uint8 range.

max_peak = np.max(np.abs(fft))

# Threshold the lower 25% of the peak
fft[fft < (max_peak * 0.25)] = 0

# Log-scale the data
abs_data = 1 + np.abs(fft)
c = 255.0 / np.log(1 + max_peak)
log_data = c * np.log(abs_data)

That practically eliminated all the noise. Now, we want to really concentrate on the remaining peaks. So, let's find the new max and only look at pixels greater than 90% of that max. This is another threshold, which means you may have to tune this parameter as well. At this point we have a lower and upper threshold for our function parameters.

# Find two points within 90% of the max peak of the scaled image
max_scaled_peak = np.max(log_data)

# Determine the angle of two high-peak points in the image
rows, cols = np.where(log_data > (max_scaled_peak * 0.90))

Great. Now we have a collection of rows and columns that match our criteria. Let's find the min and max extent of the data that exist between our rows and cols arrays. We will use these min and max values to compute the two furthest pixels. These pixels will define the orientation of the original image.

min_col, max_col = np.min(cols), np.max(cols)
min_row, max_row = np.min(rows), np.max(rows)
dy, dx = max_col - min_col, max_row - min_row
theta = np.arctan(dy / float(dx))

Now we have our theta which means that we can rotate the original image. Let's just crudely rotate the image pixel by pixel.

# Translate and scale the image by the value we found
width, height = data.shape
cx, cy = width / 2, height / 2
new_image = np.zeros(data.shape)
for x, row in enumerate(data):
    for y, value in enumerate(row):
        xp = cx + (x - cx) * cos_theta - (y - cy) * sin_theta
        yp = cy + (x - cx) * sin_theta + (y - cy) * cos_theta
        if xp < 0 or yp < 0 or xp > width or yp > height:
        new_image[xp, yp] = data[x, y]

Now, let's look at the result!

That was pretty contrived though. Let's look at some real world data. In this example we use slightly rotated text with a lot of data and attempt some basic thresholding values to get a decent rotation. It's not perfect, but it only required 30s of parameter tuning.