Deep Dive into OCR – Chapter 2: Development and Classification of OCR Technology
This article provides a comprehensive overview of OCR technology, detailing the evolution from traditional hand‑crafted methods to modern deep‑learning approaches, describing image preprocessing, text detection and recognition pipelines, summarizing classic machine‑learning algorithms, and presenting a practical OpenCV implementation with Python code.
After several months of preparation, the author is launching a new series, Deep Dive into OCR, which aims to cover OCR technology end to end, from its history and core concepts through algorithms, papers, and datasets, forming a complete tutorial.
Article Directory
Chapter 1: OCR Technology Introduction (link)
Chapter 2: OCR Technology Development and Classification (this article)
OCR Technology Development Overview
Generally, OCR can be divided into traditional methods and deep‑learning methods. Traditional methods are limited by hand‑crafted features and complex pipelines, while deep‑learning OCR replaces manual steps with CNN models that automatically detect text regions and recognize characters with superior accuracy.
The author summarizes the development timeline in the following diagram:
1. Traditional OCR
Traditional OCR algorithms rely on image‑processing techniques (e.g., projection, dilation, rotation) and statistical machine‑learning to extract text from simple, high‑resolution documents with uniform backgrounds.
1.1 Technical Process
The workflow includes image preprocessing (grayscale, binarization, noise removal, skew correction), layout analysis, character segmentation, recognition, layout reconstruction, post‑processing, and proofreading.
1.1.1 Image Preprocessing
(1) Binarization
Image binarization converts pixel values to 0 or 255, producing a clear black‑and‑white image that reduces data dimensionality and suppresses noise, which is crucial for OCR accuracy.
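As an illustration (not from the original article), global thresholding can be sketched in a few lines of NumPy; the threshold value here is an arbitrary choice, and in practice adaptive methods such as Otsu's are preferred:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Global thresholding: pixels brighter than the threshold become 255, the rest 0."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# toy 2x3 "image": dark ink on a light background
img = np.array([[20, 200, 30],
                [220, 25, 210]], dtype=np.uint8)
print(binarize(img))
```

OpenCV's `cv2.threshold` performs the same operation (and with the `THRESH_OTSU` flag picks the threshold automatically), which is what the practical example at the end of this article uses.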
(2) Skew Detection and Correction
Hough Transform is used to detect straight lines in the image, enabling the estimation of skew angles.
PCA‑based Method computes the principal component of foreground pixels to determine the dominant orientation.
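The PCA idea can be illustrated with a minimal NumPy sketch (an assumption of this article's editor, not the author's code): treat the foreground pixel coordinates as a point cloud and read the skew angle off the leading eigenvector of their covariance matrix.

```python
import numpy as np

def estimate_skew_pca(binary):
    """Estimate the dominant text-line orientation in degrees.

    binary: 2-D array whose nonzero pixels are foreground (ink)."""
    ys, xs = np.nonzero(binary)
    coords = np.column_stack([xs, ys]).astype(float)
    coords -= coords.mean(axis=0)               # centre the pixel cloud
    cov = np.cov(coords, rowvar=False)          # 2x2 covariance of (x, y)
    eigvals, eigvecs = np.linalg.eigh(cov)
    vx, vy = eigvecs[:, np.argmax(eigvals)]     # principal axis
    if vx < 0:                                  # resolve the eigenvector sign ambiguity
        vx, vy = -vx, -vy
    return float(np.degrees(np.arctan2(vy, vx)))

# a perfectly horizontal "text line" should give ~0 degrees
line = np.zeros((20, 100))
line[10, :] = 1
print(round(estimate_skew_pca(line), 1))  # prints 0.0
```

Once the angle is known, the image can be rotated by its negative (e.g., with `cv2.warpAffine`) to deskew it.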
1.1.2 Traditional Text Detection and Recognition
Traditional OCR separates text detection (locating text regions) and recognition (classifying characters). Detection methods include salient‑feature‑based and sliding‑window approaches.
Traditional detection struggles with complex scenes such as heavily distorted or blurry text.
1.2 Traditional Machine‑Learning OCR Methods
After locating text regions and correcting skew, characters are segmented and fed into feature extraction (hand‑crafted or CNN features) followed by a classification model. Post‑processing often uses statistical language models (e.g., HMM) for error correction.
1.2.1 Feature Extraction Methods
Structural Features: contour and region descriptors (e.g., Canny, HOG, Sobel).
Geometric Distribution Features: capture shape information via projection histograms, 2‑D histograms, and grid‑based methods.
Template Matching: computes similarity between a query image and a library of character templates.
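Template matching can be sketched as follows (a minimal illustration with hypothetical 3×3 glyphs, not the article's own templates): score the query against every template with normalized cross‑correlation and keep the best match.

```python
import numpy as np

def match_score(a, b):
    """Normalized cross-correlation between two same-sized glyph images."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def classify(query, templates):
    """Return the label of the best-matching template."""
    return max(templates, key=lambda label: match_score(query, templates[label]))

# hypothetical 3x3 glyph templates: a vertical bar and a horizontal bar
templates = {
    "|": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "-": np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
}
query = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
print(classify(query, templates))  # prints "|"
```

Real systems normalize glyph size first and use far larger template libraries; OpenCV exposes the same idea as `cv2.matchTemplate`.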
1.2.2 Traditional Classification Methods
After feature extraction, characters are classified using various algorithms:
Support Vector Machine (SVM): effective for small samples and high‑dimensional data.
Bayesian Classifier: predicts class probabilities using Bayes' theorem.
K‑Nearest Neighbors (KNN): simple, non‑parametric method based on majority voting of nearest samples.
Multilayer Perceptron (MLP): feed‑forward neural network that handles non‑linear problems.
Neural Network Algorithms: either feed the raw pixel matrix directly or use extracted features as input.
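To make the classification step concrete, here is a minimal KNN sketch over toy 2‑D feature vectors (the features and labels are invented for illustration; real systems would use the hand‑crafted descriptors above):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify feature vector x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# toy 2-D features (e.g. stroke density, aspect ratio) for two character classes
X_train = np.array([[0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
                    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15]])
y_train = ["1", "1", "1", "0", "0", "0"]
print(knn_predict(np.array([0.12, 0.88]), X_train, y_train))  # prints "1"
```

The same interface generalizes to the other classifiers listed: swap `knn_predict` for an SVM or MLP trained on the same feature vectors.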
2. Deep‑Learning OCR
With the rapid development of deep learning, OCR has shifted from hand‑crafted pipelines to end‑to‑end CNN‑based models that automatically learn visual features, greatly improving recognition performance.
2.1 Technical Pipeline
Image Preprocessing: grayscale, binarization, denoising, skew correction, normalization.
Text Detection: models such as CTPN, EAST, SegLink, TextBoxes, R2CNN, PixelLink, PSENet.
Text Recognition: models such as CRNN, Attention‑OCR.
Post‑Processing: language models, dictionaries, rules, and layout reconstruction.
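The dictionary‑based post‑processing step can be sketched with plain edit distance (an editorial illustration with an invented vocabulary, not the article's method): replace each recognized word with its closest dictionary entry.

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def correct(word, dictionary):
    """Replace a raw OCR token with the closest dictionary entry."""
    return min(dictionary, key=lambda w: edit_distance(word, w))

vocab = ["invoice", "total", "amount"]
print(correct("t0tal", vocab))  # prints "total"
```

Production systems weight candidates with a language model rather than picking the raw nearest word, but the correction principle is the same.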
2.2 Deep‑Learning Text Detection and Recognition
OCR algorithms can be two‑stage (separate detection and recognition) or end‑to‑end (single model handling both).
2.2.1 Deep‑Learning Text Detection
Detection models have evolved from regression‑based to segmentation‑based approaches, and can be categorized as top‑down or bottom‑up.
2.2.2 Deep‑Learning Text Recognition
The mainstream recognition pipeline includes image preprocessing, visual feature extraction, sequence modeling, and prediction.
Recognition methods are classified as:
CTC‑based (e.g., CRNN, Rosetta)
Attention‑based (e.g., RARE, DAN, PREN)
Transformer‑based (e.g., SRN, NRTR, Master, ABINet)
Rectification modules (e.g., RARE, ASTER, SAR)
Segmentation‑based (e.g., Text Scanner, Mask TextSpotter)
| Algorithm Category | Main Idea | Key Papers |
| --- | --- | --- |
| Traditional | Sliding window, character extraction, dynamic programming | – |
| CTC | Sequence‑to‑sequence alignment without explicit segmentation | CRNN, Rosetta |
| Attention | Focus on relevant regions for irregular text | RARE, DAN, PREN |
| Transformer | Self‑attention based modeling | SRN, NRTR, Master, ABINet |
| Rectification | Learn text boundaries and rectify to horizontal orientation | RARE, ASTER, SAR |
| Segmentation | Detect character regions then classify | Text Scanner, Mask TextSpotter |
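The CTC decoding rule from the table can be illustrated without any model: given the per‑frame best labels emitted by a recognizer, merge repeated symbols and then drop the blank token.

```python
def ctc_collapse(path, blank="-"):
    """CTC decoding rule: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

# per-frame best labels from a recognizer sliding over the word "cat"
print(ctc_collapse("cc--aa-t-"))  # prints "cat"
```

Note the role of the blank: `"ca-at"` collapses to `"caat"`, so genuine double letters survive while stuttered frames of the same character are merged.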
2.3 End‑to‑End Natural‑Scene Detection and Recognition
End‑to‑end OCR models jointly learn detection and recognition, sharing CNN features and achieving smaller model size and faster inference.
Two major categories exist:
Regular text (straight or slightly tilted) – e.g., FOTS, TextSpotter.
Arbitrary‑shape text (curved, distorted) – e.g., Mask TextSpotter, ABCNet, PGNet, PAN++.
3. Practical Traditional OCR with OpenCV
The script below locates a document in a photo, perspective‑corrects and binarizes it, then hands the result to Tesseract:

```python
import os

import cv2
import numpy as np
import pytesseract
from PIL import Image


def ShowImage(name, image):
    cv2.imshow(name, image)
    cv2.waitKey(0)  # wait for any key press
    cv2.destroyAllWindows()


def order_points(pts):
    # order the four corners: top-left, top-right, bottom-right, bottom-left
    rect = np.zeros((4, 2), dtype="float32")
    s = pts.sum(axis=1)
    rect[0] = pts[np.argmin(s)]     # top-left has the smallest x + y
    rect[2] = pts[np.argmax(s)]     # bottom-right has the largest x + y
    diff = np.diff(pts, axis=1)
    rect[1] = pts[np.argmin(diff)]  # top-right has the smallest y - x
    rect[3] = pts[np.argmax(diff)]  # bottom-left has the largest y - x
    return rect


def four_point_transform(image, pts):
    rect = order_points(pts)
    (tl, tr, br, bl) = rect
    # output size: the longest opposite-edge lengths of the quadrilateral
    widthA = np.sqrt(((br[0] - bl[0]) ** 2) + ((br[1] - bl[1]) ** 2))
    widthB = np.sqrt(((tr[0] - tl[0]) ** 2) + ((tr[1] - tl[1]) ** 2))
    maxWidth = max(int(widthA), int(widthB))
    heightA = np.sqrt(((tr[0] - br[0]) ** 2) + ((tr[1] - br[1]) ** 2))
    heightB = np.sqrt(((tl[0] - bl[0]) ** 2) + ((tl[1] - bl[1]) ** 2))
    maxHeight = max(int(heightA), int(heightB))
    dst = np.array([
        [0, 0],
        [maxWidth - 1, 0],
        [maxWidth - 1, maxHeight - 1],
        [0, maxHeight - 1]
    ], dtype="float32")
    M = cv2.getPerspectiveTransform(rect, dst)
    return cv2.warpPerspective(image, M, (maxWidth, maxHeight))


def resize(image, width=None, height=None, inter=cv2.INTER_AREA):
    (h, w) = image.shape[:2]
    if width is None and height is None:
        return image
    if width is None:
        r = height / float(h)
        dim = (int(w * r), height)
    else:
        r = width / float(w)
        dim = (width, int(h * r))
    return cv2.resize(image, dim, interpolation=inter)


image = cv2.imread('ocr1.png')
ratio = image.shape[0] / 500.0
orig = image.copy()
image = resize(image, height=500)

# preprocessing: grayscale, blur, edge detection
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)
edged = cv2.Canny(gray, 75, 200)
ShowImage('edged', edged)

# contour detection: keep the largest quadrilateral (the document outline)
cnts, _ = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
cnts = sorted(cnts, key=cv2.contourArea, reverse=True)[:5]
screenCnt = None
for c in cnts:
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.02 * peri, True)
    if len(approx) == 4:
        screenCnt = approx
        break
cv2.drawContours(image, [screenCnt], -1, (0, 0, 255), 2)
ShowImage('image', image)

# perspective-correct the original full-resolution image and binarize it
warped = four_point_transform(orig, screenCnt.reshape(4, 2) * ratio)
ShowImage('warped', warped)
warped = cv2.cvtColor(warped, cv2.COLOR_BGR2GRAY)
ref = cv2.threshold(warped, 100, 255, cv2.THRESH_BINARY)[1]
ShowImage('binary', ref)

# run Tesseract on the binarized scan
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, ref)
text = pytesseract.image_to_string(Image.open(filename))
print(text)
os.remove(filename)
ShowImage('image', ref)
```

Resulting OCR outputs are shown in the following images:
If you find this article helpful, please consider following, liking, and bookmarking the public account.