Document OCR: From Computer Vision Fundamentals to Ctrip's Full-Text OCR Implementation

This article explains the evolution of optical character recognition, outlines the complete OCR processing pipeline—including image input, preprocessing, binarization, noise removal, tilt correction, layout analysis, character segmentation, recognition, and post‑processing—while showcasing Ctrip's real‑world OCR project, its architecture, accuracy metrics, and key computer‑vision techniques such as CNN, HSV, HOG, LBP, and Haar features.

Ctrip Technology

May 2, 2018

Optical Character Recognition (OCR) is the process of analyzing image files to extract textual and layout information, typically involving steps such as image input, preprocessing (binarization, noise removal, tilt correction), layout analysis, character segmentation, recognition, layout restoration, and post‑processing.

The Ctrip document OCR project aims to recognize Chinese and English text on ID cards, passports, train tickets, visas, and similar documents. Its milestones run from 2016 (client-side ID and passport recognition) to 2018 (real-time in-app recognition of multiple document types). Reported error rates are below 0.5% for digits, with a false-acceptance rate of around 1-3% for Chinese text.

Key knowledge includes computer‑vision basics, the HSV color model, grayscale imaging, and deep‑learning models based on convolutional neural networks (CNN) combined with RNN/CTC or attention mechanisms.
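
As a small illustration of the HSV color model and grayscale conversion mentioned above, the following Python sketch converts a single RGB pixel. It uses only the standard library. The function names and the BT.601 luma weights are illustrative assumptions, not details taken from Ctrip's implementation.

```python
import colorsys


def rgb_to_gray(r, g, b):
    """Weighted grayscale (assumed ITU-R BT.601 luma weights); 8-bit in, 8-bit out."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def rgb_to_hsv_degrees(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation 0..1, value 0..1)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v


if __name__ == "__main__":
    # Pure red: hue 0 degrees, fully saturated, full brightness.
    print(rgb_to_gray(255, 0, 0), rgb_to_hsv_degrees(255, 0, 0))
```

HSV separates color (hue) from brightness (value), which is one reason it is often preferred over raw RGB when isolating colored regions of a document, while grayscale keeps only intensity for later binarization.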

The article discusses image binarization techniques (fixed threshold, adaptive threshold, Otsu's method, and pooling), then illustrates the OCR architecture and implementation details: detection (guided vs. unguided), rejection handling using histogram equalization, and client-side or front-end processing.
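
To make two of the binarization techniques concrete, here is a minimal Python sketch of fixed-threshold binarization and Otsu's method on a flat list of 8-bit gray values. It is a generic textbook formulation, assuming dark text on a light background. The function names are hypothetical and it is not the code used in the Ctrip project.

```python
def binarize(pixels, threshold):
    """Fixed-threshold binarization: values above the threshold become 255, others 0."""
    return [255 if p > threshold else 0 for p in pixels]


def otsu_threshold(pixels):
    """Pick the threshold that maximizes the between-class variance (Otsu's method).

    Pixels <= t form the dark class and pixels > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of intensities in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


if __name__ == "__main__":
    gray = [12, 15, 20, 18, 200, 210, 220, 205]
    t = otsu_threshold(gray)
    print(t, binarize(gray, t))
```

A fixed threshold is cheap but sensitive to lighting. Otsu's method derives the threshold from the image's own histogram, so it adapts to overall brightness, though it still applies a single global value.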

Detection leverages prior knowledge (e.g., face positions or document edges) together with deep-learning models. Rejection handling uses histogram equalization and binary search; the following code shows a recursive binary search:

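def binary_search(arr, start, end, hkey):
    # Recursive binary search over the sorted range arr[start..end] (inclusive).
    # Returns the index of hkey, or -1 if it is not present.
    if start > end:
        return -1
    # Integer division is required: "/" yields a float in Python 3,
    # which cannot be used as a list index.
    mid = start + (end - start) // 2
    if arr[mid] > hkey:
        return binary_search(arr, start, mid - 1, hkey)
    if arr[mid] < hkey:
        return binary_search(arr, mid + 1, end, hkey)
    return mid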
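
Histogram equalization, the other technique named for rejection handling, can be sketched in a few lines of Python. This is the standard cumulative-distribution formulation on a flat list of 8-bit gray values. How Ctrip applies it when deciding whether to reject an image is not described in the source, so the function name and usage here are assumptions.

```python
def equalize_histogram(pixels):
    """Stretch 8-bit gray levels so their cumulative distribution is roughly linear."""
    if not pixels:
        return []

    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution function (as pixel counts).
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)  # CDF value at the darkest level present
    if total == cdf_min:
        return list(pixels)                  # constant image: nothing to stretch

    # Lookup table mapping each old gray level to its equalized level.
    scale = 255.0 / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [lut[p] for p in pixels]


if __name__ == "__main__":
    low_contrast = [100, 100, 110, 120]
    print(equalize_histogram(low_contrast))
```

After equalization, the darkest level present maps to 0 and the brightest to 255, which makes low-contrast captures easier to threshold or to compare against a quality cutoff.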

Text detection distinguishes guided (using priors) from unguided (pure attention‑based) approaches, and the system incorporates face‑recognition components from the open‑source SeetaFaceEngine.

Text recognition employs grayscale projection, binarization, down‑sampling, and feature weighting (HOG, LBP, Haar). The projection algorithm is shown below:

#include <opencv2/core/core_c.h> // legacy OpenCV C API; header path may vary by OpenCV version

// Builds vertical and horizontal projection images from a binarized image.
// pBinaryImg must be a loaded, single-channel image in which text pixels are 0.
// The caller releases both output images with cvReleaseImage().
void buildProjections(const IplImage* pBinaryImg,
                      IplImage** ppVerticImg, IplImage** ppHorizImg)
{
    int x, y;      // pixel coordinates
    CvScalar s, t; // matrix elements during projection
    int* v = new int[pBinaryImg->width]();  // vertical projection, zero-initialized
    int* h = new int[pBinaryImg->height](); // horizontal projection, zero-initialized

    // count black pixels per column (v) and per row (h) in a single pass
    for(y=0;y<pBinaryImg->height;y++) {
        for(x=0;x<pBinaryImg->width;x++) {
            s = cvGet2D(pBinaryImg, y, x);
            if(s.val[0]==0) { v[x]++; h[y]++; }
        }
    }

    // create projection images
    IplImage* pVerticImg = cvCreateImage(cvGetSize(pBinaryImg), 8, 1);
    IplImage* pHorizImg = cvCreateImage(cvGetSize(pBinaryImg), 8, 1);
    cvZero(pVerticImg);
    cvZero(pHorizImg);

    // draw each column's count as a white bar growing down from the top edge
    for(x=0;x<pBinaryImg->width;x++) {
        for(y=0;y<v[x];y++) {
            t = cvGet2D(pVerticImg, y, x);
            t.val[0] = 255;
            cvSet2D(pVerticImg, y, x, t);
        }
    }
    // draw each row's count as a white bar growing right from the left edge
    for(y=0;y<pBinaryImg->height;y++) {
        for(x=0;x<h[y];x++) {
            t = cvGet2D(pHorizImg, y, x);
            t.val[0] = 255;
            cvSet2D(pHorizImg, y, x, t);
        }
    }

    delete[] v; // the counts are no longer needed once the bars are drawn
    delete[] h;
    *ppVerticImg = pVerticImg;
    *ppHorizImg = pHorizImg;
}
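
One common use of a vertical projection profile is character segmentation: columns with no ink mark the gaps between characters. The Python sketch below illustrates that idea on a small nested-list image in which 0 means ink, matching the convention of the projection code above. The function names and the `min_ink` parameter are hypothetical, and the source does not state that Ctrip segments characters exactly this way.

```python
def vertical_projection(binary):
    """Count ink pixels (value 0) in each column of a row-major binary image."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    return [sum(1 for y in range(height) if binary[y][x] == 0)
            for x in range(width)]


def split_columns(profile, min_ink=1):
    """Return (start, end) column ranges, end exclusive, whose ink count >= min_ink."""
    segments = []
    start = None
    for x, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = x                      # entering a character
        elif count < min_ink and start is not None:
            segments.append((start, x))    # leaving a character
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


if __name__ == "__main__":
    W, B = 255, 0  # white background, black ink
    image = [
        [W, B, B, W, W, B, W],
        [W, B, W, W, W, B, W],
    ]
    profile = vertical_projection(image)
    print(profile, split_columns(profile))
```

The same approach applied to the horizontal projection separates text lines. Touching or overlapping characters leave no empty column, so this method generally needs a fallback in practice.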

Feature extraction details include HOG (Histogram of Oriented Gradients) for gradient‑based object detection, LBP (Local Binary Patterns) for texture description, and Haar‑like features for rapid face‑like structure detection.
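
Of the three features, LBP is the simplest to show in code. The Python sketch below computes the basic 8-neighbour LBP code for each interior pixel and collects the codes into a 256-bin histogram that can serve as a texture descriptor. The clockwise bit ordering starting at the top-left neighbour is one common convention chosen here as an assumption, and the function names are illustrative.

```python
# Neighbour offsets (dy, dx), clockwise from the top-left pixel.
# The ordering is a convention; any fixed order works if used consistently.
LBP_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, y, x):
    """Basic LBP code of the interior pixel (y, x) in a row-major grayscale image."""
    center = image[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(LBP_OFFSETS):
        # A neighbour at least as bright as the center sets its bit.
        if image[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, y, x)] += 1
    return hist


if __name__ == "__main__":
    patch = [
        [10, 10, 10],
        [10, 50, 10],
        [10, 10, 10],
    ]
    print(lbp_code(patch, 1, 1))  # bright center, darker neighbours: no bits set
```

Because each code depends only on whether neighbours are brighter or darker than the center, the descriptor is unaffected by uniform changes in brightness, which suits unevenly lit document photos.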

Post‑processing logic adds validation steps such as ID number checks, passport number verification, Chinese surname and pronunciation validation, and other domain‑specific heuristics.
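
Two of these checks can be illustrated with well-known public checksum rules. The Python sketch below validates the check character of an 18-character Chinese resident ID number (the MOD 11-2 scheme of GB 11643-1999) and computes the check digit used in passport machine-readable zones (ICAO Doc 9303). Whether Ctrip's post-processing uses exactly these rules is an assumption, and the function names are hypothetical.

```python
# Weights and check characters of the resident ID checksum (GB 11643-1999).
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"
ASCII_DIGITS = set("0123456789")


def is_valid_resident_id(id_number):
    """Return True if the 18th character matches the checksum of the first 17 digits."""
    if len(id_number) != 18:
        return False
    body = id_number[:17]
    if not all(ch in ASCII_DIGITS for ch in body):
        return False
    total = sum(int(d) * w for d, w in zip(body, ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == id_number[17].upper()


def mrz_check_digit(field):
    """Check digit for a machine-readable-zone field (weights 7, 3, 1, repeating)."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if "0" <= ch <= "9":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10   # A=10 ... Z=35
        elif ch == "<":
            value = 0                         # filler character
        else:
            raise ValueError("unexpected MRZ character: %r" % ch)
        total += value * weights[i % 3]
    return total % 10


if __name__ == "__main__":
    print(is_valid_resident_id("11010519491231002X"))  # True
    print(mrz_check_digit("L898902C3"))                # 6 (specimen number from ICAO Doc 9303)
```

Checks like these let the system reject or re-run a recognition result whose characters are individually plausible but jointly inconsistent, for example when a single misread digit breaks the checksum.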

The article concludes with a list of reference materials covering OCR, deep learning, and related research papers.

