Document OCR: From Computer Vision Fundamentals to Ctrip's Full-Text OCR Implementation

This article traces the evolution of optical character recognition and outlines the complete OCR processing pipeline: image input, preprocessing, binarization, noise removal, tilt correction, layout analysis, character segmentation, recognition, and post-processing. It also presents Ctrip's real-world OCR project, including its architecture and accuracy metrics, along with the underlying computer-vision techniques: CNNs, the HSV color model, and HOG, LBP, and Haar features.

Ctrip Technology

Optical Character Recognition (OCR) is the process of analyzing image files to extract textual and layout information, typically involving steps such as image input, preprocessing (binarization, noise removal, tilt correction), layout analysis, character segmentation, recognition, layout restoration, and post‑processing.

The Ctrip document OCR project aims to recognize Chinese and English text on IDs, passports, train tickets, visas, and similar documents. Its milestones run from 2016 (client-side ID and passport recognition) to 2018 (real-time in-app recognition of multiple document types). Reported error rates are below 0.5% for digits, with a false-acceptance rate of around 1 to 3% for Chinese text.

Key knowledge includes computer‑vision basics, the HSV color model, grayscale imaging, and deep‑learning models based on convolutional neural networks (CNN) combined with RNN/CTC or attention mechanisms.
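
To make the CNN plus RNN/CTC combination more concrete, the sketch below shows only the final CTC step: turning per-frame class scores into a string by best-path (greedy) decoding. The scores, the two-letter alphabet, and the choice of index 0 for the blank symbol are illustrative assumptions, not details taken from Ctrip's models.

```python
from itertools import groupby

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def ctc_greedy_decode(frame_scores, alphabet):
    """Decode per-frame class scores with best-path CTC decoding.

    frame_scores: list of score lists, one list per time step; class 0 is
                  the blank and class k maps to alphabet[k - 1].
    """
    # 1. Pick the highest-scoring class at every time step.
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Merge consecutive repeats of the same class.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining classes to characters.
    return "".join(alphabet[label - 1] for label in collapsed if label != BLANK)


# Toy input: 6 frames over the classes [blank, "A", "B"].
scores = [
    [0.10, 0.80, 0.10],  # A
    [0.10, 0.70, 0.20],  # A (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.70, 0.10],  # A (a new character, because a blank separates it)
    [0.10, 0.20, 0.70],  # B
    [0.10, 0.10, 0.80],  # B (repeat, merged)
]
print(ctc_greedy_decode(scores, "AB"))  # AAB
```

The blank symbol is what lets the decoder keep genuinely doubled characters: without the blank frame, the two runs of "A" would collapse into one.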

Image binarization techniques such as fixed-threshold, adaptive-threshold, Otsu's method, and pooling are discussed. The article then illustrates the OCR architecture and implementation details: detection (guided vs. unguided), rejection handling using histogram equalization, and client-side or front-end processing.
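
As a rough illustration of two of the binarization options named above, the following sketch implements a fixed threshold and Otsu's method in plain Python on a flat list of 8-bit grayscale values. The function names and the sample data are hypothetical; a production system would more likely call an image library.

```python
def otsu_threshold(pixels):
    """Return the threshold in 0..255 that maximizes between-class variance.

    pixels: iterable of 8-bit grayscale values.
    Values <= threshold form class 0, values > threshold form class 1.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of class 0
    sum0 = 0  # intensity sum of class 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # class 0 is still empty
        w1 = total - w0
        if w1 == 0:
            break     # class 1 is empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to 255 (white) and the rest to 0 (black)."""
    return [255 if p > threshold else 0 for p in pixels]


# Hypothetical bimodal data: dark ink around 10, bright paper around 200.
sample = [10] * 50 + [200] * 50
fixed = binarize(sample, 128)                    # fixed threshold chosen by hand
auto = binarize(sample, otsu_threshold(sample))  # threshold chosen from the histogram
print(fixed == auto)  # True
```

A fixed threshold works when lighting is controlled; Otsu's method picks the split from the histogram, which helps when exposure varies from photo to photo.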

Detection leverages prior knowledge (e.g., face or document edges) and deep‑learning models; rejection handling uses histogram equalization and binary search algorithms, exemplified by the following code:

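def binary_search(arr, start, end, hkey):
    """Return the index of hkey in sorted arr[start..end] (inclusive), or -1."""
    if start > end:
        return -1
    # Use integer division: "/" yields a float in Python 3, which cannot index a list.
    mid = start + (end - start) // 2
    if arr[mid] > hkey:
        return binary_search(arr, start, mid - 1, hkey)
    if arr[mid] < hkey:
        return binary_search(arr, mid + 1, end, hkey)
    return mid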
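
The summary does not show how histogram equalization enters the rejection step, so the block below is only a sketch of the standard technique itself: remapping 8-bit grayscale values through the cumulative histogram so that they span the full 0 to 255 range. The function name and sample values are assumptions for illustration.

```python
def equalize_histogram(pixels):
    """Stretch 8-bit grayscale values over 0..255 using the cumulative histogram."""
    if not pixels:
        return []
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution: cdf[v] = number of pixels with value <= v.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return list(pixels)  # constant image: nothing to stretch

    # Lookup table that maps each original value to its equalized value.
    lut = [round(max(c - cdf_min, 0) / (total - cdf_min) * 255) for c in cdf]
    return [lut[p] for p in pixels]


# A low-contrast strip: four nearly identical gray levels.
print(equalize_histogram([100, 101, 102, 103]))  # [0, 85, 170, 255]
```

After equalization, a washed-out capture uses the whole intensity range, which makes a later quality check or threshold less sensitive to exposure.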

Text detection distinguishes guided (using priors) from unguided (pure attention‑based) approaches, and the system incorporates face‑recognition components from the open‑source SeetaFaceEngine.

Text recognition employs grayscale projection, binarization, down‑sampling, and feature weighting (HOG, LBP, Haar). The projection algorithm is shown below:

int* v = NULL; // vertical projection: black-pixel count per column
int* h = NULL; // horizontal projection: black-pixel count per row
CvScalar s, t; // matrix elements during projection
IplImage* pBinaryImg = NULL; // binarized image; must point to a loaded single-channel image before the code below runs
IplImage* pVerticImg = NULL; // vertical projection image
IplImage* pHorizImg = NULL; // horizontal projection image
int x, y; // pixel coordinates
int i; // loop index for initializing the projection arrays
v = new int[pBinaryImg->width];
h = new int[pBinaryImg->height];
for(i=0;i<pBinaryImg->width;i++) v[i]=0;
for(i=0;i<pBinaryImg->height;i++) h[i]=0;
for(x=0;x<pBinaryImg->width;x++) {
    for(y=0;y<pBinaryImg->height;y++) {
        s = cvGet2D(pBinaryImg, y, x);
        if(s.val[0]==0) v[x]++; // count black pixels vertically
    }
}
for(y=0;y<pBinaryImg->height;y++) {
    for(x=0;x<pBinaryImg->width;x++) {
        s = cvGet2D(pBinaryImg, y, x);
        if(s.val[0]==0) h[y]++; // count black pixels horizontally
    }
}
// create projection images
pVerticImg = cvCreateImage(cvGetSize(pBinaryImg), 8, 1);
pHorizImg = cvCreateImage(cvGetSize(pBinaryImg), 8, 1);
cvZero(pVerticImg);
cvZero(pHorizImg);
for(x=0;x<pBinaryImg->width;x++) {
    for(y=0;y<v[x];y++) {
        t = cvGet2D(pVerticImg, y, x);
        t.val[0] = 255;
        cvSet2D(pVerticImg, y, x, t);
    }
}
for(y=0;y<pBinaryImg->height;y++) {
    for(x=0;x<h[y];x++) {
        t = cvGet2D(pHorizImg, y, x);
        t.val[0] = 255;
        cvSet2D(pHorizImg, y, x, t);
    }
}
// release the projection buffers once the projection images are drawn
delete[] v;
delete[] h;
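
The same projection idea can be sketched in Python, together with one common use of the profiles: cutting the image wherever the ink count drops to zero. That segmentation step is an assumption about how the projections are consumed, and the helper names are hypothetical.

```python
def projections(binary):
    """Count black pixels per column and per row.

    binary: list of rows; each pixel is 0 (black ink) or 255 (white background).
    Returns (vertical, horizontal) projection profiles.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    vertical = [sum(1 for y in range(height) if binary[y][x] == 0)
                for x in range(width)]
    horizontal = [sum(1 for value in row if value == 0) for row in binary]
    return vertical, horizontal


def split_runs(profile, min_count=1):
    """Return (start, end) pairs, end exclusive, where profile >= min_count."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count >= min_count and start is None:
            start = i                # a run of ink begins
        elif count < min_count and start is not None:
            runs.append((start, i))  # the run ended at a gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


W, B = 255, 0
image = [
    [W, W, W, W, W, W, W],
    [W, B, B, W, W, B, W],
    [W, B, W, W, W, B, W],
    [W, W, W, W, W, W, W],
]
vertical, horizontal = projections(image)
print(vertical)                # [0, 2, 1, 0, 0, 2, 0]
print(horizontal)              # [0, 3, 2, 0]
print(split_runs(vertical))    # [(1, 3), (5, 6)]  two character columns
print(split_runs(horizontal))  # [(1, 3)]          one text line
```

Splitting the horizontal profile first isolates text lines; splitting the vertical profile within each line then isolates characters.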

Feature extraction details include HOG (Histogram of Oriented Gradients) for gradient‑based object detection, LBP (Local Binary Patterns) for texture description, and Haar‑like features for rapid face‑like structure detection.
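
Of the three descriptors, LBP is the simplest to show in full. The sketch below computes the basic 8-neighbour LBP code of a pixel and a 256-bin histogram over an image; the clockwise bit order starting at the top-left neighbour is one convention among several and is an assumption of this example.

```python
# Neighbour offsets (dy, dx), clockwise from the top-left corner.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, y, x):
    """Basic 8-neighbour LBP code of the interior pixel (y, x).

    image: grayscale image as a list of rows.
    A neighbour contributes a 1 bit when it is >= the center value.
    """
    center = image[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if image[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, y, x)] += 1
    return hist


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
# Bits set: top-left (1), bottom-right (16), bottom (32), bottom-left (64), left (128).
print(lbp_code(patch, 1, 1))  # 241
```

Because each code depends only on comparisons with the center pixel, the histogram stays stable under uniform brightness changes, which is why LBP is used as a texture descriptor.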

Post‑processing logic adds validation steps such as ID number checks, passport number verification, Chinese surname and pronunciation validation, and other domain‑specific heuristics.
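
The summary does not spell out which checks are applied, so the following is a sketch of two standard checksum rules that such validation could plausibly rely on: the check character of an 18-digit Chinese resident ID number (GB 11643-1999) and the check digit used in passport machine-readable zones (ICAO Doc 9303). The function names are hypothetical.

```python
ID_WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
ID_CHECK_CODES = "10X98765432"  # indexed by (weighted sum mod 11)
DIGITS = set("0123456789")


def is_valid_resident_id(number):
    """Check the last character of an 18-character resident ID number."""
    number = number.strip().upper()
    if len(number) != 18 or not all(c in DIGITS for c in number[:17]):
        return False
    total = sum(int(d) * w for d, w in zip(number[:17], ID_WEIGHTS))
    return ID_CHECK_CODES[total % 11] == number[17]


def mrz_check_digit(field):
    """Check digit of a machine-readable-zone field (weights 7, 3, 1, mod 10)."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch in DIGITS:
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        elif ch == "<":
            value = 0                         # filler character
        else:
            raise ValueError(f"unexpected MRZ character: {ch!r}")
        total += value * weights[i % 3]
    return total % 10


# Widely published sample values, not real personal data.
print(is_valid_resident_id("11010519491231002X"))  # True
print(is_valid_resident_id("110105194912310021"))  # False
print(mrz_check_digit("L898902C3"))                # 6
```

A failed checksum is a cheap signal that at least one character was misrecognized, so the result can be rejected or sent back for another recognition pass.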

The article concludes with a list of reference materials covering OCR, deep learning, and related research papers.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
