Comprehensive Overview of OCR: Types, Models, Pre‑training Techniques, and DIY Pipelines on ModelScope
This article provides a detailed introduction to OCR technology, covering its fundamental concepts, major categories (document, scene, and handwritten OCR), typical processing pipelines, a suite of open‑source models on ModelScope—including detection, recognition, and table OCR—and recent multimodal pre‑training methods such as VLDoc and VLPT.
OCR (Optical Character Recognition) is a crucial AI technology that converts visual text into machine‑readable information, enabling applications ranging from document digitization to scene understanding.
OCR Types: The field is broadly divided into three categories — Document OCR (DAR, document analysis and recognition), Scene OCR (STR, scene text recognition), and Handwritten OCR (HCR, handwritten character recognition). DAR handles structured documents with tables, charts, and seals; STR focuses on natural scene text such as street signs and product labels; HCR deals with handwritten inputs like notes and signatures.
Typical OCR Pipeline: The workflow generally includes (1) image pre‑processing (classification, enhancement, correction), (2) layout analysis (detecting paragraphs, headers, tables, etc.), (3) text detection and recognition (including formulas and charts), and (4) semantic post‑processing (information extraction, table reconstruction, etc.).
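The four stages above can be sketched as a simple composition of callables. This is an illustrative skeleton only — the stage functions are placeholders supplied by the user, not ModelScope APIs:

```python
def run_ocr(image, preprocess, analyze_layout, detect_and_recognize, postprocess):
    """Chain the four pipeline stages; each argument is a callable.

    preprocess:           stage 1 (classification / enhancement / correction)
    analyze_layout:       stage 2, returns a list of layout regions
    detect_and_recognize: stage 3, runs per region
    postprocess:          stage 4 (information extraction, table rebuild)
    """
    img = preprocess(image)
    regions = analyze_layout(img)
    texts = [detect_and_recognize(img, region) for region in regions]
    return postprocess(regions, texts)
```

In practice each callable would wrap one or more ModelScope models; keeping the stages decoupled makes it easy to swap, say, a line-level detector for a word-level one without touching the rest of the chain.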
ModelScope Open‑Source Models: DAMO Academy has released a series of models on ModelScope covering the entire pipeline. Detection models include line‑level detectors (SegLink++, DBNet) and table structure recognizers. Recognition models feature ConvTransformer‑based general and handwritten recognizers, as well as classic CRNN models. Table OCR combines detection, text recognition, and cell‑level layout reconstruction.
DIY Pipelines: Users can download the detection and recognition models from ModelScope, launch a notebook (CPU or GPU), and chain the models to build custom pipelines for general, handwritten, or table OCR. Example code loads the two models, extracts polygonal regions from detection results, crops each region, and runs the recognizer on the cropped lines.
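A minimal sketch of that chaining, assuming ModelScope's `pipeline` API. The model IDs in the comments are placeholders, not confirmed identifiers from the article, so the ModelScope calls are left commented and only the pure-Python crop helper is live:

```python
# from modelscope.pipelines import pipeline
# from modelscope.utils.constant import Tasks
# detector   = pipeline(Tasks.ocr_detection,   model='<detection-model-id>')    # placeholder ID
# recognizer = pipeline(Tasks.ocr_recognition, model='<recognition-model-id>')  # placeholder ID

def crop_polygon(image, polygon):
    """Crop the axis-aligned bounding box of a detected text polygon.

    image:   nested lists, image[y][x] (e.g. grayscale pixel rows)
    polygon: list of (x, y) vertices returned by the detector
    """
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    x0, x1 = max(0, min(xs)), min(len(image[0]), max(xs) + 1)
    y0, y1 = max(0, min(ys)), min(len(image), max(ys) + 1)
    return [row[x0:x1] for row in image[y0:y1]]

# Chaining detection and recognition over each detected region:
# lines = [recognizer(crop_polygon(img, poly)) for poly in detected_polygons]
```

Real detectors return rotated quadrilaterals, so production code would typically rectify each polygon with a perspective transform rather than a plain bounding-box crop; the helper above keeps the sketch dependency-free.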
Document Pre‑training (VLDoc): VLDoc series models address the limited interaction between visual and textual modalities in existing document pre‑training by aligning visual‑language features and incorporating layout‑aware objectives. They achieve SOTA results on benchmarks such as FUNSD, CORD, RVL‑CDIP, and DocVQA, and are deployed in DAMO’s IDP and self‑learning platforms.
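A common ingredient of layout-aware objectives is discretizing each token's bounding box onto a fixed grid before feeding it to a position-embedding table, as popularized by LayoutLM. The sketch below shows that convention; VLDoc's exact scheme is not specified in the article, so treat the 0–1000 scale as an assumption:

```python
def normalize_bbox(bbox, width, height, scale=1000):
    """Map a pixel-space box (x0, y0, x1, y1) onto a scale x scale grid.

    The resulting integer coordinates index a learned 2-D position
    embedding table, making the model's inputs layout-aware while staying
    independent of the page's actual resolution.
    """
    x0, y0, x1, y1 = bbox
    return (scale * x0 // width, scale * y0 // height,
            scale * x1 // width, scale * y1 // height)
```

Because the coordinates are resolution-normalized, the same embedding table serves scanned pages, photos, and rendered PDFs alike.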
Multimodal Text‑Detection Pre‑training (VLPT): VLPT introduces three pre‑training tasks — masked language modeling, image‑text contrastive learning, and text‑existence judgment — to teach models the correspondence between image patches and textual tokens. The pretrained backbone can be fine‑tuned on detectors like EAST, PSENet, and DBNet across datasets (IC15, IC17, TotalText, CTW1500, TD500), improving robustness and accuracy.
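Of the three tasks, image-text contrastive learning is the least self-explanatory. A minimal pure-Python sketch of the symmetric InfoNCE objective commonly used for it follows; the temperature value and exact formulation are generic assumptions, not VLPT specifics:

```python
import math

def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a similarity matrix.

    sim[i][j] is the similarity between image i and text j; matched
    pairs sit on the diagonal, so each row (image -> text) and each
    column (text -> image) is scored with softmax cross-entropy
    against its own index, and the two directions are averaged.
    """
    n = len(sim)

    def xent(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # -log softmax at the true index
        return total / n

    i2t = xent(sim)
    t2i = xent([[sim[j][i] for j in range(n)] for i in range(n)])
    return 0.5 * (i2t + t2i)
```

The loss is minimized when matched image-text pairs score higher than all mismatched ones, which is exactly the patch-token correspondence the pre-training aims to instill.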
All models and demo notebooks are publicly available on ModelScope, enabling developers and researchers to quickly experiment with state‑of‑the‑art OCR solutions.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.