Unlock Real-Time Mobile OCR: Inside Ant’s xNN-OCR Engine and Its Tiny, Fast AI

Ant’s self‑developed xNN‑OCR demonstrates how advanced OCR can run offline on smartphones by combining GAN‑based data synthesis, lightweight ShuffleNet‑inspired detection, NAS‑optimized recognition, and aggressive model compression, delivering near‑real‑time accuracy for diverse mobile scenarios while preserving privacy and low cost.

Alibaba Terminal Technology

As mobile devices become more powerful, running complex AI computations on phones has become a core focus for many companies, enabling a wide range of edge‑intelligent applications. This article uses the widely adopted Optical Character Recognition (OCR) technology as an example to introduce Ant Group’s self‑developed mobile OCR solution, xNN‑OCR.

Background

OCR is a long‑standing and widely used research area in computer vision, and with the rise of deep learning its capabilities have expanded dramatically. Compared with cloud‑based OCR, on‑device OCR can extract text offline, which benefits real‑time, privacy‑sensitive, and cost‑critical scenarios. However, modern OCR models often carry dozens of megabytes of parameters and require hundreds of GFLOPs of computation, making them challenging to run within limited mobile resources. Ant combined its proprietary edge inference engine xNN with extensive algorithmic optimization to create a small, fast, and accurate OCR solution that has powered dozens of core business services since its launch in 2018.

xNN‑OCR Evolution

The development of an edge model follows several steps: data collection and annotation, network architecture design, training and tuning, edge porting, and deployment. xNN‑OCR has progressed through three stages: small‑character set, large‑character set, and heterogeneous‑computing models. The following sections describe the latest advances in data generation, network design, and model compression.

Data Generation

Data quality heavily influences OCR performance, especially for Chinese text with countless character combinations. To address data scarcity, Ant explored GAN‑based text generation. The pipeline uses three encoders to extract background, text, and font features, then performs font transfer and background blending to synthesize realistic images. In addition to standard adversarial and reconstruction losses, a recognition loss ensures the generated content is correctly recognized. A Cycle‑Path module further improves synthesis quality, allowing synthetic data at 10% of the original volume to achieve the same accuracy as using 100% real data.
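The article names three loss terms but not how they are combined. As a minimal sketch, the generator objective can be written as a weighted sum; the weights and the CTC-based recognition term here are illustrative assumptions, not values from the article.

```python
def generator_loss(adv_loss, recon_loss, recog_loss,
                   w_recon=10.0, w_recog=1.0):
    """Combine the generator's objectives for text-image synthesis:
    - adv_loss:   adversarial term, keeps synthesized images realistic
    - recon_loss: reconstruction term, preserves background and style
    - recog_loss: recognition term (e.g. CTC loss from a frozen
                  recognizer) that forces rendered text to stay readable
    The weights are hypothetical; the article does not publish them."""
    return adv_loss + w_recon * recon_loss + w_recog * recog_loss
```

The recognition term is what distinguishes this pipeline from generic image synthesis: a generator can fool the discriminator while producing unreadable glyphs, so legibility must be supervised explicitly.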

Network Architecture

xNN‑OCR consists of three main components: text line detection, text line recognition, and structured output.

1. Text Detection

Text detection differs from generic object detection due to extreme aspect ratios and rotated boxes. Traditional anchor‑based detectors require many anchors, increasing computation. Ant designed a lightweight detector based on ShuffleNet, employing multi‑layer shuffle modules and a pixel‑wise dense prediction head that outputs a class score and box regression for each pixel. To handle small and long targets on mobile, training uses instance balancing and OHEM (online hard example mining), while inference applies weighted‑fusion NMS, achieving significant gains in both speed and accuracy.
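The article names weighted-fusion NMS without detail. A common form, sketched below under that assumption, merges overlapping boxes by score-weighted averaging instead of discarding all but the top-scoring one, which stabilizes long, thin text boxes assembled from many per-pixel predictions.

```python
def iou(a, b):
    # a, b: axis-aligned boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def weighted_fusion_nms(boxes, scores, iou_thr=0.5):
    """Greedily cluster boxes by IoU, then replace each cluster with the
    score-weighted average of its members (a sketch, not Ant's exact
    algorithm). Returns a list of (fused_box, score) pairs."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    used = [False] * len(boxes)
    fused = []
    for i in order:
        if used[i]:
            continue
        used[i] = True
        cluster = [i]
        for j in order:
            if not used[j] and iou(boxes[i], boxes[j]) >= iou_thr:
                used[j] = True
                cluster.append(j)
        total = sum(scores[k] for k in cluster)
        merged = tuple(
            sum(boxes[k][d] * scores[k] for k in cluster) / total
            for d in range(4)
        )
        fused.append((merged, max(scores[k] for k in cluster)))
    return fused
```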

2. Text Recognition

After detection, recognition uses a CRNN backbone. Ant applied Neural Architecture Search (NAS) tailored for text recognition to find an optimal lightweight backbone. The original CRNN head consumed over 50% of total computation due to a one‑hot Softmax classifier. By replacing it with a dense Hamming‑code classifier, head latency dropped by about 70% without sacrificing accuracy.
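The intuition behind the Hamming-code head: a softmax over thousands of Chinese characters needs a projection of hidden size times class count, while a binary code of a few dozen bits can distinguish the same classes with a far smaller projection. The sketch below assumes random dense codes and nearest-code decoding; the article does not specify Ant's code construction.

```python
import numpy as np

def make_codebook(num_classes, code_bits, seed=0):
    """Assign each character class a dense binary code. Random codes are
    an illustrative assumption; with enough bits they are almost surely
    pairwise distinct and well separated in Hamming distance."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(num_classes, code_bits))

def decode(bit_logits, codebook):
    """bit_logits: (T, code_bits) per-timestep bit scores from the head.
    Threshold each bit, then pick the class whose code is nearest in
    Hamming distance. Returns one class index per timestep."""
    bits = (bit_logits > 0).astype(int)                        # (T, B)
    dists = (bits[:, None, :] ^ codebook[None, :, :]).sum(-1)  # (T, C)
    return dists.argmin(axis=1)
```

With ~7,000 characters, a softmax head outputs 7,000 values per timestep, whereas a code head might output well under 100 bits, which is consistent with the reported ~70% drop in head latency.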

3. Text Structuring

Structured output converts OCR results into key‑value pairs, e.g., for ID cards. Traditional methods rely on hand‑crafted rules based on text position, which are costly to maintain. xNN‑OCR instead introduces an instance‑detection algorithm that attaches a field‑class label to each text box during detection, allowing recognized text to be mapped directly to its field, reducing inference time and improving accuracy.
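Once the detector emits a field label per box, structuring collapses to a lookup. The field names below are hypothetical placeholders for an ID-card-style layout, not Ant's actual label set.

```python
# Hypothetical field-label set; the real label space is task-specific.
FIELD_NAMES = {0: "name", 1: "id_number", 2: "address"}

def structure(detections):
    """detections: list of (field_class, recognized_text) pairs produced
    by detection + recognition. With the field class predicted by the
    detector, structuring is a direct mapping rather than a pile of
    hand-written positional rules."""
    return {FIELD_NAMES[c]: text for c, text in detections}
```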

Model Compression

To accelerate edge model development, Ant built the xNAS tool, extending standard NAS with hardware‑aware metrics such as latency and FLOPs. For OCR, NAS searched channel widths and layer depths, reducing computation by 70% while slightly improving accuracy.
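The article says xNAS adds hardware-aware metrics to the search objective but does not give its form. A common formulation, sketched here as an assumption, multiplies accuracy by a soft latency penalty relative to a target budget, so the search can trade a little accuracy for large latency wins.

```python
def nas_score(accuracy, latency_ms, target_ms=50.0, beta=0.07):
    """Illustrative hardware-aware search objective: reward accuracy,
    penalize candidates whose measured on-device latency exceeds the
    target budget. target_ms and beta are hypothetical knobs; a real
    search would measure latency on the deployment device."""
    return accuracy * (latency_ms / target_ms) ** (-beta)
```

A candidate at twice the latency budget is scored as if it lost a few points of accuracy, which is how the search can discover architectures that cut computation by 70% without the objective collapsing toward tiny, inaccurate models.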

Quantization, pruning, and especially integer‑only inference are critical for mobile performance. Ant’s qNAS algorithm integrates quantization‑aware training with NAS, achieving less than 1% accuracy loss while shrinking model size to roughly one‑quarter and cutting CPU inference time by about 50%.
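The core mechanism of quantization-aware training is simulating integer rounding in the float forward pass so the loss sees the quantization error and the weights adapt to it (gradients typically bypass the rounding via a straight-through estimator). A minimal symmetric-int8 fake-quantize sketch, independent of any particular framework:

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Simulate integer quantization during training: scale to the
    signed int range, round, clip, and dequantize back to float. The
    per-tensor max-abs scale here is the simplest choice; production
    QAT usually learns or calibrates the scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = float(np.abs(x).max()) / qmax
    if scale == 0.0:                        # all-zero tensor
        scale = 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```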

Performance and Capability

xNN‑OCR now supports a broad range of OCR scenarios, from generic text recognition to specialized ID‑card extraction, delivering near‑real‑time performance on a Snapdragon 855 CPU (single‑thread). The following table summarizes latency and accuracy metrics.

Open Access

xNN‑OCR is integrated into Alipay for security risk control, document upload, and digital finance. It is exposed to external developers via a mini‑program plugin, Ant’s mPaaS product, and the Alibaba Cloud Vision Open Platform as an offline SDK.

Developers can refer to the Alipay mini‑program integration guide or contact the mPaaS team for SDK access.

Tags: model compression, Edge AI, NAS, data synthesis, mobile OCR