How Amap’s Scene Text Recognition Powers Accurate Maps: Evolution and Future Challenges

This article explains how Amap leverages scene text recognition to automate map data production, detailing the evolution from traditional image algorithms to deep‑learning models, the current detection and recognition framework, performance results, and future research directions for handling blur, data scarcity, and semantic understanding.

Alibaba Cloud Developer

Jul 30, 2020

Background

Amap is a national-scale application with more than 100 million daily active users, serving a massive volume of query, positioning, and navigation requests. The richness and accuracy of its map data directly determine the user experience. Traditionally, map data was produced by manually editing field-collected material, which made updates slow and costly. To address this, Amap adopted image recognition to extract map elements directly from collected images, automating the production of POI (Point of Interest) and road data.

Text Recognition Technology Evolution and Practice

STR algorithm development timeline

The development of scene text recognition (STR) falls into two stages, with 2012 as the approximate dividing line: the traditional image-algorithm stage and the deep-learning stage.

Traditional image algorithms

Image preprocessing: localize text regions, rectify them, and segment individual characters, using techniques such as connected-component analysis, MSER (maximally stable extremal regions), affine transformation, binarization, and projection analysis.

Text recognition: handcrafted features (e.g., HOG) or CNN‑extracted features classified by SVM or similar classifiers.

Post‑processing: rule‑based and language‑model corrections.

These methods work well in simple scenes but require extensive parameter tuning for each scenario and struggle with complex, diverse environments.
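
As an illustration of the projection analysis mentioned in the preprocessing step, the following minimal sketch splits a binarized text line into character spans by looking for empty columns. The function name `segment_by_projection`, the `min_width` parameter, and the toy input are assumptions made for this example, not part of Amap's pipeline.

```python
from typing import List, Tuple


def segment_by_projection(binary: List[List[int]], min_width: int = 1) -> List[Tuple[int, int]]:
    """Split a binarized text line into character spans via a vertical projection.

    `binary` is a row-major grid where 1 marks an ink pixel and 0 marks background.
    Returns half-open column ranges (start, end), one per run of non-empty columns.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x  # entering a character
        elif count == 0 and start is not None:
            if x - start >= min_width:
                spans.append((start, x))  # leaving a character
            start = None
    if start is not None and width - start >= min_width:
        spans.append((start, width))  # character touching the right edge
    return spans


# Toy 3x8 text line containing three "characters" separated by empty columns.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_by_projection(line))  # [(0, 2), (4, 5), (6, 8)]
```

The sketch also hints at why such methods are fragile: touching characters, noise pixels, or slanted text all break the assumption that characters are separated by empty columns.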

Deep learning algorithms

After 2012, deep learning became dominant in computer vision, leading to two main STR solutions: a two‑stage pipeline (text line detection + recognition) and end‑to‑end models that jointly perform detection and recognition.

Two‑stage text recognition

The pipeline first detects text lines using regression‑based, segmentation‑based, or hybrid methods, then recognizes the content via CTC‑based or attention‑based sequence models.
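
To make the CTC-based recognition step more concrete, here is a minimal sketch of greedy (best-path) CTC decoding: take the best class per frame, merge consecutive repeats, and drop blanks. The name `ctc_greedy_decode`, the convention that class 0 is the blank, and the toy scores are assumptions for illustration; production systems typically decode from a trained network's outputs and may use beam search instead.

```python
from typing import List, Sequence

BLANK = 0  # assumed convention: class index 0 is the CTC blank


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], charset: str) -> str:
    """Best-path CTC decoding.

    `frame_scores[t][k]` is the score of class k at time step t.
    Class 0 is the blank; class k > 0 maps to charset[k - 1].
    """
    decoded: List[str] = []
    previous = BLANK
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only non-blank classes that differ from the previous frame's class.
        if best != BLANK and best != previous:
            decoded.append(charset[best - 1])
        previous = best
    return "".join(decoded)


# Six frames over the classes (blank, 'a', 'b'): a, a, blank, a, b, b
scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
print(ctc_greedy_decode(scores, "ab"))  # aab
```

The blank between the second and fourth frames is what allows the repeated "a" to survive the merge step.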

End‑to‑end text recognition

A single model simultaneously handles detection and recognition, improving real‑time performance and allowing joint training to boost both tasks.

Text Recognition Framework

Amap's current framework consists of three modules: text line detection, single-character detection and recognition, and sequence recognition. The detection module predicts text masks and text line positions. It uses DCN (deformable convolution) to capture directional features, an enlarged mask branch, and ASPP (atrous spatial pyramid pooling) to improve segmentation accuracy. Detected masks are then converted to minimum bounding polygons for downstream processing.
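
The mask-to-polygon step can be sketched as follows. The source does not say how the minimum bounding polygon is computed, so this example substitutes a convex hull (Andrew's monotone chain) over the foreground pixel coordinates as a simple stand-in. The function name `mask_to_polygon` and the toy mask are assumptions for illustration only.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def _cross(o: Point, a: Point, b: Point) -> int:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _half_hull(points: Iterable[Point]) -> List[Point]:
    """Build one chain of the hull, discarding collinear and inward points."""
    hull: List[Point] = []
    for p in points:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def mask_to_polygon(mask: List[List[int]]) -> List[Point]:
    """Return the convex hull of all non-zero pixels in a binary mask.

    This is a stand-in for a minimum bounding polygon; it cannot follow
    concave (e.g. curved) text outlines.
    """
    points = sorted({(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v})
    if len(points) <= 2:
        return points
    lower = _half_hull(points)
    upper = _half_hull(reversed(points))
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_polygon(mask))  # [(1, 1), (3, 1), (3, 2), (1, 2)]
```

A convex hull keeps the example dependency-free. For curved or irregular text, a contour-based polygon would fit the mask more tightly.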

Text line detection

The detector must handle natural-scene text that varies in scale, orientation, and image quality. It integrates DCN for directional features, expands the mask branch, and adds an ASPP module. Online data augmentation (rotation, flipping, mixup) improves generalization.
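
Of the augmentations listed, mixup is the least self-explanatory, so here is a minimal sketch of its generic form: two samples and their targets are blended with a weight drawn from a Beta distribution. Flat lists stand in for image tensors, and the function name `mixup`, the `alpha` default, and the blending of target vectors are assumptions for this example. The source does not describe how the detector's targets are combined.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mixup(
    image_a: Sequence[float],
    image_b: Sequence[float],
    target_a: Sequence[float],
    target_b: Sequence[float],
    alpha: float = 0.2,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Blend two samples: x = lam * x_a + (1 - lam) * x_b, with lam ~ Beta(alpha, alpha)."""
    if len(image_a) != len(image_b) or len(target_a) != len(target_b):
        raise ValueError("both samples must have matching shapes")
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    mixed_image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    mixed_target = [lam * a + (1.0 - lam) * b for a, b in zip(target_a, target_b)]
    return mixed_image, mixed_target, lam


# Two tiny "images" (flattened pixel values) with one-hot style targets.
rng = random.Random(0)
image, target, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], rng=rng)
print(lam, image, target)
```

With a small `alpha`, most sampled weights fall close to 0 or 1, so the blended sample usually stays near one of the two originals.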

Evaluation on public benchmarks (ICDAR2013, ICDAR2017‑MLT, ICDAR2019‑ReCTS) demonstrates superior performance.

Text recognition

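The business imposes two requirements on recognition: text lines should be recognized as completely as possible, and a subset of the results must reach a confidence above 99%. Two metrics are defined accordingly:

Text line full-match rate: proportion of text lines in which every character is recognized correctly and in the correct order.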

High‑confidence proportion: proportion of lines with confidence >99%.
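
The two metrics can be computed as in the sketch below. The `LineResult` record, the function names, and the sample data are hypothetical, and the confidence is assumed to be a single score in [0, 1] per text line.

```python
from typing import List, NamedTuple


class LineResult(NamedTuple):
    predicted: str      # recognizer output for one text line
    truth: str          # ground-truth transcription
    confidence: float   # line-level confidence in [0, 1] (assumed)


def full_match_rate(results: List[LineResult]) -> float:
    """Share of lines whose prediction equals the ground truth exactly."""
    if not results:
        return 0.0
    return sum(r.predicted == r.truth for r in results) / len(results)


def high_confidence_proportion(results: List[LineResult], threshold: float = 0.99) -> float:
    """Share of lines whose confidence exceeds the threshold."""
    if not results:
        return 0.0
    return sum(r.confidence > threshold for r in results) / len(results)


results = [
    LineResult("MAIN ST", "MAIN ST", 0.998),
    LineResult("CITY BANK", "CITY BANK", 0.950),
    LineResult("HOSPITAL", "HOSPITAL", 0.995),
    LineResult("SUPERMARKFT", "SUPERMARKET", 0.600),
]
print(full_match_rate(results))             # 0.75
print(high_confidence_proportion(results))  # 0.5
```

Exact string equality covers both conditions of the full-match definition, since a single wrong or misplaced character makes the comparison fail.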

Two complementary approaches are used:

Single‑character detection & recognition

This approach uses Faster R-CNN for character detection and a SENet-based recognizer that supports more than 7,000 Chinese and English characters. Optimizations include identity mapping, MobileNetV2-style skip connections, and extensive data augmentation. The approach placed second in ICDAR2019-ReCTS.

Sequence recognition

Modern sequence recognizers (e.g., ASTER, DTRB) share four steps: text region rectification, feature extraction, sequence encoding, and decoding. Amap employs a TPS-Inception-BiLSTM-Attention architecture to handle multi-directional text: a TPS (thin-plate spline) transform first corrects perspective distortion, and the rectified image is then fed into a CNN-BiLSTM-Attention pipeline.

Sample mining & synthesis

To address rare characters and low‑frequency words in POI and road signs, Amap combines real‑sample mining (identifying images containing uncommon characters for manual annotation) with synthetic data generation via image rendering. Mixing real and synthetic samples significantly improves recognition performance.
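
One way to picture the real-sample mining step is the sketch below: count character frequencies in the existing training labels, then flag newly recognized texts that contain characters seen rarely. The function names, the `max_count` threshold, and the sample strings are assumptions for illustration. The source does not describe how Amap actually selects images for annotation.

```python
from collections import Counter
from typing import Iterable, List, Set


def find_rare_characters(training_labels: Iterable[str], max_count: int = 1) -> Set[str]:
    """Characters that appear at most `max_count` times in the training labels."""
    counts = Counter(ch for text in training_labels for ch in text if not ch.isspace())
    return {ch for ch, n in counts.items() if n <= max_count}


def mine_candidates(recognized_texts: List[str], rare: Set[str]) -> List[int]:
    """Indices of recognized texts containing at least one rare character."""
    return [i for i, text in enumerate(recognized_texts) if any(ch in rare for ch in text)]


training_labels = ["中山路", "中山北路", "鑫源超市"]
rare = find_rare_characters(training_labels)  # 北, 鑫, 源, 超, 市 each occur once

# Texts recognized on new, unlabeled images; flagged ones would go to manual annotation.
recognized = ["中山路", "鑫鑫超市", "北京路"]
print(mine_candidates(recognized, rare))  # [1, 2]
```

A frequency filter like this only finds characters already present in the label set. Characters that never appear in training data would need a separate source, such as the synthetic rendering described above.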

Future Development and Challenges

Key challenges include data scarcity for diverse Chinese characters, handling blurred images, and improving model efficiency for edge deployment. Research directions cover data augmentation (AutoAugment, style‑transfer synthesis), blur‑text super‑resolution (TextSR, GAN‑based methods), semantic understanding by integrating language models (e.g., SEED), and lightweight on‑device OCR frameworks.

Data side

Automatic data expansion via augmentation strategies (AutoAugment) or synthetic generation (e.g., Alibaba’s SwapText) is essential when manual labeling is limited.

Model side

Blurred text detection benefits from super-resolution networks integrated into the detection pipeline, while semantic priors from NLP improve recognition accuracy. Lightweight end-to-end models deployed on edge devices aim to reduce the bandwidth and server load that cloud-side recognition would otherwise incur.
