
Evolution and Practice of Scene Text Recognition Technology in Amap Map Data Production

Amap automates map data production with a scene text recognition pipeline that combines detection and recognition modules, deep learning, data synthesis, and result fusion. The system achieves state-of-the-art benchmark performance and automates the majority of POI and road-information updates, significantly reducing labor costs.

Amap Tech

Background

Amap, a national‑level navigation app with over 100 million daily active users, relies on rich and accurate map data to deliver a smooth user experience. Traditional map data collection involves manual field work and editing, which is slow and costly. To automate data production, Amap uses image‑recognition techniques to extract map elements directly from massive image collections, especially focusing on scene text (signboards, logos, POI names, road signs).

Challenges of Scene Text Recognition

Rich variety of fonts, artistic styles, and layouts.

Complex backgrounds with occlusion, uneven lighting, and low‑quality images from crowdsourced devices.

Diverse image sources with varying resolution, blur, tilt, and focus problems.

These factors make it difficult to achieve a solution that is comprehensive, accurate, and fast.

Evolution of Text Recognition Technology

1. Traditional Image Algorithms (pre‑2012)

Three‑stage pipeline: image preprocessing (region localization, rectification, character segmentation), character recognition (hand‑crafted features such as HOG or CNN‑extracted features fed to classifiers like SVM), and post‑processing (rules, language models). While effective on simple scenes, each stage required hand‑tuned parameters and did not generalize well to complex real‑world images.
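As a toy illustration of the character-segmentation stage in such classical pipelines, a vertical projection profile splits a binarized text line into character spans wherever the ink count drops to zero. This is one common hand-tuned technique of that era, not Amap's specific method:

```python
import numpy as np

def segment_characters(binary_line: np.ndarray) -> list:
    """Split a binarized text line (H x W, foreground == 1) into
    character column ranges using a vertical projection profile."""
    profile = binary_line.sum(axis=0)          # ink pixels per column
    spans, start = [], None
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x                          # a character begins
        elif ink == 0 and start is not None:
            spans.append((start, x))           # the character ends
            start = None
    if start is not None:                      # character touches the right edge
        spans.append((start, binary_line.shape[1]))
    return spans
```

Techniques like this fail exactly where the article says: touching glyphs, tilted lines, and cluttered backgrounds break the zero-gap assumption, which is what motivated the move to learned detectors.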

2. Deep Learning Era (post‑2012)

Two main paradigms emerged:

Two‑stage approach: first detect text lines (using regression‑based boxes, segmentation, or hybrid methods), then recognize their content with CTC‑based or attention‑based sequence models.

End‑to‑end approach: a single model jointly performs detection and recognition, improving speed and allowing the two tasks to reinforce each other.
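The CTC-based recognizers mentioned above emit one class distribution per image frame; decoding collapses repeats and drops blanks to produce the final string. A minimal greedy-decoding sketch (the blank index and charset layout are illustrative assumptions):

```python
import numpy as np

BLANK = 0  # index reserved for the CTC blank symbol (assumption)

def ctc_greedy_decode(logits: np.ndarray, charset: str) -> str:
    """Collapse a (T, C) matrix of per-frame class scores into a string:
    take the argmax per frame, merge consecutive repeats, drop blanks."""
    best = logits.argmax(axis=1)
    out, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(charset[idx - 1])   # charset excludes the blank
        prev = idx
    return "".join(out)
```

Greedy decoding is the cheapest variant; beam search with a language-model prior is the usual production upgrade.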

Amap adopted a hybrid framework that combines the strengths of both paradigms.

Current Amap Text Recognition Framework

The pipeline consists of three modules:

Text‑line detection: predicts text masks and line positions, using Deformable Convolution (DCN) and ASPP to handle arbitrary orientations and distortions.

Single‑character detection & recognition: Faster R‑CNN localizes character boxes; a SENet‑based classifier covers more than 7,000 Chinese and English characters.

Sequence recognition: TPS‑Inception‑BiLSTM‑Attention model that rectifies perspective, extracts features via CNN, encodes with BiLSTM, and decodes with attention.
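The attention decoding step in such a sequence model scores each encoder state against the current decoder state and returns a weighted context vector. A minimal additive (Bahdanau-style) sketch; the function and weight names are illustrative, not Amap's implementation:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(query, keys, values, Wq, Wk, v):
    """One decode step of additive attention.
    query: (D,) decoder state; keys/values: (T, D) encoder outputs;
    Wq, Wk: (H, D) projections; v: (H,) scoring vector.
    Returns the context vector and the attention weights."""
    scores = np.tanh(keys @ Wk.T + query @ Wq.T) @ v   # (T,)
    alpha = softmax(scores)                            # weights over frames
    return alpha @ values, alpha                       # context: (D,)
```

At each output character the decoder re-runs this step, letting it focus on a different slice of the rectified text line.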

Detected line masks are converted to minimum bounding polygons for downstream processing. Online data augmentation (rotation, flip, mixup) is applied during training to improve robustness.
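Of the augmentations listed, mixup is the least self-explanatory: two training samples are blended with a Beta-distributed ratio, labels included. A minimal sketch of the standard formulation (the Beta parameter is a common default, not Amap's setting):

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha: float = 0.2, rng=None):
    """Blend two images and their one-hot labels with a single
    Beta(alpha, alpha)-distributed mixing coefficient."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    img = lam * img_a + (1 - lam) * img_b
    label = lam * label_a + (1 - lam) * label_b
    return img, label
```

Because the operation is a pure array blend, it composes cheaply with the rotation and flip steps in an online augmentation pipeline.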

Detection results on public benchmarks (ICDAR 2013, ICDAR 2017‑MLT, ICDAR 2019‑ReCTS) show state‑of‑the‑art performance.

Recognition Results Fusion

Both single‑character and sequence recognizers are run; high‑confidence results (≥99% accuracy) from the sequence model are kept, while low‑confidence or structurally complex cases are supplemented by the character detector, achieving higher overall accuracy.
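The fusion rule can be sketched as a simple confidence gate over the two recognizers' outputs. The threshold value and the fallback comparison below are illustrative assumptions, not Amap's published logic:

```python
def fuse_results(seq_text: str, seq_conf: float,
                 char_text: str, char_conf: float,
                 threshold: float = 0.99) -> str:
    """Keep the sequence model's output when it is confident enough;
    otherwise fall back to the character-level result if that one
    scores higher."""
    if seq_conf >= threshold:
        return seq_text
    return char_text if char_conf > seq_conf else seq_text
```

The gate keeps the fast path (sequence model) dominant while letting the character detector rescue structurally complex or low-confidence lines.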

Sample Mining & Synthesis

To address rare characters and insufficient training data, Amap combines:

Mining real images that contain uncommon or traditional characters, followed by manual annotation.

Synthesizing text images using rendering pipelines to generate balanced datasets.

The mixed dataset dramatically improves recognition of obscure glyphs.
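One simple way to keep rare glyphs represented in such a mixed dataset is to weight each sample inversely to its character-class frequency when drawing training batches. This is a generic sketch, not Amap's specific sampling scheme:

```python
from collections import Counter

def sampling_weights(labels: list, smoothing: float = 1.0) -> list:
    """Return one sampling weight per sample, inversely proportional to
    its class frequency, so rare glyphs are drawn as often as common
    ones. `smoothing` guards against extreme weights for singletons."""
    freq = Counter(labels)
    return [1.0 / (freq[c] + smoothing) for c in labels]
```

Feeding these weights to a weighted random sampler flattens the class distribution seen by the model without duplicating data on disk.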

Future Directions and Challenges

Key research problems include:

Data scarcity: leveraging advanced augmentation (AutoAugment) and style‑transfer synthesis (e.g., SwapText) to generate high‑quality samples.

Model design for blurry text: integrating super‑resolution (e.g., SRGAN‑based TextSR) or GAN‑enhanced detection to recover detail.

Semantic understanding: incorporating language models (e.g., SEED) to provide prior knowledge for ambiguous characters.

Edge deployment: developing lightweight, high‑accuracy OCR models suitable for on‑device inference, reducing bandwidth and server load.

By continuously refining algorithms and fusing multiple recognition results, Amap’s OCR system now automates over 70% of POI data creation and more than 90% of road information updates, significantly cutting labor costs.

Reference

1. Liao M, et al. “TextBoxes++: A Single‑Shot Oriented Scene Text Detector.” IEEE TIP, 2018.
2. Lyu P, et al. “Mask TextSpotter: An End‑to‑End Trainable Neural Network for Spotting Text with Arbitrary Shapes.” 2018.
3. Zhou X, et al. “EAST: An Efficient and Accurate Scene Text Detector.” 2017.
4. Shi B, et al. “An End‑to‑End Trainable Neural Network for Image‑Based Sequence Recognition.” IEEE TPAMI, 2017.
5. Wojna Z, et al. “Attention‑Based Extraction of Structured Information from Street View Imagery.” ICDAR, 2018.
6. Li H, et al. “Towards End‑to‑End Text Spotting with Convolutional Recurrent Neural Networks.” 2017.

Tags: computer vision, deep learning, OCR, Amap, map data automation, scene text recognition
Written by Amap Tech, the official Amap technology account showcasing all of Amap's technical innovations.