How Amap’s Scene Text Recognition Powers Accurate Maps: Evolution and Future Challenges
This article explains how Amap leverages scene text recognition to automate map data production, detailing the evolution from traditional image algorithms to deep‑learning models, the current detection and recognition framework, performance results, and future research directions for handling blur, data scarcity, and semantic understanding.
Background
Amap is a nationwide app with over 100 million daily active users, and it handles a massive volume of query, positioning, and navigation requests. The richness and accuracy of its map data largely determine the user experience. Traditional map data collection relied on manual editing of field-collected data, which made updates slow and costly. To address this, Amap adopted image-recognition technology that extracts map elements directly from collected images, enabling automated production of POI (Point of Interest) and road data.
Text Recognition Technology Evolution and Practice
STR algorithm development timeline
The development of Scene Text Recognition (STR) splits into two stages, with 2012 as the rough dividing line: the traditional image-algorithm stage before it and the deep-learning stage after it.
Traditional image algorithms
Image preprocessing: text region localization, correction, character segmentation using connected‑component analysis, MSER, affine transformation, binarization, projection analysis, etc.
Text recognition: handcrafted features (e.g., HOG) or CNN‑extracted features classified by SVM or similar classifiers.
Post‑processing: rule‑based and language‑model corrections.
These methods work well in simple scenes but require extensive parameter tuning for each scenario and struggle with complex, diverse environments.
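As a rough illustration of the preprocessing stage, the following Python sketch binarizes a tiny grayscale image and splits it into character spans using a vertical projection profile. The function names, the fixed threshold of 128, and the toy image are assumptions made for this example; none of them come from Amap's pipeline.

```python
from typing import List, Tuple


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Mark dark pixels (assumed to be ink) as 1 and light pixels as 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def vertical_projection(binary: List[List[int]]) -> List[int]:
    """Count ink pixels in every column of the binary image."""
    return [sum(column) for column in zip(*binary)]


def segment_columns(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, that contain ink."""
    spans = []
    start = None
    for x, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = x                      # a character begins
        elif count < min_ink and start is not None:
            spans.append((start, x))       # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


# Toy 3x8 grayscale image: two dark strokes separated by light columns.
toy = [
    [255, 0, 0, 255, 255, 0, 255, 255],
    [255, 0, 255, 255, 255, 0, 0, 255],
    [255, 0, 0, 255, 255, 0, 255, 255],
]
profile = vertical_projection(binarize(toy))
spans = segment_columns(profile)
print(profile)  # [0, 3, 2, 0, 0, 3, 1, 0]
print(spans)    # [(1, 3), (5, 7)]
```

The hard-coded threshold is exactly the kind of parameter that has to be retuned per scenario, which is the weakness noted above.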
Deep learning algorithms
After 2012, deep learning became dominant in computer vision, leading to two main STR solutions: a two‑stage pipeline (text line detection + recognition) and end‑to‑end models that jointly perform detection and recognition.
Two‑stage text recognition
The pipeline first detects text lines using regression‑based, segmentation‑based, or hybrid methods, then recognizes the content via CTC‑based or attention‑based sequence models.
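To make the CTC-based option concrete, here is a minimal sketch of greedy CTC decoding: take the best class per frame, collapse consecutive repeats, then drop the blank symbol. The alphabet, the blank index of 0, and the one-hot frame scores are illustrative assumptions; a production recognizer would emit real per-frame probabilities and might use beam search instead.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      alphabet: Sequence[str],
                      blank: int = 0) -> str:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks."""
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    chars = []
    prev = None
    for idx in best_path:
        # Emit a symbol only when it is not blank and differs from the previous frame.
        if idx != blank and idx != prev:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)


def one_hot(index: int, size: int) -> List[float]:
    """Helper for the demo: a score vector with all mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


ALPHABET = ["<blank>", "A", "M", "P"]  # index 0 is the CTC blank (assumption)
path = [1, 1, 0, 2, 1, 0, 3, 3]        # A A _ M A _ P P
frames = [one_hot(i, len(ALPHABET)) for i in path]
decoded = ctc_greedy_decode(frames, ALPHABET)
print(decoded)  # AMAP
```

The blank is what lets the decoder keep a genuinely repeated character: the path A, blank, A decodes to "AA", while A, A decodes to "A".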
End‑to‑end text recognition
A single model simultaneously handles detection and recognition, improving real‑time performance and allowing joint training to boost both tasks.
Text Recognition Framework
Amap’s current framework consists of three modules: text line detection, single-character detection and recognition, and sequence recognition. The detection module predicts text masks and text line positions. To improve segmentation accuracy, it uses DCN (deformable convolution) to capture directional features, an enlarged mask branch, and ASPP (atrous spatial pyramid pooling). Detected masks are then converted to minimum bounding polygons for downstream processing.
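The mask-to-polygon step can be sketched as follows. The source does not say which algorithm Amap uses, so this example substitutes a plain convex hull (Andrew's monotone chain) over the foreground pixel coordinates as one simple kind of bounding polygon; the function names and the toy mask are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def mask_to_points(mask: List[List[int]]) -> List[Point]:
    """Collect (x, y) coordinates of all foreground pixels in a binary mask."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def cross(o: Point, a: Point, b: Point) -> int:
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices without collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Toy mask with a 3x2 block of text pixels.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
polygon = convex_hull(mask_to_points(mask))
print(polygon)  # [(1, 1), (3, 1), (3, 2), (1, 2)]
```

The vertices here are pixel indices rather than pixel edges, and a convex hull cannot follow curved text tightly, so a real system would likely trace contours instead.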
Text line detection
The detector has to cope with natural-scene text that varies widely in scale, orientation, and image quality, which is what the DCN, enlarged mask branch, and ASPP module described above are meant to address. Online data augmentation (rotation, flipping, mixup) further improves generalization.
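One detail that is easy to get wrong in online augmentation for detection is keeping the annotations in sync with the image. The sketch below shows this for a horizontal flip only; the data layout (images as nested lists, polygons as lists of pixel-index `(x, y)` tuples) and the function name are assumptions for illustration, not Amap's implementation.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def hflip_sample(image: List[List[int]],
                 polygons: List[List[Point]]) -> Tuple[List[List[int]], List[List[Point]]]:
    """Flip an image left-right and mirror its text polygons to match."""
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Mirror x as width - 1 - x; reverse vertex order to preserve winding direction.
    flipped_polygons = [
        [(width - 1 - x, y) for x, y in reversed(poly)]
        for poly in polygons
    ]
    return flipped_image, flipped_polygons


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
polygons = [[(0, 0), (1, 0), (1, 1), (0, 1)]]

flipped_image, flipped_polygons = hflip_sample(image, polygons)
print(flipped_image)     # [[4, 3, 2, 1], [8, 7, 6, 5]]
print(flipped_polygons)  # [[(3, 1), (2, 1), (2, 0), (3, 0)]]
```

Applying the flip twice returns the original image and polygons, which makes a convenient sanity check for any geometric augmentation.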
The detector was evaluated on the public benchmarks ICDAR2013, ICDAR2017-MLT, and ICDAR2019-ReCTS, where it is reported to perform strongly.
Text recognition
Business requirements demand both high completeness of text line recognition and a subset with >99% confidence. Two metrics are defined:
Text line full-match rate: the proportion of text lines in which every character is recognized correctly and in the correct order.
High‑confidence proportion: proportion of lines with confidence >99%.
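Both metrics are simple ratios, and a small sketch makes the definitions explicit. The function names, the sample strings, and the confidence values below are made up for illustration; the 0.99 threshold follows the definition above.

```python
from typing import Sequence


def full_match_rate(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Share of text lines whose predicted string equals the label exactly."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return matches / len(ground_truth)


def high_confidence_proportion(confidences: Sequence[float],
                               threshold: float = 0.99) -> float:
    """Share of text lines whose confidence is strictly above the threshold."""
    if not confidences:
        raise ValueError("confidences must be non-empty")
    return sum(1 for c in confidences if c > threshold) / len(confidences)


predicted = ["高德地图", "路山中", "Exit 12", "北京站"]
labels = ["高德地图", "中山路", "Exit 12", "北京南站"]
confidences = [0.999, 0.62, 0.995, 0.97]

print(full_match_rate(predicted, labels))       # 0.5
print(high_confidence_proportion(confidences))  # 0.5
```

Note that "路山中" contains the right characters in the wrong order and still counts as a miss, which is why the metric is stricter than per-character accuracy.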
Two complementary approaches are used:
Single‑character detection & recognition
This approach uses Faster R-CNN for character detection and a SENet-based recognizer that supports over 7,000 Chinese and English characters. Optimizations include identity mapping, MobileNetV2-style skip connections, and extensive data augmentation. The approach placed second in ICDAR2019-ReCTS.
Sequence recognition
Modern sequence recognizers (e.g., ASTER, DTRB) break the task into four steps: text region correction, feature extraction, sequence encoding, and decoding. Amap employs a TPS-Inception-BiLSTM-Attention architecture to handle multi-directional text: a TPS (thin-plate spline) stage corrects perspective distortion, and the rectified image is fed into a CNN-BiLSTM-Attention pipeline, with Inception as the CNN feature extractor.
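The decoding stage can be illustrated with a single attention step: at each output position the decoder scores every encoder feature, normalizes the scores with softmax, and reads a weighted sum. This sketch uses dot-product scoring for brevity, which is an assumption; the source does not specify the scoring function, and the vectors below are toy values standing in for BiLSTM outputs.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float],
           encoder_states: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """One attention step: returns (weights over positions, context vector)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Three encoder positions; the query is aligned with the first one.
encoder_states = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
query = [1.0, 0.0]
weights, context = attend(query, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Because the weights are recomputed for every output character, the decoder can follow text that is not laid out strictly left to right, which is part of why attention suits multi-directional text.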
Sample mining & synthesis
To address rare characters and low‑frequency words in POI and road signs, Amap combines real‑sample mining (identifying images containing uncommon characters for manual annotation) with synthetic data generation via image rendering. Mixing real and synthetic samples significantly improves recognition performance.
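The real-sample mining step can be sketched as a frequency count followed by a filter. This is a simplified illustration under stated assumptions: the existing training labels are available as strings, each candidate image already has a draft text (for example from a current recognizer), and a character is "rare" if it appears at most `max_count` times. All names and sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rare_characters(labels: Iterable[str], max_count: int = 1) -> Set[str]:
    """Characters seen at most max_count times across the training labels."""
    counts = Counter(ch for text in labels for ch in text if not ch.isspace())
    return {ch for ch, n in counts.items() if n <= max_count}


def mine_samples(candidates: Iterable[Tuple[str, str]], rare: Set[str]) -> List[str]:
    """Return IDs of candidate images whose draft text contains a rare character."""
    return [image_id for image_id, text in candidates
            if any(ch in rare for ch in text)]


training_labels = ["中山路", "中山路", "山西路", "山西路", "饕餮馆"]
rare = rare_characters(training_labels)

candidates = [
    ("img_001", "中山路"),
    ("img_002", "老饕餐厅"),
    ("img_003", "山西路"),
]
to_annotate = mine_samples(candidates, rare)
print(sorted(rare))  # ['饕', '餮', '馆']
print(to_annotate)   # ['img_002']
```

Images selected this way would go to manual annotation, while characters that never appear in collected imagery at all are the natural targets for synthetic rendering.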
Future Development and Challenges
Key challenges include data scarcity for diverse Chinese characters, handling blurred images, and improving model efficiency for edge deployment. Research directions cover data augmentation (AutoAugment, style‑transfer synthesis), blur‑text super‑resolution (TextSR, GAN‑based methods), semantic understanding by integrating language models (e.g., SEED), and lightweight on‑device OCR frameworks.
Data side
Automatic data expansion via augmentation strategies (AutoAugment) or synthetic generation (e.g., Alibaba’s SwapText) is essential when manual labeling is limited.
Model side
Blurred text detection benefits from super‑resolution networks integrated into detection pipelines, while semantic priors from NLP improve recognition accuracy. End‑to‑end lightweight models aim to reduce cloud bandwidth and server load for edge deployment.