How Gaode Maps Boosts Accuracy with Advanced Scene Text Recognition

This article explains how Gaode Maps applies scene text recognition, from traditional image algorithms to deep‑learning methods (character detection, sequence models, data synthesis, and multi‑stage frameworks), to automate POI and road data production with high precision and speed.

Alibaba Cloud Developer

Background

Gaode Maps, a national‑scale app with over 100 million daily active users, serves a massive volume of query, positioning, and navigation requests. The richness and accuracy of map data are crucial to the user experience. Traditional map data collection relied on manual editing of field‑collected data, which made updates slow and costly. Gaode therefore adopted image‑recognition technology to automatically extract map elements from massive image collections, enabling real‑time updates of POI (Point of Interest) and road data.

Evolution of Text Recognition Technology

Traditional Image Algorithms

Before 2012, text recognition depended on classic image processing and statistical machine‑learning methods, typically divided into image preprocessing, character recognition, and post‑processing. Preprocessing involved region localization, rectification, and segmentation using techniques such as connected‑component analysis, MSER, affine transforms, binarization, and projection analysis. Recognition used hand‑crafted features (e.g., HOG) or CNN‑extracted features classified by SVMs. Post‑processing applied rules and language models for correction. These pipelines worked in simple scenes but required extensive parameter tuning for each new scenario and struggled with complex environments.
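To make the classic preprocessing steps concrete, here is a minimal sketch of binarization followed by vertical projection analysis, one of the segmentation techniques named above. The function names, the fixed threshold of 128, and the toy image are illustrative assumptions, not Gaode's implementation.

```python
from typing import List, Tuple


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map dark pixels (below the threshold) to 1 (ink) and the rest to 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_columns(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a character candidate begins
        elif count == 0 and start is not None:
            spans.append((start, x))       # an empty column closes it
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Toy 2x7 grayscale image: 0 is black ink, 255 is white background.
gray = [
    [255, 0, 0, 255, 255, 0, 255],
    [255, 0, 255, 255, 255, 0, 255],
]
print(segment_columns(binarize(gray)))  # [(1, 3), (5, 6)]
```

A fixed threshold and empty-column gaps are exactly the kind of hand-tuned rules that break down on touching characters or uneven lighting, which is the scenario-by-scenario tuning burden described above.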

Deep‑Learning Algorithms

After 2012, deep learning became dominant in computer vision, prompting a shift to end‑to‑end or two‑stage text‑recognition pipelines. The two‑stage approach first detects text lines (using regression‑based, segmentation‑based, or hybrid methods) and then recognizes the content (via CTC‑based or attention‑based models). The end‑to‑end approach jointly learns detection and recognition within a single model, improving real‑time performance and allowing mutual reinforcement of the two tasks.
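As an illustration of the CTC‑based recognition branch mentioned above, the following sketch shows greedy (best‑path) CTC decoding: take the most likely class per frame, collapse consecutive repeats, then drop blanks. The class layout (index 0 as blank, index i mapping to `alphabet[i - 1]`) and the toy probabilities are assumptions made for this example.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_probs: Sequence[Sequence[float]],
    alphabet: str,
    blank: int = 0,
) -> str:
    """Best-path CTC decoding for one text line.

    frame_probs[t][k] is the probability of class k at time step t.
    Class `blank` is the CTC blank; class i > 0 maps to alphabet[i - 1].
    """
    best: List[int] = [
        max(range(len(p)), key=lambda k: p[k]) for p in frame_probs
    ]
    chars = []
    prev = None
    for idx in best:
        # Emit a character only when the class changes and is not blank.
        if idx != prev and idx != blank:
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)


# Six frames over the classes [blank, 'a', 'b'].
probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.1, 0.1, 0.8],    # b (repeat, collapsed)
]
print(ctc_greedy_decode(probs, "ab"))  # aab
```

The blank class is what lets CTC output doubled characters: without the blank frame in the middle, the two runs of `a` would collapse into one.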

Text Recognition Framework

Gaode’s current framework consists of three modules: text‑line detection, character‑level detection & recognition, and sequence recognition. The detection module predicts masks to handle vertical, distorted, or curved text; the sequence module recognizes text within detected regions; and the character module supplements the sequence model for artistic or irregular fonts.

Text recognition framework

Text Line Detection

Natural‑scene text varies in shape, scale, orientation, and quality. Gaode improves detection by enhancing a two‑stage instance‑segmentation model in three ways: adding deformable convolutions (DCN) to capture multi‑directional features, enlarging the mask branch, and integrating an ASPP (atrous spatial pyramid pooling) module. Online data augmentation (rotation, flipping, mixup) further boosts generalization. The model outputs both segmentation masks and minimal‑area convex hulls for downstream recognition.
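The step from a predicted mask to a convex hull can be sketched as follows. This example uses Andrew's monotone chain algorithm on the foreground pixel coordinates; it is an assumed, simplified stand‑in for whatever post‑processing Gaode actually runs, and the helper names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def cross(o: Point, a: Point, b: Point) -> int:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in a consistent winding order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it makes a non-left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def mask_to_hull(mask: List[List[int]]) -> List[Point]:
    """Collect (x, y) coordinates of foreground pixels and take their hull."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    return convex_hull(points)


mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
print(mask_to_hull(mask))
# [(0, 1), (1, 0), (2, 0), (3, 1), (2, 2), (1, 2)]
```

The hull gives the recognizer a compact polygon to crop or rectify, while the mask itself keeps the exact shape for curved text.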

Detection results example

Benchmarking on ICDAR2013, ICDAR2017‑MLT, and ICDAR2019‑ReCTS shows competitive scores.

Text line detection competition results

Text Recognition

Gaode evaluates recognition with two business‑oriented metrics: (1) text‑line full‑match rate, the proportion of lines in which every character is recognized correctly and in the correct reading order; and (2) high‑confidence line rate, the proportion of lines whose recognition confidence exceeds 99%.

Full‑match rate evaluates overall POI and road name recognition.

High‑confidence rate measures the algorithm’s ability to isolate highly reliable results.
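Both metrics are simple ratios over recognized lines, as the sketch below shows. The `LineResult` record and the sample data are hypothetical; only the two definitions come from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineResult:
    predicted: str      # recognized text for one detected line
    ground_truth: str   # annotated text in reading order
    confidence: float   # model confidence in [0, 1]


def full_match_rate(results: List[LineResult]) -> float:
    """Share of lines whose prediction equals the ground truth exactly."""
    if not results:
        return 0.0
    return sum(r.predicted == r.ground_truth for r in results) / len(results)


def high_confidence_rate(results: List[LineResult], threshold: float = 0.99) -> float:
    """Share of lines whose confidence is strictly above the threshold."""
    if not results:
        return 0.0
    return sum(r.confidence > threshold for r in results) / len(results)


results = [
    LineResult("高德地图", "高德地图", 0.995),
    LineResult("中山路", "中山路", 0.970),
    LineResult("Starbucks", "Starbucks", 0.999),
    LineResult("图地德高", "高德地图", 0.600),  # right characters, wrong reading order
]
print(full_match_rate(results))       # 0.75
print(high_confidence_rate(results))  # 0.5
```

Exact string equality penalizes a wrong reading order as heavily as a wrong character, which matches the definition of a full match.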

To meet these needs, Gaode combines character‑level detection and recognition (Faster R‑CNN + SENet, supporting more than 7,000 Chinese and English characters) with sequence recognition (TPS‑Inception‑BiLSTM‑Attention). The character model placed second in the ICDAR2019‑ReCTS competition, with a gap of only 0.09% to the winner.
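A character‑level detector returns individual boxes rather than a string, so some step has to order them into a line. The sketch below shows one plausible way to do that; the orientation heuristic, the `CharBox` fields, and the use of the minimum character score as the line confidence are assumptions for illustration, not details given by the source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CharBox:
    char: str     # recognized character
    x: float      # left edge
    y: float      # top edge
    w: float      # width
    h: float      # height
    score: float  # per-character confidence in [0, 1]


def assemble_line(boxes: List[CharBox]) -> Tuple[str, float]:
    """Order character boxes into a text line and derive a line confidence."""
    if not boxes:
        return "", 0.0
    cx = [b.x + b.w / 2 for b in boxes]
    cy = [b.y + b.h / 2 for b in boxes]
    # Treat the line as vertical when box centers spread more along y than x.
    vertical = (max(cy) - min(cy)) > (max(cx) - min(cx))
    ordered = sorted(
        boxes,
        key=lambda b: (b.y + b.h / 2) if vertical else (b.x + b.w / 2),
    )
    text = "".join(b.char for b in ordered)
    # A line is only as reliable as its least certain character.
    confidence = min(b.score for b in ordered)
    return text, confidence


boxes = [
    CharBox("店", 70, 10, 30, 30, 0.91),
    CharBox("咖", 0, 10, 30, 30, 0.98),
    CharBox("啡", 35, 12, 30, 30, 0.95),
]
print(assemble_line(boxes))  # ('咖啡店', 0.91)
```

This sorts left to right or top to bottom only; right‑to‑left signage or multi‑line plates would need extra rules.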

Character detection and recognition results

Sequence models such as ASTER and DTRT perform text region rectification, feature extraction, sequence encoding, and attention‑based decoding, handling multi‑directional and distorted text.

General sequence recognition architecture

Gaode’s pipeline applies TPS‑based (thin‑plate spline) perspective correction, rescaling, and padding before feeding images to a CNN‑BiLSTM‑Attention model, achieving strong performance on English, simplified Chinese, and traditional Chinese character sets, especially for artistic and blurry text.
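The rescaling and padding step can be illustrated with the geometry alone. The sketch below computes the resized width for a fixed input height and the right padding needed to reach a fixed input width; the target height of 32 and maximum width of 256 are assumed values, not figures from the source.

```python
from typing import Tuple


def resize_and_pad_shape(
    width: int,
    height: int,
    target_height: int = 32,
    max_width: int = 256,
) -> Tuple[int, int, int]:
    """Return (new_width, new_height, pad_right) for one cropped text line.

    The crop is scaled to target_height with its aspect ratio preserved,
    clipped to max_width, and padded on the right up to max_width.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scaled = round(width * target_height / height)
    new_width = min(max_width, max(1, scaled))
    pad_right = max_width - new_width
    return new_width, target_height, pad_right


print(resize_and_pad_shape(200, 50))   # (128, 32, 128)
print(resize_and_pad_shape(1000, 40))  # (256, 32, 0)
```

Padding instead of stretching keeps character proportions stable across short and long lines. In this sketch, lines wider than `max_width` after scaling are squeezed horizontally, which is one reason very long lines may need a larger width budget.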

Sequence recognition results

Sample Mining & Synthesis

Rare characters and uncommon words appear on road signs and POI plates. Gaode augments training data by (1) mining real images containing such characters from its database for manual annotation, and (2) synthesizing text images using rendering techniques. Mixing real and synthetic samples dramatically improves recognition of scarce characters.
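The text side of this idea can be sketched as follows: find characters that are rare in an annotated corpus, then generate strings that are guaranteed to contain at least one of them. Rendering those strings into images (fonts, backgrounds, distortion) is left out, and the count threshold, function names, and sample corpus are all hypothetical.

```python
import random
from collections import Counter
from typing import List, Set


def rare_characters(corpus: List[str], max_count: int) -> Set[str]:
    """Characters that occur at most max_count times in the corpus."""
    counts = Counter(ch for text in corpus for ch in text)
    return {ch for ch, n in counts.items() if n <= max_count}


def synthesize_texts(
    rare: Set[str],
    common: Set[str],
    n_samples: int,
    length: int,
    rng: random.Random,
) -> List[str]:
    """Build strings of the given length, each with at least one rare character."""
    if length < 1 or not rare or (length > 1 and not common):
        raise ValueError("need length >= 1 and non-empty character sets")
    rare_list, common_list = sorted(rare), sorted(common)
    samples = []
    for _ in range(n_samples):
        chars = [rng.choice(common_list) for _ in range(length - 1)]
        # Insert one rare character at a random position in the string.
        chars.insert(rng.randrange(length), rng.choice(rare_list))
        samples.append("".join(chars))
    return samples


corpus = ["中山路", "中山公园", "中华路", "龘龙阁"]
rare = rare_characters(corpus, max_count=1)
common = set("".join(corpus)) - rare
synthetic = synthesize_texts(rare, common, n_samples=5, length=4, rng=random.Random(0))

# Mix real and synthetic text for training; the ratio here is arbitrary.
training_texts = corpus + synthetic
print(len(training_texts))  # 9
```

Passing an explicit `random.Random` instance keeps the generated set reproducible, which helps when comparing models trained on different real‑to‑synthetic ratios.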

Sample mining and synthesis pipeline

Summary

Through iterative refinement of detection, character, and sequence modules, Gaode’s OCR system meets diverse real‑world requirements, automating more than 70 % of POI data and over 90 % of road information. The approach reduces manual labor, cuts training costs, and demonstrates the practical impact of computer‑vision techniques on large‑scale map production.

Future Development and Challenges

Data Layer

Insufficient annotated data remains a bottleneck. Strategies such as AutoAugment (reinforcement‑learning‑driven augmentation) and synthetic data generation (e.g., Alibaba’s SwapText) are explored to expand the dataset.

Model Layer

Blurred Text Recognition

Blur hampers detection and recognition. Super‑resolution methods like TextSR (SRGAN‑based) and GAN‑integrated detection networks improve clarity without heavy computational overhead.

Text Semantic Understanding

Incorporating language models (e.g., SEED) enables the system to leverage semantic priors, enhancing recognition of ambiguous or complex characters.
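To show how a semantic prior can resolve visually ambiguous characters, the sketch below rescores recognition candidates with a lexicon prior. This is a deliberately simpler technique than SEED, which feeds semantic embeddings into the decoder; the weighting scheme, the floor value for unknown strings, and the sample numbers are assumptions.

```python
import math
from typing import Dict


def rescore(
    candidates: Dict[str, float],
    lexicon_prior: Dict[str, float],
    weight: float = 0.5,
    floor: float = 1e-6,
) -> str:
    """Pick the candidate with the best combined visual and language score.

    candidates maps each candidate string to its recognition probability.
    lexicon_prior maps known names to a prior probability; unknown strings
    fall back to `floor`. `weight` controls how much the prior counts.
    """
    def score(text: str) -> float:
        visual = math.log(candidates[text])
        prior = math.log(lexicon_prior.get(text, floor))
        return visual + weight * prior

    return max(candidates, key=score)


# The recognizer slightly prefers a look-alike character over the real road name.
candidates = {"中山踣": 0.48, "中山路": 0.45}
lexicon_prior = {"中山路": 0.01}

print(rescore(candidates, lexicon_prior, weight=0.0))  # 中山踣 (visual score only)
print(rescore(candidates, lexicon_prior))              # 中山路 (prior breaks the tie)
```

A prior this strong can also overwrite a correct but unusual name, so the weight would need tuning against the full‑match rate on real data.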

Other Directions

Edge deployment is a growing trend to reduce bandwidth and server load. Research focuses on lightweight OCR architectures that retain high accuracy while fitting on‑device constraints.
