How AI Powers POI Signboard Image Retrieval for Map Services
This article explains the challenges of POI signboard image retrieval and describes a multimodal deep-learning solution that combines visual and OCR-based text features. It covers data generation, model architecture, and loss functions, then reports the resulting accuracy and outlines future research directions.
Background
POI (Point of Interest) data is a core component of digital maps, representing restaurants, shops, government offices, tourist attractions, and transportation facilities. While users need only name and location to find destinations, POI data also enables features like "search nearby" and reviews, increasing user engagement. Maintaining up‑to‑date POI data requires efficient processing of large‑scale images, as changes are infrequent but costly to handle manually.
To reduce operational costs, unchanged POIs must be filtered automatically, and image matching is the key technology for this task.
Technical Definition
Image retrieval is defined as: given a query image, search a large gallery for visually similar images. It relies on metric learning to pull together samples of the same class and push apart different classes, using losses such as contrastive, triplet, and center loss. Feature extraction includes global, local, and auxiliary features.
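To make the "pull together, push apart" idea concrete, here is a minimal sketch of a triplet loss on toy embeddings. The margin of 0.3 and the 2-D vectors are illustrative assumptions, not values from the article.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed value here).
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings (hypothetical values, for illustration only).
anchor = [0.0, 0.0]
positive = [0.1, 0.0]  # same signboard, different capture
negative = [1.0, 0.0]  # different signboard

easy = triplet_loss(anchor, positive, negative)  # constraint already satisfied
hard = triplet_loss(anchor, negative, positive)  # roles swapped, constraint violated
```

In the first call the negative is already well separated, so the triplet contributes no gradient; in the second the loss is positive and would push the embeddings apart during training.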
Problem Characteristics
POI signboard retrieval differs from typical tasks due to heterogeneous data sources, severe occlusion, and strong text dependence.
Heterogeneous data : Images come from low‑quality forward‑facing cameras and high‑quality side‑facing cameras, causing large variations in brightness, shape, and clarity.
Severe occlusion : Trees, vehicles, and other objects often block signboards, making alignment difficult.
Text dependence : Small changes in the POI name text should prevent a match, requiring multimodal fusion of visual and textual features.
Technical Solution
The solution consists of data iteration and model optimization.
Data generation : A cold‑start dataset is created automatically using SIFT matching between historical image collections; online human‑verified results are mined for iterative training data.
Multimodal retrieval model : Built on a triplet‑loss metric learning framework, the model ingests both image data and OCR‑extracted text (encoded by BERT) and fuses them for similarity measurement.
Data
Instance-level labeling is expensive, so an automatic pipeline uses SIFT keypoint matching to pair signboards captured on different data-collection passes, keeping only pairs whose inlier count is high enough. Labels produced this way are noisy; to mitigate that, the pipeline applies multi-pass matching and batch sampling, and trains with MDR (Multi-level Distance Regularization) loss [2], a distance-regularized extension of triplet loss.
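The pairing step can be sketched roughly as follows. This is a simplified illustration, not the article's pipeline: it assumes local descriptors have already been extracted (a real system would use 128-D SIFT descriptors), and it counts matches that survive Lowe's ratio test as a stand-in for geometrically verified inliers. The function names, the 0.75 ratio, and the `min_matches` threshold are all hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_a[i]'s nearest neighbour in desc_b
    is clearly closer than its second-nearest (Lowe's ratio test)."""
    matches = []
    for i, d in enumerate(desc_a):
        ranked = sorted((l2(d, e), j) for j, e in enumerate(desc_b))
        if len(ranked) >= 2 and ranked[0][0] < ratio * ranked[1][0]:
            matches.append((i, ranked[0][1]))
    return matches


def is_same_signboard(desc_a, desc_b, min_matches=3, ratio=0.75):
    """Label a pair as positive when enough matches survive the test.
    `min_matches` is an assumed threshold, not a value from the article."""
    return len(ratio_test_matches(desc_a, desc_b, ratio)) >= min_matches


# Toy 2-D "descriptors" (hypothetical data).
desc_a = [[0, 0], [10, 0], [0, 10], [10, 10]]
desc_b_same = [[0.1, 0], [10, 0.1], [0, 10.1], [9.9, 10]]  # slightly perturbed copy
desc_b_diff = [[5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1]]     # ambiguous, clustered points

same = is_same_signboard(desc_a, desc_b_same)
diff = is_same_signboard(desc_a, desc_b_diff)
```

Raising the threshold trades recall for cleaner labels, which matters here because the mined pairs feed directly into training.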
Model
The model includes global and local visual branches and a textual branch.
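Global features : The backbone is enhanced with a spatial attention module, Spatial Group-wise Enhance (SGE) [4], and its final down-sampling block is removed so the last feature map keeps a higher spatial resolution; GeM (generalized-mean) pooling [3] replaces global average pooling for robustness.
Local features : The feature map is split vertically to extract region-wise descriptors; the two images are then aligned by vertical pooling, computing a similarity matrix between their regions, and finding the shortest path through that matrix, as in AlignedReID [1] (see Formula 1).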
Text features : OCR results from multiple frames are concatenated with <SEP> tokens and encoded by BERT; the resulting vector is fused with visual features.
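The pooling and alignment steps above can be sketched in a few lines. The code assumes GeM pooling as defined in [3] and assumes that the shortest-path alignment follows the AlignedReID recurrence [1]; the exponent p = 3, the function names, and the toy inputs are illustrative choices, not details confirmed by the article.

```python
import math


def gem_pool(channel, p=3.0, eps=1e-6):
    """Generalized-mean pooling over one channel's activations.
    p = 1 gives average pooling; larger p approaches max pooling."""
    clamped = [max(v, eps) for v in channel]
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_distance(parts_a, parts_b):
    """Shortest-path distance between two sequences of region descriptors.

    Element distances are normalised to [0, 1) with
    (e^x - 1) / (e^x + 1) = tanh(x / 2), then accumulated along the
    cheapest monotone path from the first pair to the last pair.
    """
    h, w = len(parts_a), len(parts_b)
    d = [[math.tanh(l2(a, b) / 2.0) for b in parts_b] for a in parts_a]
    s = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if i == 0 and j == 0:
                s[i][j] = d[i][j]
            elif i == 0:
                s[i][j] = s[i][j - 1] + d[i][j]
            elif j == 0:
                s[i][j] = s[i - 1][j] + d[i][j]
            else:
                s[i][j] = min(s[i - 1][j], s[i][j - 1]) + d[i][j]
    return s[-1][-1]


# Toy inputs (hypothetical values).
pooled = gem_pool([0.0, 0.0, 0.0, 2.0])          # one strong activation
regions = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]   # three region descriptors
other = [[5.0, 5.0], [6.0, 5.0], [7.0, 5.0]]
near = local_distance(regions, regions)
far = local_distance(regions, other)
```

Because the path may only step to an adjacent cell, it cannot stay on the diagonal, so even identical inputs receive a small non-zero distance; what matters for retrieval is the ranking between candidates.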
Model Performance
The multimodal system achieves >95% accuracy and recall, significantly improving online metrics and inference speed. Difficult cases involving subtle visual differences, occlusion, or lighting are largely resolved after optimization.
Future Development and Challenges
Remaining corner cases will be addressed through semi‑supervised or active learning for data augmentation and by incorporating Transformer‑based architectures for better feature extraction and multimodal fusion.
References
[1] Zhang X, Luo H, Fan X, et al. AlignedReID: Surpassing human‑level performance in person re‑identification. arXiv preprint arXiv:1711.08184, 2017.
[2] Kim Y, Park W. Multi-level distance regularization for deep metric learning. arXiv preprint arXiv:2102.04223, 2021.
[3] Radenović F, Tolias G, Chum O. Fine‑tuning CNN image retrieval with no human annotation. IEEE TPAMI, 2018.
[4] Li X, Hu X, Yang J. Spatial group‑wise enhance: Improving semantic feature learning in convolutional networks. arXiv preprint arXiv:1905.09646, 2019.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
