Advances in Geographic Text Processing for Map Search: Query Analysis, Error Correction, Rewriting, and Omission
Recent advances in map‑search text processing replace rule‑based pipelines with machine‑learning and deep‑learning models for query analysis, error correction, rewriting, and omission. Techniques such as phonetic and spatial entity correction, vector‑based similarity, and CRF sequence labeling operate within a three‑stage architecture of analysis, recall, and ranking to deliver more precise POI results.
Map app functionality can be summarized as positioning, searching, and navigation, which answer the questions of where am I, where to go, and how to get there. In the context of Amap (Gaode) map search, the input consists of a geographic query, the user's location, and the current map view, and the output is the POI the user wants. Accurately returning the desired POI is the most critical metric for search effectiveness and the key driver of user satisfaction.
Typical search engines consist of three stages: query analysis, recall, and ranking. Query analysis aims to understand the meaning of the query and guide the subsequent stages.
Map search query analysis not only involves generic NLP techniques such as tokenization, constituent parsing, synonym handling, and spelling correction, but also incorporates domain‑specific intents such as city analysis, where‑what analysis, and route‑planning analysis.
Common query intents in map scenarios (figure omitted).
Query analysis is a strategy‑intensive part of the search engine and typically leverages a variety of NLP technologies. In map search the text is usually short and the user expects a very small set of highly precise results, which makes accurate text analysis especially challenging.
II. Overall Technical Architecture
Similar to generic retrieval systems, the map retrieval architecture consists of three main components: query analysis, recall, and ranking. The user’s input may express multiple intents, and the system issues parallel recall requests. After obtaining results for each intent, a global decision selects the best outcome.
Query analysis can be divided into basic query analysis and application‑specific query analysis. Basic analysis uses generic NLP techniques (tokenization, constituent parsing, omission detection, synonym handling, spelling correction). Application analysis tackles map‑specific problems such as city identification, where‑what detection, and route‑planning intent recognition.
The technology evolution for geographic text processing has moved from rule‑based methods to machine‑learning approaches, and finally to deep‑learning models. Because the search module serves high‑concurrency online traffic, introducing deep models required careful performance optimization. As those constraints eased, deep learning was gradually incorporated, yielding significant quality gains.
Recent advances in NLP (e.g., BERT, XLNet) have enabled a unified vector representation for all query‑analysis sub‑tasks, allowing multi‑task seq2seq learning while keeping the system lightweight.
III. Evolution of General Query Analysis Techniques
3.1 Error Correction
In search engines, users often submit queries with spelling errors. Directly searching with erroneous queries typically fails to retrieve the intended results, so both general and vertical search engines perform query correction to maximize the probability of recovering the user’s intended query.
In map search, about 6‑10% of user requests contain errors, making query correction a crucial module for improving search experience.
Challenges include handling low‑frequency and long‑tail errors, and leveraging the structured nature of map queries (e.g., address components) to improve correction accuracy.
Common error categories:
(1) Same or similar pinyin (e.g., 盘桥物流园 → 潘桥物流园, where 盘 and 潘 share the pinyin "pan")
(2) Similar glyphs (e.g., 河北冒黎 → 河北昌黎, where 冒 is visually close to 昌)
(3) Missing or extra characters (e.g., 泉州州顶街 → 泉州顶街, with a duplicated 州)
The original correction pipeline combined multiple recall strategies:
Phonetic correction for short queries using identical or fuzzy pinyin.
Spelling correction via character‑level substitution and query frequency filtering.
Combination correction using a translation model trained on aligned query replacement resources.
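The character‑substitution plus frequency‑filtering strategy above can be sketched as follows. This is a minimal toy, not Amap's implementation: candidates one edit away from the query are generated, and only those seen frequently enough in query logs survive (the frequency table and threshold here are made up).

```python
def edits1(word, alphabet):
    """Generate all candidates one edit away: deletes, substitutions, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    subs = [l + c + r[1:] for l, r in splits if r for c in alphabet if c != r[0]]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + subs + inserts)

def correct(query, query_freq, min_freq=100):
    """Return the most frequent known query within one edit, if any."""
    if query_freq.get(query, 0) >= min_freq:
        return query  # already a well-known query, leave it alone
    alphabet = {c for q in query_freq for c in q}
    candidates = [c for c in edits1(query, alphabet) if query_freq.get(c, 0) >= min_freq]
    return max(candidates, key=query_freq.get) if candidates else query

freq = {"泉州顶街": 500, "泉州西街": 900}
print(correct("泉州州顶街", freq))  # duplicated 州 removed → 泉州顶街
```

Real systems generate candidates far more selectively (e.g., only pinyin‑confusable substitutions), since the full edit neighborhood explodes for long queries.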
Combination correction follows the noisy‑channel translation model: the best correction is c* = argmax_c P(c | q) = argmax_c P(q | c) · P(c), where P(q | c) is the error (translation) model learned from the aligned query‑replacement pairs and P(c) is a language model over historical queries.
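A toy illustration of translation‑model scoring (all probability values here are invented for illustration): a candidate is ranked by error‑model probability times query language‑model probability, in log space.

```python
import math

# Toy probabilities (illustrative only): P(q|c) from aligned replacement
# pairs, P(c) from query-log frequencies.
error_model = {("盘桥物流园", "潘桥物流园"): 0.3,   # same-pinyin substitution
               ("盘桥物流园", "盘桥物流园"): 0.6}   # keep query unchanged
language_model = {"潘桥物流园": 1e-5, "盘桥物流园": 1e-8}

def score(query, candidate):
    """log P(q|c) + log P(c): the noisy-channel objective."""
    return (math.log(error_model.get((query, candidate), 1e-12))
            + math.log(language_model.get(candidate, 1e-12)))

cands = ["潘桥物流园", "盘桥物流园"]
best = max(cands, key=lambda c: score("盘桥物流园", c))
print(best)  # → 潘桥物流园: the higher LM probability outweighs the edit cost
```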
Identified problems:
Recall strategies struggled with low‑frequency cases.
Ranking was fragmented across independent modules, leading to sub‑optimal overall ordering.
Technical upgrades:
Spatial‑relation based entity correction using a geographic entity knowledge base with prefix‑tree and suffix‑tree indexes, enabling precise correction of low‑frequency district or entity errors.
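The suffix‑indexed entity lookup can be sketched with a plain dictionary standing in for the suffix tree (the knowledge base below is a toy; a real index would also encode spatial relations between entities).

```python
from collections import defaultdict

# A toy geographic-entity knowledge base (illustrative names).
entities = ["潘桥物流园", "潘桥街道", "昌黎县", "泉州顶街"]

# Suffix index: map each entity suffix to the entities ending with it,
# a dictionary stand-in for the suffix tree described above.
suffix_index = defaultdict(set)
for e in entities:
    for i in range(len(e)):
        suffix_index[e[i:]].add(e)

def hamming1(a, b):
    """True if equal-length strings differ in at most one character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def correct_entity(query):
    """Find an entity sharing the query's longest suffix, within one substitution."""
    for i in range(1, len(query)):           # try progressively shorter suffixes
        for cand in suffix_index.get(query[i:], ()):
            if hamming1(query, cand):
                return cand
    return query

print(correct_entity("盘桥物流园"))  # → 潘桥物流园
```

Because the lookup is keyed on entity structure rather than query frequency, it can correct entities that appear rarely or never in the logs, which is exactly where frequency‑based recall fails.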
Re‑architected ranking: decoupled recall from ranking and introduced a global pair‑wise GBRank model trained on online‑generated samples.
Features for the new ranking model include semantic features (language model scores), popularity features (PV, clicks), and basic features (edit distance, tokenization, distribution statistics).
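Pair‑wise training data for a GBRank‑style model can be built from online behavior: a clicked correction should outrank every skipped one. A minimal sketch of the sample construction, with invented feature values ([language‑model score, click PV, edit distance]):

```python
# Pairwise sample construction for a GBRank-style model (toy features).
# Each candidate correction has [lm_score, click_pv, edit_distance] features;
# labels come from online behavior: clicked corrections beat skipped ones.
def make_pairs(candidates):
    """Yield (preferred_features, other_features) for every clicked/skipped pair."""
    clicked = [c for c in candidates if c["clicked"]]
    skipped = [c for c in candidates if not c["clicked"]]
    return [(a["features"], b["features"]) for a in clicked for b in skipped]

candidates = [
    {"features": [0.9, 1200, 1], "clicked": True},   # e.g. 潘桥物流园
    {"features": [0.2, 15, 1],   "clicked": False},  # query left unchanged
    {"features": [0.1, 3, 2],    "clicked": False},
]
pairs = make_pairs(candidates)
print(len(pairs))  # 1 clicked × 2 skipped → 2 preference pairs
```

The gradient‑boosted model is then trained so that the first element of each pair scores higher than the second, giving one global ordering across all recall strategies instead of per‑module rankings.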
These improvements addressed low‑frequency correction challenges and created a more modular pipeline ready for future deep‑learning upgrades.
3.2 Query Rewriting
Correction alone cannot handle many low‑frequency queries that are semantically similar to high‑frequency ones (e.g., "永城市新农合办" → "永城市新农合服务大厅", an abbreviated office name rewritten to the full name of the corresponding service hall). To bridge this gap, a query rewriting module rewrites rare queries into semantically similar frequent ones.
The solution consists of three stages: recall, ranking, and filtering.
Recall uses sentence embeddings (SIF) and the Faiss vector search engine (or an internal high‑performance engine) to retrieve candidate high‑frequency queries.
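The SIF‑embedding recall step can be sketched in NumPy. This is a toy with random word vectors and invented frequencies; the principal‑component removal step of full SIF is omitted, and brute‑force cosine search stands in for a Faiss index.

```python
import numpy as np

# Toy word vectors and frequencies; SIF weights are a/(a + p(w)).
rng = np.random.default_rng(0)
vocab = ["永城市", "新农合", "办", "服务", "大厅"]
vecs = {w: rng.normal(size=8) for w in vocab}
freq = {"永城市": 0.01, "新农合": 0.005, "办": 0.05, "服务": 0.04, "大厅": 0.03}

def sif_embed(tokens, a=1e-3):
    """Smooth inverse frequency embedding (principal-component removal omitted)."""
    weights = [a / (a + freq[t]) for t in tokens]
    emb = sum(w * vecs[t] for w, t in zip(weights, tokens)) / len(tokens)
    return emb / np.linalg.norm(emb)

# Brute-force cosine search standing in for a Faiss index over
# high-frequency queries.
corpus = [["永城市", "新农合", "服务", "大厅"], ["永城市", "办"]]
index = np.stack([sif_embed(q) for q in corpus])
query = sif_embed(["永城市", "新农合", "办"])
best = int(np.argmax(index @ query))
print(corpus[best])
```

Rare tokens like 新农合 get the largest SIF weights, so the low‑frequency query lands nearest the high‑frequency candidate that shares them.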
Ranking builds training samples from original‑query/high‑frequency‑candidate pairs, computes semantic similarity, and uses XGBoost regression with features such as basic text features, edit distance, and combined features.
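Edit distance is one of the text features mentioned above; a minimal sketch of how such features might be computed for a query/candidate pair (feature names are illustrative, not Amap's actual feature set):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def rewrite_features(query, candidate):
    """A few text features fed to the ranking model (names illustrative)."""
    d = edit_distance(query, candidate)
    return {"edit_distance": d,
            "len_diff": abs(len(query) - len(candidate)),
            "norm_edit": d / max(len(query), len(candidate))}

print(rewrite_features("永城市新农合办", "永城市新农合服务大厅"))
```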
Filtering mitigates over‑generalization of vector recall by applying alignment models (GIZA and FastAlign, with FastAlign chosen for speed) to ensure the rewritten query remains faithful to the original intent.
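The alignment‑based filter can be approximated by a token‑coverage check: every core token of the original query must have a counterpart in the rewrite, either verbatim or via an alignment table mined offline. This is a simplified stand‑in for the FastAlign‑based check, with a toy alignment table.

```python
# Toy faithfulness filter standing in for the FastAlign-based check:
# every core token of the original must map to some token of the rewrite,
# either verbatim or via an alignment table mined offline (illustrative).
aligned = {"办": {"服务大厅", "办事处"}}  # tokens known to align across query pairs

def is_faithful(orig_tokens, rewrite_tokens, stopwords=("的",)):
    rewrite = set(rewrite_tokens)
    for t in orig_tokens:
        if t in stopwords:
            continue
        if t not in rewrite and not (aligned.get(t, set()) & rewrite):
            return False  # an original core token has no counterpart: reject
    return True

print(is_faithful(["永城市", "新农合", "办"], ["永城市", "新农合", "服务大厅"]))  # True
print(is_faithful(["永城市", "新农合", "办"], ["商丘市", "新农合", "服务大厅"]))  # False
```

The second rewrite is rejected because the city token 永城市 has no counterpart, which is exactly the over‑generalization that pure vector similarity lets through.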
This rewriting pipeline fills the gaps left by correction and synonym expansion, providing a more robust handling of low‑frequency expressions.
3.3 Query Omission
Many map queries contain stop‑words or irrelevant terms that hinder effective recall (e.g., "厦门市搜‘湖里区县后高新技术园新捷创运营中心11楼1101室 县后brt站’", a Xiamen query that embeds the verb 搜 ("search") and a room‑level address alongside the BRT station name). An omission module identifies and removes such terms so that recall can focus on the core terms.
The key challenge is balancing the prior decision of which terms to omit against the posterior effect on POI recall: dropping too little leaves noise in the query, while dropping too much loses the user's intent.
The original rule‑based omission relied heavily on upstream constituent analysis, limiting robustness.
Technical upgrade replaced rules with a CRF sequence‑labeling model, supplemented by deep‑learning generated samples. Features include weighted term features, part‑of‑speech tags, dictionary cues, constituent analysis, and entropy‑based statistical cues.
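A sketch of the per‑token feature extraction such a sequence labeler consumes (feature names, weights, and POS tags below are illustrative; the trained CRF would map these features to keep/omit labels):

```python
def token_features(tokens, i, term_weight, pos_tags):
    """Per-token features for the sequence labeler (feature names illustrative)."""
    t = tokens[i]
    feats = {
        "bias": 1.0,
        "term": t,
        "weight": term_weight.get(t, 0.0),   # weighted-term feature
        "pos": pos_tags[i],                  # part-of-speech cue
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    if i > 0:
        feats["prev_pos"] = pos_tags[i - 1]  # context feature, as in linear-chain CRFs
    return feats

tokens = ["厦门市", "搜", "县后", "brt", "站"]
pos = ["ns", "v", "ns", "x", "n"]
weights = {"厦门市": 0.8, "搜": 0.05, "县后": 0.9, "brt": 0.7, "站": 0.6}
feats = [token_features(tokens, i, weights, pos) for i in range(len(tokens))]
print(feats[1]["weight"])  # the low weight flags "搜" as an omission candidate
```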
Sample construction started with coarse online labeling and outsourced fine labeling, yielding tens of thousands of examples; bootstrapping with deep models expanded this to millions.
The CRF‑based omission model, enriched with non‑constituent features and large bootstrapped samples, significantly improved robustness and recall quality.
Overall, the article presents the evolution of geographic text processing techniques in map search, covering error correction, query rewriting, and omission, and demonstrates how rule‑based systems have been progressively upgraded to machine‑learning and deep‑learning solutions.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.