Query Understanding and Processing in E‑commerce Search Systems
This article explains the end‑to‑end pipeline of query understanding for e‑commerce search, covering preprocessing, segmentation, spell correction, normalization, and expansion, and discusses both academic research and industry implementations with examples and references.
In e‑commerce search, user queries are extremely short and often contain spelling errors, ambiguities, or inaccurate expressions, so precise query understanding is crucial for search effectiveness.
1. Query Preprocessing
Preprocessing is relatively simple and rule‑based, preparing the query for downstream modules. Typical steps include:
Operational Review & Intervention: manual review, replacement, or other interventions for bad cases.
Normalization: case conversion, simplified/traditional Chinese conversion, half-width/full-width conversion, removal of symbols and emojis.
Length Truncation: cutting overly long queries.
Other: business-specific, case-by-case strategies.
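As a concrete illustration, the normalization and truncation steps above can be sketched in Python. This is a minimal sketch: the length limit and symbol-stripping pattern are illustrative assumptions, and simplified/traditional conversion is omitted because it needs an external mapping table.

```python
import re
import unicodedata

MAX_QUERY_LEN = 64  # illustrative truncation limit, not a fixed standard


def preprocess(query: str) -> str:
    """Rule-based query preprocessing: normalization + truncation."""
    # Half-width/full-width conversion via Unicode compatibility folding
    q = unicodedata.normalize("NFKC", query)
    # Case conversion
    q = q.lower()
    # Remove symbols and emojis (keep CJK, letters, digits, whitespace)
    q = re.sub(r"[^\w\s\u4e00-\u9fff]", "", q)
    # Collapse whitespace, then cut overly long queries
    q = " ".join(q.split())
    return q[:MAX_QUERY_LEN]


print(preprocess("ＩＰｈｏｎｅ　１５！！"))  # → 'iphone 15'
```

Each step is a cheap, deterministic rule, which is why preprocessing usually runs first and unconditionally before any model-based module.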
2. Query Segmentation
Segmentation splits a query into multiple terms (e.g., "手机淘宝" → "手机 淘宝"). While English can be tokenized by spaces, Chinese segmentation is more complex and is a fundamental NLP task. Practitioners often use open‑source tools such as Jieba, HanLP, PyLTP, or LAC, or rely on dedicated internal platforms.
Review of Chinese Word Segmentation Studies [1]
NLP Segmentation Algorithm Survey [2]
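A classic dictionary-based baseline behind many of these tools is forward maximum matching: at each position, greedily take the longest string that appears in the dictionary. A minimal sketch (the toy vocabulary here is illustrative):

```python
def fmm_segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: greedily match the longest dictionary word."""
    result, i = [], 0
    while i < len(text):
        # Try the longest window first, shrinking until a dictionary hit;
        # fall back to a single character if nothing matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                result.append(text[i:j])
                i = j
                break
    return result


vocab = {"手机", "淘宝", "手机壳"}
print(fmm_segment("手机淘宝", vocab))  # → ['手机', '淘宝']
```

Production segmenters combine dictionary matching with statistical models (HMMs, CRFs, neural taggers) to handle out-of-vocabulary words, which pure maximum matching cannot.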
3. Query Rewriting
Query rewriting enriches the original query to improve recall and relevance, addressing synonyms, user errors, and ambiguous expressions. It is divided into three sub‑modules:
Query Spell Correction
Query Normalization
Query Expansion
3.1 Query Spell Correction
Spelling errors arise from user input habits and can degrade recall and ranking. The typical pipeline includes error detection and error correction.
3.1.1 Common Errors
Different businesses categorize errors differently; for example, Tencent classifies errors based on whether the query contains out‑of‑vocabulary words.
3.1.2 Technical Solutions
Solutions are split into pipeline and end‑to‑end approaches.
3.1.2.1 Pipeline Methods
The pipeline separates detection and correction.
Error Detection: dictionary lookup, n‑gram language models, or sequence‑labeling models (e.g., Bi‑LSTM‑CRF, BERT‑CRF).
Error Correction: candidate generation via edit distance, HMMs, or deep models, followed by candidate ranking.
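The candidate-generation and ranking steps can be sketched Norvig-style: enumerate all strings within edit distance 1, keep those in the vocabulary, and rank by frequency (the frequency table standing in for a language model is an illustrative assumption):

```python
def edits1(word: str, alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> set[str]:
    """All strings within edit distance 1: delete/transpose/replace/insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, freq: dict[str, int]) -> str:
    """Detection via dictionary lookup, then rank in-vocabulary candidates."""
    if word in freq:
        return word  # known word: no correction needed
    candidates = edits1(word) & freq.keys()
    return max(candidates, key=freq.get, default=word)


freq = {"iphone": 1000, "phone": 800, "ipod": 300}
print(correct("iphnoe", freq))  # transposition fixed → 'iphone'
```

Real systems replace the frequency lookup with an n-gram or neural language model scored over the surrounding query context, but the generate-then-rank structure is the same.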
3.1.2.2 End‑to‑End Methods
End‑to‑end models jointly optimize detection and correction.
Soft‑Mask BERT (ByteDance AI Lab) : a Bi‑GRU detector predicts error probabilities, which are used to soft‑mask embeddings before a BERT‑based correction network.
SpellGCN (Ant Financial) : a graph convolutional network leverages phonetic and visual similarity of Chinese characters for spelling correction.
PLOME (Tencent) : incorporates pinyin and stroke sequences via GRU encoders into BERT, predicting both character and correct pinyin probabilities.
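The core idea of Soft-Mask BERT is that the detector's error probabilities gate the input embeddings: the higher the error probability at a position, the closer that position's embedding moves to the [MASK] embedding. A minimal numeric sketch with toy 2-dimensional vectors (no actual BERT or Bi-GRU involved):

```python
def soft_mask(emb, mask_emb, p_err):
    """Soft-masking: e_i' = p_i * e_mask + (1 - p_i) * e_i."""
    return [
        [p * m + (1 - p) * e for e, m in zip(vec, mask_emb)]
        for vec, p in zip(emb, p_err)
    ]


# Toy embeddings for a 3-token query; p_err plays the role of the
# Bi-GRU detector's per-token error probabilities.
emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mask_emb = [0.0, 0.0]
p_err = [0.0, 0.9, 0.2]  # token 2 is likely an error
print(soft_mask(emb, mask_emb, p_err))
```

Because the interpolation is differentiable, detection and correction train jointly end-to-end, unlike the hard detect-then-correct pipeline.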
3.1.3 Industry Implementations
Real‑world cases include Baidu's Chinese correction technology, HIT‑iFLY text correction, Ping An Life AI correction, Alibaba's ASR correction, Didi's NLU exploration, and Meituan's query rewriting pipelines.
Baidu: Chinese Spell Correction
HIT‑iFLY Text Correction System [3]
Ping An Life AI: Text Correction Technology [4]
Alibaba: ASR Correction in Voice Dialogue
XiaoAi: BERT‑based ASR Correction
Didi: Voice Interaction NLU Exploration
FluentU: Automatic Grammar Correction
3.2 Query Normalization
Normalization (or synonym substitution) maps long‑tail or non‑standard queries to popular standard forms, improving recall. Techniques include knowledge‑base rule mining, behavior‑driven unsupervised embeddings, and deep matching or seq2seq models to align semantically similar query pairs.
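At serving time, the simplest high-precision form of this is a mined rewrite dictionary applied to the whole query, falling back to term-level synonym substitution. A minimal sketch (both tables here are illustrative stand-ins for offline-mined pairs):

```python
# Query-level rewrites (long-tail form -> popular standard form) and
# term-level synonyms; in practice both come from offline mining.
QUERY_REWRITES = {"apple cell phone": "iphone"}
TERM_SYNONYMS = {"cellphone": "phone", "tee": "t-shirt"}


def normalize(query: str) -> str:
    """Map a non-standard query to its popular standard form."""
    if query in QUERY_REWRITES:  # exact, high-precision rewrite
        return QUERY_REWRITES[query]
    # Fallback: per-term synonym substitution
    return " ".join(TERM_SYNONYMS.get(t, t) for t in query.split())


print(normalize("apple cell phone"))  # → 'iphone'
print(normalize("red tee"))           # → 'red t-shirt'
```

The deep matching and seq2seq models mentioned above generalize this lookup: they score or generate standard forms for query pairs the dictionary has never seen.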
3.3 Query Expansion
When a query is vague, expansion uncovers latent user intent and broadens recall. Expansion can operate at the term level (replacing terms with synonyms) or at the full‑query level (generating alternative queries). A typical pipeline, as in Meituan's practice [7], has three stages:
Offline mining of millions of candidate phrases from logs, translation, graph embeddings, and word vectors.
Filtering candidates with a BERT‑based semantic discriminator.
Online deployment using four strategies: high‑precision dictionary rewrite, SMT + XGBoost model rewrite, reinforcement‑learning NMT for long‑tail queries, and vector‑based recall for merchant search.
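Term-level expansion in the pipeline above can be sketched as a Cartesian product over per-term synonym sets, capped before the semantic filter (the synonym table and cap are illustrative):

```python
from itertools import islice, product


def expand(query: str, synonyms: dict[str, list[str]], limit: int = 10) -> list[str]:
    """Generate alternative queries by substituting each term's synonyms."""
    options = [[t] + synonyms.get(t, []) for t in query.split()]
    # Cartesian product over per-term options; cap to keep recall cheap,
    # then drop the original query from the candidates.
    alts = (" ".join(combo) for combo in product(*options))
    return [q for q in islice(alts, limit) if q != query]


syn = {"sneakers": ["trainers", "running shoes"], "cheap": ["budget"]}
print(expand("cheap sneakers", syn))
```

In the full pipeline, these candidates would then pass through the BERT-based semantic discriminator before any of them reach the online rewrite strategies.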
References
[1] Review of Chinese Word Segmentation Studies: https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I2/3/1
[2] NLP Segmentation Algorithm Survey: https://zhuanlan.zhihu.com/p/50444885
[3] HIT‑iFLY Text Correction System: http://cogskl.iflytek.com/archives/1306
[4] Ping An Life AI Text Correction: https://zhuanlan.zhihu.com/p/159101860
[5] DXY: Query Expansion Techniques in Search: https://zhuanlan.zhihu.com/p/138551957
[6] DXY: Query Expansion Techniques (Part 2): https://zhuanlan.zhihu.com/p/296504323
[7] Meituan: Exploration and Practice of Query Rewriting: https://tech.meituan.com/2022/02/17/exploration-and-practice-of-query-rewriting-in-meituan-search.htm
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.