Query Understanding and Processing in E‑commerce Search Systems
This article explains the end‑to‑end pipeline of query understanding for e‑commerce search, covering preprocessing, segmentation, spell correction, normalization, and expansion, and discusses both academic research and industry implementations with examples and references.
In e‑commerce search, user queries are extremely short and often contain spelling errors, ambiguities, or inaccurate expressions, so precise query understanding is crucial for search effectiveness.
1. Query Preprocessing
Preprocessing is relatively simple and rule‑based, preparing the query for downstream modules. Typical steps include:
Operational Review & Intervention: manual review, replacement, or other interventions for bad cases.
Normalization: case conversion, simplified/traditional Chinese conversion, half-width/full-width conversion, removal of symbols and emojis.
Length Truncation: cutting overly long queries.
Other: business-specific, case-by-case strategies.
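As a concrete illustration, the normalization and truncation steps above can be sketched in Python. This is a minimal sketch: the length limit and symbol-stripping pattern are illustrative assumptions, and simplified/traditional conversion is omitted because it needs an external mapping table.

```python
import re
import unicodedata

MAX_QUERY_LEN = 64  # illustrative truncation limit, not a fixed standard


def preprocess(query: str) -> str:
    """Rule-based query preprocessing: normalization + truncation."""
    # Half-width/full-width conversion via Unicode compatibility folding
    q = unicodedata.normalize("NFKC", query)
    # Case conversion
    q = q.lower()
    # Remove symbols and emojis (keep CJK, letters, digits, whitespace)
    q = re.sub(r"[^\w\s\u4e00-\u9fff]", "", q)
    # Collapse whitespace, then cut overly long queries
    q = " ".join(q.split())
    return q[:MAX_QUERY_LEN]


print(preprocess("ＩＰｈｏｎｅ　１５！！"))  # → 'iphone 15'
```

Each step is a cheap, deterministic rule, which is why preprocessing usually runs first and unconditionally before any model-based module.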
2. Query Segmentation
Segmentation splits a query into multiple terms (e.g., "手机淘宝" → "手机 淘宝"). While English can be tokenized by spaces, Chinese segmentation is more complex and is a fundamental NLP task. Practitioners often use open‑source tools such as Jieba, HanLP, PyLTP, or LAC, or rely on dedicated internal platforms.
Review of Chinese Word Segmentation Studies [1]
NLP Segmentation Algorithm Survey [2]
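A classic dictionary-based baseline behind many of these tools is forward maximum matching: at each position, greedily take the longest string that appears in the dictionary. A minimal sketch (the toy vocabulary here is illustrative):

```python
def fmm_segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: greedily match the longest dictionary word."""
    result, i = [], 0
    while i < len(text):
        # Try the longest window first, shrinking until a dictionary hit;
        # fall back to a single character if nothing matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                result.append(text[i:j])
                i = j
                break
    return result


vocab = {"手机", "淘宝", "手机壳"}
print(fmm_segment("手机淘宝", vocab))  # → ['手机', '淘宝']
```

Production segmenters combine dictionary matching with statistical models (HMMs, CRFs, neural taggers) to handle out-of-vocabulary words, which pure maximum matching cannot.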
3. Query Rewriting
Query rewriting enriches the original query to improve recall and relevance, addressing synonyms, user errors, and ambiguous expressions. It is divided into three sub‑modules:
Query Spell Correction
Query Normalization
Query Expansion
3.1 Query Spell Correction
Spelling errors arise from user input habits and can degrade recall and ranking. The typical pipeline includes error detection and error correction.
3.1.1 Common Errors
Different businesses categorize errors differently; for example, Tencent classifies errors based on whether the query contains out‑of‑vocabulary words.
3.1.2 Technical Solutions
Solutions are split into pipeline and end‑to‑end approaches.
3.1.2.1 Pipeline Methods
The pipeline separates detection and correction.
Error Detection: dictionary lookup, n‑gram language models, or sequence‑labeling models (e.g., Bi‑LSTM‑CRF, BERT‑CRF).
Error Correction: candidate generation via edit distance, HMMs, or deep models, followed by candidate ranking.
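The candidate-generation and ranking steps can be sketched Norvig-style: enumerate all strings within edit distance 1, keep those in the vocabulary, and rank by frequency (the frequency table standing in for a language model is an illustrative assumption):

```python
def edits1(word: str, alphabet: str = "abcdefghijklmnopqrstuvwxyz") -> set[str]:
    """All strings within edit distance 1: delete/transpose/replace/insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, freq: dict[str, int]) -> str:
    """Detection via dictionary lookup, then rank in-vocabulary candidates."""
    if word in freq:
        return word  # known word: no correction needed
    candidates = edits1(word) & freq.keys()
    return max(candidates, key=freq.get, default=word)


freq = {"iphone": 1000, "phone": 800, "ipod": 300}
print(correct("iphnoe", freq))  # transposition fixed → 'iphone'
```

Real systems replace the frequency lookup with an n-gram or neural language model scored over the surrounding query context, but the generate-then-rank structure is the same.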
3.1.2.2 End‑to‑End Methods
End‑to‑end models jointly optimize detection and correction.
Soft‑Mask BERT (ByteDance AI Lab) : a Bi‑GRU detector predicts error probabilities, which are used to soft‑mask embeddings before a BERT‑based correction network.
SpellGCN (Ant Financial) : a graph convolutional network leverages phonetic and visual similarity of Chinese characters for spelling correction.
PLOME (Tencent) : incorporates pinyin and stroke sequences via GRU encoders into BERT, predicting both character and correct pinyin probabilities.
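The core idea of Soft-Mask BERT is that the detector's error probabilities gate the input embeddings: the higher the error probability at a position, the closer that position's embedding moves to the [MASK] embedding. A minimal numeric sketch with toy 2-dimensional vectors (no actual BERT or Bi-GRU involved):

```python
def soft_mask(emb, mask_emb, p_err):
    """Soft-masking: e_i' = p_i * e_mask + (1 - p_i) * e_i."""
    return [
        [p * m + (1 - p) * e for e, m in zip(vec, mask_emb)]
        for vec, p in zip(emb, p_err)
    ]


# Toy embeddings for a 3-token query; p_err plays the role of the
# Bi-GRU detector's per-token error probabilities.
emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mask_emb = [0.0, 0.0]
p_err = [0.0, 0.9, 0.2]  # token 2 is likely an error
print(soft_mask(emb, mask_emb, p_err))
```

Because the interpolation is differentiable, detection and correction train jointly end-to-end, unlike the hard detect-then-correct pipeline.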
3.1.3 Industry Implementations
Real‑world cases include Baidu's Chinese correction technology, HIT‑iFLY text correction, Ping An Life AI correction, Alibaba's ASR correction, Didi's NLU exploration, and Meituan's query rewriting pipelines.
Baidu: Chinese Spell Correction
HIT‑iFLY Text Correction System [3]
Ping An Life AI: Text Correction Technology [4]
Alibaba: ASR Correction in Voice Dialogue
XiaoAi: BERT‑based ASR Correction
Didi: Voice Interaction NLU Exploration
FluentU: Automatic Grammar Correction
3.2 Query Normalization
Normalization (or synonym substitution) maps long‑tail or non‑standard queries to popular standard forms, improving recall. Techniques include knowledge‑base rule mining, behavior‑driven unsupervised embeddings, and deep matching or seq2seq models to align semantically similar query pairs.
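At serving time, the simplest high-precision form of this is a mined rewrite dictionary applied to the whole query, falling back to term-level synonym substitution. A minimal sketch (both tables here are illustrative stand-ins for offline-mined pairs):

```python
# Query-level rewrites (long-tail form -> popular standard form) and
# term-level synonyms; in practice both come from offline mining.
QUERY_REWRITES = {"apple cell phone": "iphone"}
TERM_SYNONYMS = {"cellphone": "phone", "tee": "t-shirt"}


def normalize(query: str) -> str:
    """Map a non-standard query to its popular standard form."""
    if query in QUERY_REWRITES:  # exact, high-precision rewrite
        return QUERY_REWRITES[query]
    # Fallback: per-term synonym substitution
    return " ".join(TERM_SYNONYMS.get(t, t) for t in query.split())


print(normalize("apple cell phone"))  # → 'iphone'
print(normalize("red tee"))           # → 'red t-shirt'
```

The deep matching and seq2seq models mentioned above generalize this lookup: they score or generate standard forms for query pairs the dictionary has never seen.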
3.3 Query Expansion
When a query is vague, expansion uncovers latent user intent and broadens recall. Expansion can operate at the term level (replacing terms with synonyms) or at the full‑query level (generating alternative queries). A typical pipeline, as in Meituan's practice [7], has three stages:
Offline mining of millions of candidate phrases from logs, translation, graph embeddings, and word vectors.
Filtering candidates with a BERT‑based semantic discriminator.
Online deployment using four strategies: high‑precision dictionary rewrite, SMT + XGBoost model rewrite, reinforcement‑learning NMT for long‑tail queries, and vector‑based recall for merchant search.
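Term-level expansion in the pipeline above can be sketched as a Cartesian product over per-term synonym sets, capped before the semantic filter (the synonym table and cap are illustrative):

```python
from itertools import islice, product


def expand(query: str, synonyms: dict[str, list[str]], limit: int = 10) -> list[str]:
    """Generate alternative queries by substituting each term's synonyms."""
    options = [[t] + synonyms.get(t, []) for t in query.split()]
    # Cartesian product over per-term options; cap to keep recall cheap,
    # then drop the original query from the candidates.
    alts = (" ".join(combo) for combo in product(*options))
    return [q for q in islice(alts, limit) if q != query]


syn = {"sneakers": ["trainers", "running shoes"], "cheap": ["budget"]}
print(expand("cheap sneakers", syn))
```

In the full pipeline, these candidates would then pass through the BERT-based semantic discriminator before any of them reach the online rewrite strategies.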
References
[1] Review of Chinese Word Segmentation Studies: https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I2/3/1
[2] NLP Segmentation Algorithm Survey: https://zhuanlan.zhihu.com/p/50444885
[3] HIT‑iFLY Text Correction System: http://cogskl.iflytek.com/archives/1306
[4] Ping An Life AI Text Correction: https://zhuanlan.zhihu.com/p/159101860
[5] DXY: Query Expansion Techniques in Search: https://zhuanlan.zhihu.com/p/138551957
[6] DXY: Query Expansion Techniques (Part 2): https://zhuanlan.zhihu.com/p/296504323
[7] Meituan: Exploration and Practice of Query Rewriting: https://tech.meituan.com/2022/02/17/exploration-and-practice-of-query-rewriting-in-meituan-search.htm
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.