
Query Understanding in JD Daojia E‑commerce Search: Architecture, Core Algorithms, and Experimental Results

This article presents a comprehensive overview of JD Daojia's query understanding system for e‑commerce search: its overall architecture; core modules such as tokenization, term weighting, query rewriting, and intent detection; the algorithms employed; experimental evaluations; and future directions.

Dada Group Technology

1. Introduction

Search is the primary traffic entry point for the JD Daojia app, covering entry points such as homepage, in‑store, channel, and mini‑program searches. Accurately understanding user queries and ranking the most relevant results at the top are critical to the search experience. Query understanding in e‑commerce involves lexical, syntactic, and semantic parsing to transform raw queries into structured representations that feed both retrieval and ranking modules.

2. Overall Architecture

The search pipeline proceeds from query understanding to retrieval and then ranking. Query understanding provides features for both recall and ranking, influencing overall system intelligence. Typical modules include preprocessing, correction, expansion, normalization, suggestion, segmentation, intent recognition, term importance analysis, and sensitive query detection. JD Daojia’s O2O scenario adds category inclination and store demand identification. The flow diagram is shown below.

The example query "康师傅红烧方便面" (Master Kong braised instant noodles) is processed through segmentation, preprocessing, term weighting, rewriting, entity recognition, and intent identification, yielding structured components such as brand, attribute, and product.

3. Core Algorithms of Query Understanding

3.1 Segmentation

3.1.1 Segmentation Techniques

Segmentation splits a query into terms (e.g., "康师傅|红烧|方便面"). Accuracy directly affects downstream modules like term importance and intent detection. JD Daojia uses a DAG‑based statistical segmentation model with steps: dictionary loading, DAG construction, dynamic programming for maximum‑probability path, and back‑tracking.

- The dictionary is stored as a prefix trie.
- All possible segmentations are represented as a DAG; dynamic programming finds the highest‑probability path based on term frequencies.
- Probabilities are log‑transformed to avoid floating‑point underflow.
- Balancing coarse and fine granularity improves both precision and recall.
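The steps above can be sketched in a few lines. This is a minimal illustration, not the production system: the dictionary and frequencies below are toy values, and a real implementation would load the prefix trie from a large lexicon.

```python
import math

# Toy dictionary: word -> frequency (the real system loads a large prefix trie).
FREQ = {"康师傅": 500, "红烧": 300, "方便面": 400, "方便": 100, "面": 50, "红": 10, "烧": 10}
TOTAL = sum(FREQ.values())

def build_dag(query):
    """For each start index i, list all end indices j such that query[i:j] is in the dictionary."""
    dag = {}
    n = len(query)
    for i in range(n):
        ends = [j + 1 for j in range(i, n) if query[i:j + 1] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def segment(query):
    """Dynamic programming over the DAG for the max log-probability path, then backtracking."""
    dag = build_dag(query)
    n = len(query)
    route = {n: (0.0, n)}
    logtotal = math.log(TOTAL)
    for i in range(n - 1, -1, -1):  # fill routes right-to-left
        route[i] = max(
            (math.log(FREQ.get(query[i:j], 1)) - logtotal + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:  # backtrack along the best path
        j = route[i][1]
        words.append(query[i:j])
        i = j
    return words

print(segment("康师傅红烧方便面"))  # ['康师傅', '红烧', '方便面']
```

Note how the log transform turns the product of term probabilities into a sum, which is both numerically stable and convenient for the `max` in the DP step.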

3.1.2 New‑Word Discovery

Unregistered words are discovered using statistical measures such as pointwise mutual information (cohesion) and left/right neighbor entropy (freedom). Words with high cohesion and entropy are added to the dictionary after manual verification.
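The two statistics can be computed directly from a raw corpus. A minimal sketch, assuming a plain string corpus and simple substring counting (a production version would use suffix arrays or n-gram tables for scale):

```python
import math
from collections import Counter

def pmi(corpus, left, right):
    """Pointwise mutual information of the bigram left+right (internal cohesion)."""
    n = len(corpus)
    p_lr = corpus.count(left + right) / n
    p_l = corpus.count(left) / n
    p_r = corpus.count(right) / n
    return math.log(p_lr / (p_l * p_r)) if p_lr > 0 else float("-inf")

def neighbor_entropy(corpus, word, side="left"):
    """Entropy of the characters adjacent to `word` (freedom of use on that side)."""
    neighbors = Counter()
    start = corpus.find(word)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(word)
        if 0 <= idx < len(corpus):
            neighbors[corpus[idx]] += 1
        start = corpus.find(word, start + 1)
    total = sum(neighbors.values())
    return -sum(c / total * math.log(c / total) for c in neighbors.values()) if total else 0.0
```

A candidate whose PMI exceeds a cohesion threshold and whose left and right entropies both exceed a freedom threshold is promoted to the dictionary, pending manual verification as described above.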

3.2 Term Weighting

Term importance influences retrieval and ranking. Methods include TF‑IDF, static importance (IMP) based on click data, user‑click‑based weighting, and supervised feature‑learning models (LR, XGBoost, LSTM). JD Daojia adopts a pairwise ranking model over feature vectors combining offline features (IQF, IDF, click statistics) with online features (part‑of‑speech tags, semantic embeddings), trained via cross‑entropy loss and gradient descent (preferring batch GD for stability).
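The pairwise objective can be sketched with a linear scorer: for each training pair (more important term, less important term) the model maximizes the probability that the first outscores the second. This is an illustrative RankNet‑style toy, not the production model; the two‑dimensional features and hyper‑parameters are made up for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pairwise(pairs, dim, lr=0.1, epochs=200):
    """Batch gradient descent on a linear scorer w.x with pairwise
    cross-entropy loss over (more_important, less_important) feature pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for hi, lo in pairs:
            diff = sum(w[k] * (hi[k] - lo[k]) for k in range(dim))
            # gradient of -log sigmoid(diff) w.r.t. w is -(1 - sigmoid(diff)) * (hi - lo)
            g = -(1.0 - sigmoid(diff))
            for k in range(dim):
                grad[k] += g * (hi[k] - lo[k])
        for k in range(dim):  # one batch update per epoch (batch GD)
            w[k] -= lr * grad[k] / len(pairs)
    return w

def score(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))

# Toy features per term: [idf, historical click rate]
pairs = [([0.9, 0.8], [0.2, 0.1]), ([0.7, 0.6], [0.3, 0.2])]
w = train_pairwise(pairs, dim=2)
```

Accumulating the full gradient before each update is what the text means by preferring batch gradient descent: every step uses all pairs, so the loss decreases smoothly rather than oscillating as per‑pair (stochastic) updates would.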

3.3 Query Rewriting

Rewriting expands a query into semantically equivalent variants to improve recall. Approaches covered are edit‑distance/pinyin similarity, collaborative filtering on click co‑occurrence, knowledge‑graph synonym substitution, machine‑translation with reinforcement learning, and the proprietary Query2Vec session‑based method.

3.3.1 Our Approach

We combine collaborative filtering (QueryCF), SimRank/SimRank++ graph‑based similarity, Query2Vec session embeddings, and synonym‑based token replacement. Results are weighted by confidence and merged for parallel recall.
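The collaborative‑filtering leg (QueryCF) can be illustrated with click co‑occurrence alone: represent each query by its clicked‑item vector and rank other queries by cosine similarity. The click log below is invented for the example; the production system also merges in the SimRank, Query2Vec, and synonym signals with confidence weights.

```python
from math import sqrt

# Hypothetical click log: query -> {item id: click count}.
clicks = {
    "方便面": {"sku1": 10, "sku2": 4},
    "泡面":   {"sku1": 8,  "sku2": 5},
    "牛奶":   {"sku3": 9},
}

def cosine(a, b):
    """Cosine similarity between two sparse click vectors (dicts)."""
    dot = sum(a[i] * b[i] for i in a if i in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rewrite_candidates(query, topk=3):
    """Rank other queries by click-vector similarity as rewrite candidates."""
    scores = [(q, cosine(clicks[query], v)) for q, v in clicks.items() if q != query]
    return sorted(scores, key=lambda t: -t[1])[:topk]
```

Here "泡面" (a synonym of "方便面") shares almost the same clicked items and so surfaces as the top rewrite, while the unrelated "牛奶" scores zero, which is exactly the behavior the parallel‑recall merge relies on.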

3.3.2 Experimental Effect

Rewriting yields the following relative improvements in online metrics:

| Metric | Relative Lift |
| --- | --- |
| Click‑through rate | 0.33% |
| Conversion rate | 0.4% |
| ARPU | 2.28% |

3.4 Intent Recognition

Intent detection faces challenges such as noisy input, ambiguity, cold‑start, and lack of direct quantitative metrics. The pipeline extracts components (brand, product, attribute, theme) and predicts category inclination using hierarchical multi‑label classification or semantic models fused with click/transaction features.

3.4.1 Component Extraction

Entity recognition uses a Bi‑LSTM+CRF model trained on annotated query data with tags B‑brand, I‑brand, B‑attr, etc., achieving ~93% accuracy.
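Downstream consumers turn the per‑token tag sequence into entity spans. A minimal BIO decoder is sketched below; the tag names mirror the scheme mentioned above, but the exact label set is illustrative.

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, type) spans from B-/I-/O tags emitted by a sequence tagger."""
    entities, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # start of a new entity
            if cur:
                entities.append(("".join(cur), cur_type))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur.append(tok)                 # continue the current entity
        else:                               # O tag or inconsistent I- tag
            if cur:
                entities.append(("".join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        entities.append(("".join(cur), cur_type))
    return entities

tokens = list("康师傅红烧方便面")
tags = ["B-brand", "I-brand", "I-brand", "B-attr", "I-attr",
        "B-product", "I-product", "I-product"]
print(decode_bio(tokens, tags))
```

The CRF layer's transition constraints are what make ill‑formed sequences (e.g. an `I-brand` directly after `B-attr`) unlikely in the first place; the decoder above simply treats any such inconsistency as the end of the current span.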

3.4.2 Category Prediction

Two strategies are employed: hierarchical multi‑label classification using CNNs (with fastText/BERT embeddings) fine‑tuned across label levels, and a fusion model combining click‑based statistical scores with a BERT+GBDT semantic model. The feature set includes semantic similarity scores, price ranges, recall statistics, and token counts.
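The fusion step can be pictured as a per‑category blend of the two score sources. The weighting and threshold below are hypothetical stand‑ins; the production model learns the combination with GBDT over the richer feature set described above.

```python
def fuse_category_scores(click_scores, semantic_scores, alpha=0.6, threshold=0.3):
    """Blend click-statistics scores with semantic-model scores per category.
    alpha and threshold are illustrative hyper-parameters, not production values."""
    cats = set(click_scores) | set(semantic_scores)
    fused = {
        c: alpha * click_scores.get(c, 0.0) + (1 - alpha) * semantic_scores.get(c, 0.0)
        for c in cats
    }
    # Keep only categories whose fused score clears the inclination threshold.
    return {c: s for c, s in fused.items() if s >= threshold}

click = {"instant_noodles": 0.8, "snacks": 0.2}
semantic = {"instant_noodles": 0.9, "beverages": 0.1}
print(fuse_category_scores(click, semantic))
```

Blending addresses the cold‑start weakness of pure click statistics: a category the semantic model strongly supports can still surface even when its click history is thin.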

3.4.3 Evaluation

Precision, recall, and F1 are computed in a multi‑label setting by averaging per‑sample contributions. The intent system reaches 93% overall accuracy, with downstream business metrics showing an 8.42% increase in search ARPU.
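The sample‑averaged computation described above is short enough to state directly; labels are represented as sets per sample.

```python
def multilabel_prf(true_sets, pred_sets):
    """Sample-averaged precision, recall, and F1 for multi-label prediction."""
    p = r = 0.0
    n = len(true_sets)
    for t, y in zip(true_sets, pred_sets):
        inter = len(t & y)
        p += inter / len(y) if y else 0.0  # per-sample precision
        r += inter / len(t) if t else 0.0  # per-sample recall
    p, r = p / n, r / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: two queries, each with gold and predicted category label sets.
truth = [{"noodles", "snacks"}, {"dairy"}]
preds = [{"noodles"}, {"dairy", "beverages"}]
print(multilabel_prf(truth, preds))
```

Averaging per sample (rather than per label) matches the evaluation described here: every query contributes equally regardless of how many categories it carries.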

4. Summary and Outlook

This article has detailed JD Daojia's query understanding pipeline, covering segmentation, term weighting, rewriting, and intent detection, along with practical implementations and experimental gains. Future work includes incorporating user personalization, richer contextual signals, and deep reinforcement learning to better handle long‑tail queries.

5. References

1. SimRank: A Measure of Structural‑Context Similarity
2. SimRank++: Query Rewriting through Link Analysis of the Click Graph
3. Context‑ and Content‑aware Embeddings for Query Rewriting in Sponsored Search
4. HFT‑CNN: Learning Hierarchical Category Structure for Multi‑label Short Text Categorization
5. A DNN+GBDT Fusion Model for Query Category Prediction (in Chinese: 基于DNN+GBDT的Query类目预测融合模型)

Tags: e-commerce, machine learning, search engine, natural language processing, query understanding
Written by Dada Group Technology

Sharing insights and experiences from Dada Group's R&D department on product refinement and technology advancement, connecting with fellow geeks to exchange ideas and grow together.