
Generative Retrieval for E‑commerce Search: Lexical‑Based and Semantic‑ID Approaches

This article presents a comprehensive study of generative retrieval in large‑scale e‑commerce search, detailing lexical‑based and SemanticID‑based methods, their challenges such as long‑tail distribution and token length, experimental evaluations, the discovered "sandglass" effect, and proposed solutions to improve recall and efficiency.

JD Retail Technology

The authors, Wang Huimu and Li Mingming, presenting at DataFunSummit 2024, introduce generative retrieval for e‑commerce search, focusing on two directions: lexical‑based and SemanticID‑based approaches.

In the current retrieval pipeline, the dual‑tower architecture (representation + index) struggles with efficiency, precision in semantic matching, and long‑tail data, prompting the exploration of generative retrieval that directly maps queries to product titles using large language models.
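To make the limitation concrete, here is a minimal sketch of dual‑tower scoring: query and item are encoded independently so the item side can be pre‑indexed, but the two sides interact only through a single dot product. The toy encoder and catalog below are illustrative stand‑ins, not the production system.

```python
import hashlib
import numpy as np

DIM = 8

def encode(text: str) -> np.ndarray:
    """Stand-in encoder: a deterministic pseudo-embedding per string
    (a real tower would be a learned neural network)."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

items = ["red running shoes", "wireless earbuds", "running socks"]
item_matrix = np.stack([encode(t) for t in items])  # pre-computed item index

def retrieve(query: str, k: int = 2) -> list[str]:
    # Late interaction: the only query-item contact is one dot product,
    # which is what limits fine-grained semantic matching.
    scores = item_matrix @ encode(query)
    return [items[i] for i in np.argsort(-scores)[:k]]

print(retrieve("shoes for running"))
```

Generative retrieval replaces this late-interaction scoring with a model that decodes the target representation directly from the query.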

Four core advantages of generative retrieval are highlighted: avoiding link loss, simplifying index management, improving model performance with advanced LLMs, and enhancing knowledge fusion for cold‑start and long‑tail promotion, while challenges such as product representation difficulty, long text, noise, and high training cost remain.

The lexical‑based strategy leverages natural language tokens for text representation, but faces issues like short queries versus long titles, one‑to‑many mapping, and generation hallucination. To address these, the authors propose a "Preference‑Optimized Generative Retrieval" framework consisting of task redefinition (Query‑to‑MultiSpan), supervised fine‑tuning, DPO‑based preference optimization, and constrained beam search.
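The constrained beam search step above can be sketched with a trie built over catalog titles: at each decoding step the model may only emit tokens that keep the partial output a valid prefix of some real title, which rules out hallucinated products. The uniform per‑token scores below are placeholders for model log‑probabilities; all names are illustrative, not the authors' implementation.

```python
def build_trie(titles):
    """Prefix tree over tokenized catalog titles, terminated by </s>."""
    root = {}
    for title in titles:
        node = root
        for tok in title.split() + ["</s>"]:
            node = node.setdefault(tok, {})
    return root

def constrained_beam_search(trie, beam_width=2, max_len=5):
    beams = [([], trie, 0.0)]  # (tokens so far, trie node, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, node, lp in beams:
            for tok, child in node.items():  # only catalog-valid continuations
                score = lp - 1.0             # placeholder for a model log-prob
                if tok == "</s>":
                    finished.append((toks, score))
                else:
                    candidates.append((toks + [tok], child, score))
        beams = sorted(candidates, key=lambda b: -b[2])[:beam_width]
        if not beams:
            break
    return [" ".join(t) for t, _ in finished]

catalog = ["red shoes", "red shirt", "blue shoes"]
print(constrained_beam_search(build_trie(catalog)))
```

Because every expansion follows a trie edge, every finished hypothesis is guaranteed to be an exact catalog title.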

Experimental results show that after supervised fine‑tuning and DPO, recall on head queries improves while maintaining gains on medium‑long tail queries, outperforming dense retrieval baselines (e.g., DSSM, RSR) and other generative baselines (SEAL, TIGER).

For the SemanticID‑based direction, the authors discuss the "sandglass" phenomenon caused by residual quantization, where middle‑layer tokens become overly concentrated, leading to sparse paths and long‑tail distribution that degrade model performance.
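A toy residual quantization sketch shows how a multi‑level SemanticID is formed: each level quantizes the residual left by the previous level, so the first codebook absorbs coarse structure and later residuals shrink, which in learned codebooks is where middle‑level codes can collapse onto a few dominant tokens (the sandglass effect). Codebooks here are random for illustration; a real system learns them (e.g. with an RQ‑VAE).

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, LEVELS, CODES = 16, 3, 8
codebooks = rng.standard_normal((LEVELS, CODES, DIM))

def semantic_id(vec: np.ndarray) -> tuple[int, ...]:
    """Assign one code per level by quantizing successive residuals."""
    residual, code = vec.copy(), []
    for level in range(LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        code.append(idx)
        residual = residual - codebooks[level][idx]  # quantize the leftover
    return tuple(code)

items = rng.standard_normal((200, DIM))
ids = [semantic_id(v) for v in items]

# Inspect how many distinct codes each level uses; with trained codebooks
# the middle levels tend to concentrate, producing the sandglass shape.
for level in range(LEVELS):
    used = len({c[level] for c in ids})
    print(f"level {level}: {used} of {CODES} codes used")
```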

Two mitigation strategies are proposed: (1) heuristically removing the large routing node layer, and (2) a variable‑length token removal (top‑K) that adaptively drops dominant tokens, both validated on LLaMA models with improved metrics.
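The second strategy can be sketched as follows: count how often each middle‑level token appears across the corpus of SemanticIDs and drop it from IDs where its share exceeds a threshold, yielding variable‑length IDs in which dominant routing tokens no longer flatten the path distribution. The threshold and data below are illustrative, not the authors' settings.

```python
from collections import Counter

def drop_dominant_tokens(semantic_ids, level=1, max_share=0.5):
    """Remove the token at `level` from IDs where it is over-represented."""
    counts = Counter(sid[level] for sid in semantic_ids)
    total = len(semantic_ids)
    dominant = {tok for tok, c in counts.items() if c / total > max_share}
    out = []
    for sid in semantic_ids:
        if sid[level] in dominant:
            out.append(sid[:level] + sid[level + 1:])  # shorter, variable-length ID
        else:
            out.append(sid)
    return out

ids = [(0, 7, 3), (1, 7, 4), (2, 7, 5), (3, 2, 6)]  # token 7 dominates level 1
print(drop_dominant_tokens(ids))
# token 7 covers 3/4 of the IDs, so those IDs lose their middle token
```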

Future work aims to enhance SemanticID representations, integrate static and dynamic features, and unify generative retrieval with ranking to reduce pipeline loss and improve overall search quality.

The article concludes with a Q&A session addressing computational cost and ranking work, followed by a brief introduction of the JD Search Algorithm team and recruitment information.

AI · large language models · e-commerce search · generative retrieval · lexical approach · semantic ID
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
