
Generative Retrieval for E‑commerce Search: Lexical‑Based and Semantic‑ID Approaches

This article presents a comprehensive study of generative retrieval in large‑scale e‑commerce search, detailing lexical‑based and SemanticID‑based methods, their challenges such as long‑tail distribution and token length, experimental evaluations, the discovered "sandglass" effect, and proposed solutions to improve recall and efficiency.

JD Retail Technology

The authors, Wang Huimu and Li Mingming, presenting at DataFunSummit 2024, introduce generative retrieval for e‑commerce search, focusing on two directions: lexical‑based and SemanticID‑based approaches.

In the current retrieval pipeline, the dual‑tower architecture (representation + index) struggles with efficiency, precision in semantic matching, and long‑tail data, prompting the exploration of generative retrieval that directly maps queries to product titles using large language models.
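To make the limitation concrete, here is a minimal sketch of dual‑tower scoring: query and item are encoded independently so the item side can be pre‑indexed, but the two sides interact only through a single dot product. The toy encoder and catalog below are illustrative stand‑ins, not the production system.

```python
import hashlib
import numpy as np

DIM = 8

def encode(text: str) -> np.ndarray:
    """Stand-in encoder: a deterministic pseudo-embedding per string
    (a real tower would be a learned neural network)."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

items = ["red running shoes", "wireless earbuds", "running socks"]
item_matrix = np.stack([encode(t) for t in items])  # pre-computed item index

def retrieve(query: str, k: int = 2) -> list[str]:
    # Late interaction: the only query-item contact is one dot product,
    # which is what limits fine-grained semantic matching.
    scores = item_matrix @ encode(query)
    return [items[i] for i in np.argsort(-scores)[:k]]

print(retrieve("shoes for running"))
```

Generative retrieval replaces this late-interaction scoring with a model that decodes the target representation directly from the query.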

Four core advantages of generative retrieval are highlighted: avoiding link loss, simplifying index management, improving model performance with advanced LLMs, and enhancing knowledge fusion for cold‑start and long‑tail promotion, while challenges such as product representation difficulty, long text, noise, and high training cost remain.

The lexical‑based strategy leverages natural language tokens for text representation, but faces issues like short queries versus long titles, one‑to‑many mapping, and generation hallucination. To address these, the authors propose a "Preference‑Optimized Generative Retrieval" framework consisting of task redefinition (Query‑to‑MultiSpan), supervised fine‑tuning, DPO‑based preference optimization, and constrained beam search.
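The constrained beam search step above can be sketched with a trie built over catalog titles: at each decoding step the model may only emit tokens that keep the partial output a valid prefix of some real title, which rules out hallucinated products. The uniform per‑token scores below are placeholders for model log‑probabilities; all names are illustrative, not the authors' implementation.

```python
def build_trie(titles):
    """Prefix tree over tokenized catalog titles, terminated by </s>."""
    root = {}
    for title in titles:
        node = root
        for tok in title.split() + ["</s>"]:
            node = node.setdefault(tok, {})
    return root

def constrained_beam_search(trie, beam_width=2, max_len=5):
    beams = [([], trie, 0.0)]  # (tokens so far, trie node, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, node, lp in beams:
            for tok, child in node.items():  # only catalog-valid continuations
                score = lp - 1.0             # placeholder for a model log-prob
                if tok == "</s>":
                    finished.append((toks, score))
                else:
                    candidates.append((toks + [tok], child, score))
        beams = sorted(candidates, key=lambda b: -b[2])[:beam_width]
        if not beams:
            break
    return [" ".join(t) for t, _ in finished]

catalog = ["red shoes", "red shirt", "blue shoes"]
print(constrained_beam_search(build_trie(catalog)))
```

Because every expansion follows a trie edge, every finished hypothesis is guaranteed to be an exact catalog title.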

Experimental results show that after supervised fine‑tuning and DPO, recall on head queries improves while maintaining gains on medium‑long tail queries, outperforming dense retrieval baselines (e.g., DSSM, RSR) and other generative baselines (SEAL, TIGER).

For the SemanticID‑based direction, the authors discuss the "sandglass" phenomenon caused by residual quantization, where middle‑layer tokens become overly concentrated, leading to sparse paths and long‑tail distribution that degrade model performance.
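A toy residual quantization sketch shows how a multi‑level SemanticID is formed: each level quantizes the residual left by the previous level, so the first codebook absorbs coarse structure and later residuals shrink, which in learned codebooks is where middle‑level codes can collapse onto a few dominant tokens (the sandglass effect). Codebooks here are random for illustration; a real system learns them (e.g. with an RQ‑VAE).

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, LEVELS, CODES = 16, 3, 8
codebooks = rng.standard_normal((LEVELS, CODES, DIM))

def semantic_id(vec: np.ndarray) -> tuple[int, ...]:
    """Assign one code per level by quantizing successive residuals."""
    residual, code = vec.copy(), []
    for level in range(LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        code.append(idx)
        residual = residual - codebooks[level][idx]  # quantize the leftover
    return tuple(code)

items = rng.standard_normal((200, DIM))
ids = [semantic_id(v) for v in items]

# Inspect how many distinct codes each level uses; with trained codebooks
# the middle levels tend to concentrate, producing the sandglass shape.
for level in range(LEVELS):
    used = len({c[level] for c in ids})
    print(f"level {level}: {used} of {CODES} codes used")
```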

Two mitigation strategies are proposed: (1) heuristically removing the large routing node layer, and (2) a variable‑length token removal (top‑K) that adaptively drops dominant tokens, both validated on LLaMA models with improved metrics.
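The second strategy can be sketched as follows: count how often each middle‑level token appears across the corpus of SemanticIDs and drop it from IDs where its share exceeds a threshold, yielding variable‑length IDs in which dominant routing tokens no longer flatten the path distribution. The threshold and data below are illustrative, not the authors' settings.

```python
from collections import Counter

def drop_dominant_tokens(semantic_ids, level=1, max_share=0.5):
    """Remove the token at `level` from IDs where it is over-represented."""
    counts = Counter(sid[level] for sid in semantic_ids)
    total = len(semantic_ids)
    dominant = {tok for tok, c in counts.items() if c / total > max_share}
    out = []
    for sid in semantic_ids:
        if sid[level] in dominant:
            out.append(sid[:level] + sid[level + 1:])  # shorter, variable-length ID
        else:
            out.append(sid)
    return out

ids = [(0, 7, 3), (1, 7, 4), (2, 7, 5), (3, 2, 6)]  # token 7 dominates level 1
print(drop_dominant_tokens(ids))
# token 7 covers 3/4 of the IDs, so those IDs lose their middle token
```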

Future work aims to enhance SemanticID representations, integrate static and dynamic features, and unify generative retrieval with ranking to reduce pipeline loss and improve overall search quality.

The article concludes with a Q&A session addressing computational cost and ranking work, followed by a brief introduction of the JD Search Algorithm team and recruitment information.

AI · large language models · e-commerce search · generative retrieval · lexical approach · semantic ID
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
