
Generative Retrieval for E‑commerce Search: Lexical and SemanticID Approaches

This article presents a comprehensive study of generative retrieval for large-scale e-commerce search: the background challenges, the advantages of generative methods, and two concrete strategies (Lexical-based and SemanticID-based), covering task redesign, preference optimization, constrained beam search, extensive experiments, and future research directions.

DataFunSummit

The article begins by describing the current bottlenecks in e‑commerce search recall, where traditional dual‑tower architectures struggle with efficiency, precision on long‑tail queries, and index maintenance, motivating the exploration of generative retrieval that directly maps queries to relevant product titles using large language models.

Four core benefits of generative retrieval are highlighted: eliminating pipeline loss, simplifying index management, improving model performance through advanced LLMs, and leveraging world knowledge for better personalization and cold‑start handling. However, challenges such as long token sequences, noise, and training difficulty remain.

Two solution pathways are investigated. The Lexical-based approach keeps the natural-language token space, redefining the task from Query-to-Title to Query-to-MultiSpan, and applies supervised fine-tuning, preference optimization (DPO), and constrained beam search to generate high-quality spans that are then matched against an FM-index for efficient lookup.
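To make the decoding step concrete, here is a minimal sketch of prefix-constrained beam search. A simple trie stands in for the FM-index described in the article, and a toy scoring function stands in for the LLM's next-token probabilities; all names and data are illustrative, not the authors' implementation.

```python
# Minimal sketch of prefix-constrained beam search over indexed spans.
# A trie stands in for the FM-index; score_fn stands in for an LLM.
import math

def build_trie(spans):
    """Index token sequences so only valid continuations can be generated."""
    trie = {}
    for span in spans:
        node = trie
        for tok in span:
            node = node.setdefault(tok, {})
        node["<end>"] = {}  # marks a complete indexed span
    return trie

def constrained_beam_search(trie, score_fn, beam_width=2, max_len=5):
    beams = [((), 0.0, trie)]  # (tokens, cumulative log-prob, trie node)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp, node in beams:
            for tok, child in node.items():
                if tok == "<end>":
                    finished.append((tokens, logp))
                else:
                    # Only tokens present in the trie node are allowed:
                    # every hypothesis stays a prefix of some indexed span.
                    candidates.append(
                        (tokens + (tok,), logp + math.log(score_fn(tokens, tok)), child)
                    )
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return sorted(finished, key=lambda f: f[1], reverse=True)

spans = [("red", "running", "shoes"), ("red", "dress"), ("blue", "running", "shoes")]
trie = build_trie(spans)
prefer = {"red": 0.5, "running": 0.5}           # toy model: prefer these tokens
score = lambda ctx, tok: prefer.get(tok, 0.25)
results = constrained_beam_search(trie, score)  # every result is an indexed span
```

Because every hypothesis is forced to remain a prefix of some indexed span, the model can never hallucinate a span that is absent from the product corpus, which is the core guarantee the FM-index provides at scale.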

The SemanticID‑based approach encodes product texts into compact numeric identifiers via residual quantization, exposing a “sand‑glass” distribution where middle‑layer tokens become overly concentrated, harming long‑tail performance. Experiments on LLaMA, Qwen, and Baichuan models confirm this effect.
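The residual-quantization step can be sketched as follows: each level quantizes the residual left over by the previous level, so a long embedding collapses into a short tuple of codebook indices. The codebooks below are random purely for illustration; in practice they are learned (e.g. via k-means or an RQ-VAE), and the sizes are assumptions, not the paper's settings.

```python
# Sketch of residual quantization (RQ) turning a product embedding into a
# short SemanticID. Random codebooks for illustration only; real systems
# learn them from data.
import numpy as np

def residual_quantize(vec, codebooks):
    ids = []
    residual = vec.astype(float)
    for book in codebooks:  # one codebook per SemanticID position
        # Pick the nearest code, record its index, pass on the residual.
        idx = int(np.argmin(((book - residual) ** 2).sum(axis=1)))
        ids.append(idx)
        residual = residual - book[idx]
    return ids, residual

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]  # 3 levels, 256 codes each
emb = rng.normal(size=64)                                   # a product embedding
semantic_id, err = residual_quantize(emb, codebooks)        # e.g. a 3-token ID
```

The "sand-glass" problem arises when, across the corpus, the middle position of these ID tuples collapses onto a handful of codes, so the middle token carries almost no information for distinguishing long-tail items.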

To mitigate the sand‑glass phenomenon, two methods are proposed: (1) heuristically removing the dominant routing layer, and (2) an adaptive variable‑length token removal strategy that prunes top‑K high‑frequency tokens while preserving informative ones. Both improve recall and reduce computational cost.
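The variable-length pruning idea can be illustrated with a small sketch: count token frequencies at each SemanticID position, mark the top-K most frequent (least informative) tokens per position, and drop them, leaving shorter, variable-length IDs. The data and K below are illustrative, not the paper's configuration.

```python
# Sketch of adaptive variable-length SemanticIDs: prune the top-K most
# frequent tokens at each position. Illustrative data and settings only.
from collections import Counter

def prune_frequent_tokens(ids_corpus, top_k=1):
    n_levels = len(ids_corpus[0])
    # Per-position frequency counts expose "sand-glass" concentration.
    hot = []
    for level in range(n_levels):
        counts = Counter(sid[level] for sid in ids_corpus)
        hot.append({tok for tok, _ in counts.most_common(top_k)})
    # Drop the over-concentrated tokens, yielding variable-length IDs.
    return [[t for lvl, t in enumerate(sid) if t not in hot[lvl]]
            for sid in ids_corpus]

# Token 7 dominates the middle position, mimicking the sand-glass shape.
corpus = [[1, 7, 3], [2, 7, 4], [5, 7, 6], [1, 8, 6]]
short_ids = prune_frequent_tokens(corpus, top_k=1)
```

Shorter IDs mean fewer decoding steps per product, which is where the reported reduction in computational cost comes from, while the surviving tokens are exactly the ones that discriminate between items.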

Extensive evaluations compare dense retrieval baselines (DSSM, RSR) with generative baselines (SEAL, TIGER) using Recall@K metrics, demonstrating that preference‑optimized generative models close the gap on head queries while excelling on mid‑ and long‑tail queries.
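For reference, the Recall@K metric used in these comparisons is typically computed per query as the fraction of relevant items appearing in the top-K retrieved list, averaged over queries; a minimal sketch with made-up data:

```python
# Recall@K: average over queries of |top-K retrieved ∩ relevant| / |relevant|.
# Data below is purely illustrative.
def recall_at_k(retrieved, relevant, k):
    total = 0.0
    for q, rel in relevant.items():
        top = set(retrieved.get(q, [])[:k])  # top-K retrieved items for q
        total += len(top & rel) / len(rel)
    return total / len(relevant)

retrieved = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"a", "c"}, "q2": {"z"}}
r2 = recall_at_k(retrieved, relevant, 2)  # "c" and "z" fall outside top-2
r3 = recall_at_k(retrieved, relevant, 3)  # all relevant items recovered
```

Breaking the metric down by query-frequency bucket (head, mid, tail) is what reveals the pattern the article reports: preference-optimized generative models match dense baselines on head queries and pull ahead on the tail.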

Future work focuses on enhancing SemanticID representations, integrating static and dynamic features, and unifying generative retrieval with ranking to create a single, end‑to‑end model for the entire search pipeline.

Tags: large language models, e-commerce search, preference optimization, generative retrieval, lexical approach, semantic ID, recall metrics
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
