How Large Language Models Boost Search Relevance: A Real‑World Case Study

This article explains how a leading e‑commerce platform leveraged large language models to overcome traditional search relevance challenges, detailing the iterative workflow, model distillation, performance gains, deployment results, and future directions for smarter, more accurate product search.

DeWu Technology

Background

Users often encounter mismatched search results, such as seeing limited‑edition sneakers when looking for affordable student shoes, or heavy outdoor jackets when searching for lightweight winter coats. Search relevance measures how well results match user intent, directly affecting conversion and retention.

Traditional Relevance Iteration Pain Points

High resource consumption and labeling cost: tens of millions of query‑item pairs require massive manual annotation, demanding dozens of full‑time staff for a year.

Limited scalability and slow iteration: frequent business-rule updates force re‑annotation, leading to long cycles.

Poor generalization on long‑tail scenarios: models trained on common categories struggle with new or niche items.

LLM‑Based Iteration Process

Recent large language models (e.g., GPT, Qwen) offer stronger language understanding, richer world knowledge, and lower data requirements. By distilling a large model fine‑tuned on roughly ten thousand samples into a smaller online model, the team cut labeling costs dramatically while improving accuracy.

The workflow now includes a two‑stage relevance judgment: first understanding user intent and extracting key attributes, then verifying content matches those attributes. This raised overall accuracy from 75% to 80.95%.
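The two-stage judgment can be sketched as two chained model calls: one to extract the query's key attributes, one to verify the item against them. The prompt wording and the `call_llm` stub below are illustrative assumptions, not the team's actual prompts; the stub stands in for a real hosted-LLM call so the sketch runs standalone.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call, stubbed with simple
    rules so this two-stage sketch runs without a model."""
    if prompt.startswith("Extract"):
        # Stage-1 stub: pretend the model extracted these attributes.
        return "category=sneakers; budget=low; audience=students"
    # Stage-2 stub: crude keyword check on the item portion of the prompt.
    item = prompt.split("Item:")[1]
    return "match" if "sneaker" in item.lower() else "mismatch"

def judge_relevance(query: str, item_title: str) -> str:
    # Stage 1: understand user intent and extract key attributes.
    attrs = call_llm(f"Extract key attributes from the query: {query}")
    # Stage 2: verify the item content against those attributes.
    return call_llm(f"Attributes: {attrs}\nItem: {item_title}\nDoes it match?")

print(judge_relevance("affordable student shoes", "Canvas Sneakers, budget line"))
```

Separating extraction from verification lets each stage fail (and be monitored) independently, which is what a single end-to-end relevance prompt cannot offer.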

Additionally, the “R1 slow‑thinking” approach incorporates chain‑of‑thought reasoning from models like DeepSeek R1, further boosting accuracy to 83.1% and improving long‑tail performance.
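A slow-thinking response typically emits its reasoning before the final label, which the serving side must parse. The `<think>...</think>` tag format and the `relevant`/`irrelevant` labels below are assumptions for illustration, not a documented DeepSeek output contract.

```python
import re

# Hypothetical prompt encouraging step-by-step reasoning before a verdict.
COT_PROMPT = (
    "Query: {query}\nItem: {item}\n"
    "Think step by step about whether the item satisfies the query's "
    "intent, then answer 'relevant' or 'irrelevant'."
)

def parse_r1_response(text: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, final label)."""
    m = re.search(r"<think>(.*?)</think>\s*(\w+)", text, re.S)
    if not m:
        raise ValueError("unexpected response format")
    return m.group(1).strip(), m.group(2).lower()

resp = ("<think>The query asks for lightweight winter coats; "
        "this item is a heavy parka.</think> irrelevant")
reasoning, label = parse_r1_response(resp)
print(label)
```

Keeping the reasoning text around (rather than discarding it) is also what makes it usable as distillation material for the smaller online model.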

Modeling Search Relevance with Large Models

A two‑phase pipeline was designed: first, a knowledge‑distillation path replaces costly BERT training; second, the large model is integrated into the relevance problem‑discovery‑to‑solution loop, handling new‑term diagnosis, bad‑case monitoring, and automated sample generation.
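The distillation path can be illustrated with a minimal numpy sketch: a "teacher" (standing in for the large model's generated labels) produces soft relevance scores, and a small linear "student" (standing in for the online BERT) is trained to match them. The feature dimensions, learning rate, and iteration count are arbitrary illustrative choices, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # features for 200 query-item pairs
true_w = rng.normal(size=8)          # hidden "teacher" parameters

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

teacher_probs = sigmoid(X @ true_w)  # soft labels from the teacher

w = np.zeros(8)                      # student parameters, trained from scratch
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w)
    # Gradient of binary cross-entropy against the soft teacher labels.
    grad = X.T @ (p - teacher_probs) / len(X)
    w -= lr * grad

student_probs = sigmoid(X @ w)
mae = np.abs(student_probs - teacher_probs).mean()
print(f"mean |student - teacher| = {mae:.4f}")
```

The key point the sketch captures is that the student never sees human labels, only the teacher's soft scores, which is what replaces the costly manual annotation in the BERT training path.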

Effects

After two quarters, the large model outperformed the online BERT baseline, increasing overall accuracy by 11.47% and macro‑average F1 by 16.21%. Gains were especially large on low‑score buckets, with up to 32.66% improvement. In long‑tail scenarios, accuracy rose 6.78% and macro‑F1 25.72%.

Deployment

Using large‑model‑generated labels for millions of items and multi‑stage data distillation, the online relevance bad‑case rate dropped by 5.39% overall and 10.82% in long‑tail cases, saving labeling costs in the six‑figure range. Offline evaluation showed a 9.58% accuracy lift for the distilled BERT model.

Conclusion

Large language models have markedly improved search relevance by better understanding intent and generalizing to unseen queries. Ongoing work focuses on generative listwise reinforcement learning for ranking and higher‑ROI distillation strategies to further close the gap with traditional metrics.

Tags: e-commerce, AI, large language models, search relevance, model distillation
Written by DeWu Technology, a platform for sharing and discussing tech knowledge.
