Zero‑Cost Upgrade: OneSearch‑V2 Launches Generative Search, Boosting Buyers and Orders

OneSearch‑V2 introduces a zero‑cost generative search upgrade built on latent‑reasoning‑enhanced self‑distillation, thought‑augmented query understanding, and behavior‑feedback preference alignment. It delivers offline HitRate@10 gains of up to 2.68% and online lifts of roughly +4% CTR, +2% buyers, and +2% orders.

Machine Heart

Background

Generative retrieval in e‑commerce suffers from insufficient understanding of complex queries, weak personalized intent reasoning, and reward‑model over‑fitting. OneSearch‑V1 reduced inference cost and improved high‑frequency queries, but roughly one third of traffic still converted poorly.

OneSearch‑V2 Core Design

The paper proposes a Latent‑Reasoning‑Enhanced Self‑Distillation (LR‑ESD) framework composed of three modules:

Thought‑augmented Query Understanding: a keyword‑based chain‑of‑thought (CoT), generated by Qwen3‑32B, that extracts intent, category, attributes, and topics, providing dense semantic signals without increasing token length.

Reasoning‑internalized Self‑Distillation: an asymmetric teacher‑student scheme in which the teacher receives the full keyword‑augmented input while the student receives only the raw query; a KL loss forces the student to internalize the reasoning in its weights, eliminating any extra inference cost.

Behavior‑Feedback Preference Alignment: replaces the independent reward model with a composite reward built from relevance, calibrated post‑click CTR, and click/order signals. Token‑Position Marginal Advantage GRPO (TPMA‑GRPO) distributes credit along the hierarchical SID sequence.

Encoding Experiments

OneSearch‑V2 retains the V1 KHQE + RQ‑OPQ encoding. Extensive offline tests compare single‑modal text encoding, multimodal unified encoding, and multimodal separate‑then‑concatenate encoding on roughly 5M <query, item> pairs (titles plus two main images). Single‑modal methods outperform multimodal ones (e.g., the small bge‑base model beats the much larger Qwen3‑VL) because cross‑modal noise dilutes key attributes. The KHQE scheme achieves the best trade‑off between accuracy and real‑time latency.

Encoding comparison diagram

Thought‑augmented Query Understanding Pipeline

The three‑step pipeline processes each query:

Query analysis: intent detection, hierarchical category matching, attribute extraction, and topic recommendation.

Keyword extraction: generate a concise keyword CoT, enforce intent‑category‑attribute consistency, deduplicate synonyms, and rank by item popularity.

Preference calibration: use the user profile and recent interactions to filter or augment the keyword set; during training, inject the currently viewed item as a strong signal.

Keyword‑driven CoT generation runs asynchronously and its results are cached for identical queries, yielding zero latency overhead.
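The asynchronous generation plus caching described above can be sketched as follows. This is a minimal illustration, not the production system: `generate_keyword_cot` is a hypothetical stand-in for the offline LLM call (the paper uses Qwen3‑32B), and the keyword tags it returns are made up for the demo.

```python
import hashlib

def generate_keyword_cot(query: str) -> list[str]:
    # Hypothetical stand-in for the offline Qwen3-32B call; in production
    # this runs asynchronously, off the serving path. Tags are illustrative.
    head = query.split()[0] if query.split() else query
    return ["intent:purchase", f"category:{head}", "attr:popular"]

_cache: dict[str, list[str]] = {}

def cached_keywords(query: str) -> list[str]:
    """Return the keyword CoT for a query; identical queries hit the
    cache, so online serving pays no extra latency for the CoT."""
    key = hashlib.md5(query.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate_keyword_cot(query)
    return _cache[key]
```

Because results are keyed on the normalized query, repeated queries reuse the cached keyword set, which is what makes the CoT effectively free at serving time.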

Thought‑augmented pipeline diagram

Reasoning‑internalized Self‑Distillation Details

The teacher computes logits on query + keyword‑CoT while the student computes logits on query only. KL divergence aligns the student distribution to the teacher’s, and the loss combines standard cross‑entropy with the distillation term. Regularization includes:

R‑Drop: two forward passes with different dropout masks, minimizing their KL divergence.

FGM: adversarial perturbation of the embedding layer to improve input robustness.

Focal loss: mitigates long‑tail class imbalance in the SID vocabulary.
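The core objective can be sketched in a few lines of numpy: cross‑entropy on the ground‑truth SID token plus a KL term pulling the student (query‑only input) toward the teacher (query + keyword‑CoT input). This is a simplified single‑token sketch; the regularizers above are omitted and the weight `alpha` is illustrative, not from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-9):
    # KL(p || q) over the SID vocabulary.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def distill_loss(teacher_logits, student_logits, target, alpha=0.5):
    """Cross-entropy on the target SID plus a KL distillation term.
    alpha is an illustrative weight between the two terms."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    ce = -np.log(p_s[target] + 1e-9)
    return float(ce + alpha * kl(p_t, p_s))
```

When the student matches the teacher, the KL term vanishes and only the cross‑entropy remains; any divergence from the keyword‑informed teacher distribution adds loss, which is how the reasoning is pushed into the student's weights.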

Ablation shows the self‑distilled student outperforms both the teacher (despite lacking keywords) and all alternative distillation strategies (special‑token, CODI‑style hidden‑state alignment, EMA teacher, joint training).

Self‑distillation architecture

Behavior‑Feedback Preference Alignment

The composite reward R combines:

Relevance reward (four‑level scoring).

Posterior conversion reward (calibrated CTR).

Click and order reward (direct user actions).
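A minimal sketch of how these three signal families could be folded into one scalar reward. The weights and the normalization of the four‑level relevance score are assumptions for illustration; the paper does not publish the exact combination.

```python
def composite_reward(relevance_level: int, calibrated_ctr: float,
                     clicked: bool, ordered: bool,
                     w=(0.5, 0.3, 0.1, 0.1)) -> float:
    """Combine relevance, calibrated post-click CTR, and direct user
    actions into one reward. Weights w are illustrative only."""
    rel = relevance_level / 3.0  # four-level score 0..3 mapped to [0, 1]
    return (w[0] * rel + w[1] * calibrated_ctr
            + w[2] * float(clicked) + w[3] * float(ordered))
```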

TPMA‑GRPO decomposes sequence‑level advantage into position‑level marginal contributions, applying a prefix gate that only propagates gradients when the prefix matches the ground‑truth, thus respecting the hierarchical nature of SID generation.
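The prefix‑gate idea can be sketched as follows: a position only receives credit if the generated SID prefix before it matches the ground truth. This is a simplification — the true marginal decomposition is replaced here by an equal split of the sequence‑level advantage over the gated positions — so treat it as an illustration of the gating, not the full TPMA‑GRPO estimator.

```python
def position_advantages(generated: list[int], truth: list[int],
                        seq_advantage: float) -> list[float]:
    """Distribute a sequence-level advantage over SID positions.
    Position i is gated open only when generated[:i] == truth[:i],
    respecting the hierarchical (coarse-to-fine) SID structure."""
    gates = [generated[:i] == truth[:i] for i in range(len(generated))]
    n_open = sum(gates) or 1
    return [seq_advantage / n_open if g else 0.0 for g in gates]
```

Once the generated prefix diverges from the ground truth, all later positions are gated off, so no gradient flows to tokens conditioned on a wrong coarse‑level SID.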

TPMA‑GRPO diagram

Offline Evaluation

Using 30k logged PVs (30k clicks, 7,229 orders), models are evaluated with HitRate@10 (HR@10) and MRR@10. Incremental additions yield:

CoT tasks: +0.48 % Order HR@10.

Self‑distillation: +1.17 % Order HR@10, +1.67 % Click HR@10.

R‑Drop: MRR@10 +0.0028.

FGM: Order HR@10 = 0.2180, Click HR@10 = 0.2422.

Focal loss: Order HR@10 = 0.2214, Click HR@10 = 0.2471.

TPMA‑GRPO: the final OneSearch‑V2 achieves Order HR@10 = 0.2314 and Click HR@10 = 0.2568, a 2.68% average HR@10 gain over the baseline.
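For reference, the two evaluation metrics are standard and can be computed as below; the function names are ours, but the definitions (any positive in the top k for HR@10, reciprocal rank of the first positive for MRR@10) match common usage.

```python
def hitrate_at_k(ranked_ids: list[int], positive_ids: set[int], k: int = 10) -> float:
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return float(any(i in positive_ids for i in ranked_ids[:k]))

def mrr_at_k(ranked_ids: list[int], positive_ids: set[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item in the top-k, else 0.0."""
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in positive_ids:
            return 1.0 / rank
    return 0.0
```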

Offline results chart

Online A/B Test

OneSearch‑V2 vs V1 shows statistically significant lifts (p < 0.05): CTR +3.98 %, page‑click +1.17 %, conversion +2.90 %, buyer count +2.07 %, order volume +2.11 %.

Online A/B test results

Deeper Analyses

Performance gains are consistent across user groups, query‑frequency buckets, and item hotness, with especially large improvements for low‑activity users and cold‑start items. The CTR curve changes from a “reverse‑U” in V1 to a “U” in V2, indicating better handling of both head and tail queries.

Industry‑level CTR uplift averages 3.98 % across categories; categories with ambiguous titles (apparel, shoes, cosmetics, hardware) see the largest gains.

CoT keyword coverage continuously rises in production, ensuring stable self‑distillation updates.

Trade‑off experiments confirm that higher relevance reward alone can reduce order conversion, while the full TPMA‑GRPO balance improves final order metrics.

During the 3.18 global shopping festival, increasing the relevance reward for emerging merchants boosted their rankings and demonstrated the system’s ability to adapt objectives in real time.

Future Directions

Planned work includes beyond‑log training for long‑tail queries, unified multimodal SID encoding for videos/live‑streams, and online learning mechanisms for agentic search systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
