How Large Language Models Are Redefining Search Ranking at Tencent
This article details Tencent Search's exploration of large‑model‑driven ranking, covering the evolution from traditional keyword retrieval to RAG‑based AI search, the multi‑stage AI ranking architecture (L0‑L5), model training pipelines, distillation, synthetic data generation, and future research directions.
Background
Traditional keyword‑based search required users to craft queries and manually synthesize results. The rise of large language models (LLMs) such as ChatGPT introduced Retrieval‑Augmented Generation (RAG), enabling dynamic integration of external knowledge and reducing hallucinations.
AI‑Driven Search Architecture
The system follows a four‑stage workflow:
Planner: decides which components (search engine, plugins, agents) to invoke based on the user prompt.
Retrieval: executes the chosen components and returns candidate documents.
Rerank: refines the candidate list using a ranking model.
Agent: performs multi‑turn reasoning and reflection to handle complex sub‑queries.
This pipeline supports decomposition of conversational queries into sub‑queries and iterative refinement.
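For intuition, the sketch below wires these four stages into a single loop. All class and function names are illustrative stand-ins, not Tencent's actual interfaces.

```python
from dataclasses import dataclass

# Illustrative holder for a retrieved candidate; field names are assumptions.
@dataclass
class Candidate:
    doc_id: str
    text: str
    score: float = 0.0

def plan(prompt: str) -> list[str]:
    """Hypothetical planner: split a conversational prompt into sub-queries."""
    return [prompt]  # trivial plan: one sub-query

def retrieve(sub_query: str) -> list[Candidate]:
    """Hypothetical retrieval: call the chosen components and return candidates."""
    return [Candidate(doc_id="d1", text=f"document about {sub_query}")]

def rerank(sub_query: str, candidates: list[Candidate]) -> list[Candidate]:
    """Hypothetical rerank: score each candidate with a (toy) ranking model."""
    for c in candidates:
        c.score = float(len(set(sub_query.split()) & set(c.text.split())))
    return sorted(candidates, key=lambda c: c.score, reverse=True)

def needs_more_evidence(candidates: list[Candidate]) -> bool:
    """Hypothetical reflection check used by the agent stage."""
    return len(candidates) == 0

def agentic_search(prompt: str, max_rounds: int = 3) -> list[Candidate]:
    """Planner -> Retrieval -> Rerank, with an agent loop that reflects and retries."""
    results: list[Candidate] = []
    for sub_query in plan(prompt):
        for _ in range(max_rounds):
            candidates = rerank(sub_query, retrieve(sub_query))
            results.extend(candidates)
            if not needs_more_evidence(candidates):
                break  # reflection says this sub-query has enough evidence
    return results

print(agentic_search("2025 LPL final match ranking"))
```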
Layered Ranking Pipeline (L0‑L5)
L0 – Recall: coarse candidate retrieval from internal indexes and external APIs.
L1 – Coarse Ranking: lightweight scoring per shard.
L2 – Fine Ranking: richer features (basic relevance, temporal signals).
L3 – Deep Ranking: large model provides dynamic labeling and re‑ranking.
L4 – Hybrid Ranking: merges external API results with internal candidates.
L5 – Prompt‑Level Ranking: produces the final ordered list presented to the user.
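Conceptually, the layered pipeline is a funnel: each stage applies a more expensive model to a smaller candidate set. The toy sketch below illustrates only that pattern; the stage functions and cut-offs are assumptions, not production values.

```python
# Illustrative funnel: each stage scores the survivors of the previous one with a
# costlier model and keeps fewer candidates. Stage functions are toy stand-ins.
def l0_recall(query):
    return [f"doc_{i}" for i in range(1000)]          # coarse recall from indexes / APIs

def cheap_score(query, doc):                          # L1-style lightweight scoring
    return hash((query, doc)) % 100

def rich_score(query, doc):                           # L2/L3-style richer or LLM scoring
    return hash((query, doc, "features")) % 100

def run_cascade(query):
    candidates = l0_recall(query)
    for score_fn, keep in [(cheap_score, 200), (rich_score, 20)]:
        ranked = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
        candidates = ranked[:keep]                    # truncate before the costlier stage
    return candidates

top = run_cascade("2025 LPL final match ranking")
print(len(top), top[:3])
```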
Challenges in Modern Search
Conversational queries often contain rich context and multiple sub‑queries (e.g., “2025 LPL final match ranking?”). Traditional term‑based retrieval struggles with intent understanding, multi‑agent coordination, and iterative refinement.
Model Evolution and Training
From BERT‑pointwise (2020) to decoder‑style large models (2023): early models used a CLS head for scoring; later models treat the LLM as a massive feature extractor with a dense head at the EOS token.
Distillation: a 14 B teacher model is distilled into 0.5 B–3 B student models using a pointwise MSE loss and a pairwise margin loss (Margin‑MSE); a loss sketch follows the training stages below.
Training Stages:
Continuous pre‑training on synthetic content, document‑title generation, and query generation.
Supervised fine‑tuning on large labeled datasets with an added pairwise loss for better discrimination.
Student‑model distillation to reduce inference cost.
Multi‑objective fine‑tuning to predict relevance, authority, freshness, and satisfaction scores.
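As a rough illustration of the distillation objective mentioned above, the PyTorch sketch below combines a pointwise MSE term with a Margin‑MSE term that matches the teacher's positive/negative score margin. The tensor shapes and weighting factor are assumptions, not the production configuration.

```python
import torch
import torch.nn as nn

def distillation_loss(student_pos, student_neg, teacher_pos, teacher_neg, alpha=1.0):
    """Pointwise MSE on absolute scores plus Margin-MSE on the pos-neg score gap."""
    mse = nn.MSELoss()
    pointwise = mse(student_pos, teacher_pos) + mse(student_neg, teacher_neg)
    pairwise = mse(student_pos - student_neg, teacher_pos - teacher_neg)  # Margin-MSE
    return pointwise + alpha * pairwise

# Toy usage with random tensors standing in for teacher / student scores.
batch = 8
s_pos, s_neg = torch.randn(batch), torch.randn(batch)
t_pos, t_neg = torch.randn(batch), torch.randn(batch)
print(distillation_loss(s_pos, s_neg, t_pos, t_neg).item())
```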
LLM‑for‑Ranking Paradigms
Representational LLMs: BERT‑style encoders with a CLS linear head (2020), later replaced by decoder‑based dense scoring (2023).
Generative LLM for Ranking: pointwise generation of a relevance score via Masked Token Prediction (MTP); the model predicts a 0–1 score after softmax aggregation (sketched after this list).
Reasoning‑Enhanced Ranking: chain‑of‑thought (CoT) prompts improve handling of complex queries but increase token count and latency.
Think‑Free Two‑Stage Training: template A (thought + score) and template B (score only) are used for supervised fine‑tuning; at inference only template B runs, so the model benefits from the reasoning distilled via template A without emitting thought tokens.
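The pointwise generative scoring step can be pictured as follows: the model emits a relevance grade token, and the probabilities of the candidate grade tokens are folded into a single 0–1 score. The grade tokens and weights in this sketch are assumptions for illustration.

```python
import torch

def relevance_score(logits_at_answer_pos, grade_token_ids, grade_values):
    """logits_at_answer_pos: [vocab]-sized logits at the position where the grade
    token is generated; the score is the expected grade under a softmax over the
    grade tokens only."""
    grade_logits = logits_at_answer_pos[grade_token_ids]        # e.g. tokens "0".."3"
    probs = torch.softmax(grade_logits, dim=-1)                 # normalize over grades
    return float((probs * torch.tensor(grade_values)).sum())    # expected grade in [0, 1]

logits = torch.randn(32000)          # stand-in for the LLM's vocabulary logits
print(relevance_score(logits, grade_token_ids=[100, 101, 102, 103],
                      grade_values=[0.0, 1 / 3, 2 / 3, 1.0]))
```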
Automatic Sample Generation & Annotation
A two‑stage pipeline creates high‑quality training data:
The user prompt is sent to a retrieval engine, and a large model performs intent analysis on the retrieved documents.
The model annotates each document with relevance labels based on the inferred intent. Synthetic samples target failure modes such as missing top‑1 entities or irrelevant modifiers and are fed back into training.
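A minimal sketch of this two-stage annotation loop, with placeholder `search` and `call_llm` functions standing in for the real retrieval engine and LLM endpoint:

```python
# Stage 1: infer the intent behind the prompt from retrieved documents.
# Stage 2: label each document against that intent. Both helpers are stubs.
def search(prompt, top_k=20):
    return [{"doc_id": i, "text": f"candidate {i} for {prompt}"} for i in range(top_k)]

def call_llm(instruction: str) -> str:
    return "stub response"            # stand-in for an actual LLM call

def generate_labeled_samples(prompt):
    docs = search(prompt)
    intent = call_llm(f"Analyze the search intent behind: {prompt}")      # stage 1
    samples = []
    for doc in docs:                                                       # stage 2
        label = call_llm(
            f"Intent: {intent}\nDocument: {doc['text']}\n"
            "Label the relevance of this document to the intent (0-3)."
        )
        samples.append({"query": prompt, "doc": doc["text"], "label": label})
    return samples

print(len(generate_labeled_samples("2025 LPL final match ranking")))
```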
Ranking Formulations
Four formulation families are explored:
Pointwise: predict a relevance score for each (query, passage) pair.
Pairwise: compare two passages for the same query and predict which is more relevant.
Listwise: output a sorted list for a query, optimizing the order of all candidates.
Setwise: identify the single most relevant passage without fully sorting the list, reducing computational cost.
From an efficiency perspective, pointwise models have the lowest latency, while pairwise achieves the best effectiveness but incurs O(N²) comparisons. The production system therefore adopts pointwise modeling for online deployment.
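The trade-offs above follow directly from the shapes of the four formulations, sketched here with toy stand-ins for the underlying model calls (`score`, `prefer`, `order`, and `pick_best` are illustrative, not real APIs):

```python
def pointwise(query, passages, score):            # N independent calls, lowest latency
    return sorted(passages, key=lambda p: score(query, p), reverse=True)

def pairwise(query, passages, prefer):            # O(N^2) comparisons of passage pairs
    wins = {p: 0 for p in passages}
    for i, a in enumerate(passages):
        for b in passages[i + 1:]:
            wins[a if prefer(query, a, b) else b] += 1
    return sorted(passages, key=lambda p: wins[p], reverse=True)

def listwise(query, passages, order):             # one call that emits a full ordering
    return order(query, passages)

def setwise(query, passages, pick_best):          # one call that only names the best passage
    return pick_best(query, passages)

# Toy usage with a lexical-overlap scorer standing in for a ranking model.
docs = ["LPL 2025 final standings", "cooking recipes", "LPL playoffs schedule"]
overlap = lambda q, p: len(set(q.lower().split()) & set(p.lower().split()))
print(pointwise("2025 LPL final ranking", docs, overlap))
```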
Chunk‑Level Indexing
To handle long documents, the system builds a hierarchical index:
Traditional document‑level retrieval based on keyword/embedding matching.
An additional chunk‑level index stores fine‑grained semantic chunks, enabling precise retrieval of specific passages within long documents.
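A simplified sketch of building and querying such a chunk-level index, using a toy embedding function in place of a trained encoder:

```python
# The embedding function below is a toy stand-in for a trained encoder.
def embed(text: str) -> list[float]:
    return [text.count(ch) / max(len(text), 1) for ch in "aeiou"]   # toy 5-dim "vector"

def build_chunk_index(docs: dict[str, str], chunk_size: int = 200, overlap: int = 50):
    index = []
    for doc_id, text in docs.items():
        for start in range(0, len(text), chunk_size - overlap):
            chunk = text[start:start + chunk_size]
            index.append({"doc_id": doc_id, "offset": start,
                          "chunk": chunk, "vector": embed(chunk)})
    return index

def search_chunks(index, query: str, top_k: int = 3):
    q = embed(query)
    dot = lambda v: sum(a * b for a, b in zip(q, v))                # toy similarity
    return sorted(index, key=lambda entry: dot(entry["vector"]), reverse=True)[:top_k]

# Toy usage: index one long document, then retrieve its best-matching chunk.
index = build_chunk_index({"doc1": "a very long report about the 2025 LPL final " * 30})
print(search_chunks(index, "LPL final")[0]["offset"])
```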
Agentic Search Enhancements
Query planning and function calling allow the LLM to decide which tools to invoke. The architecture evolves to an Agentic Search stage where multiple agents cooperate, and a reflection mechanism determines whether additional tool calls are needed.
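The sketch below shows the general shape of such a function-calling loop with reflection; the tool registry and `call_llm` stub are placeholders, not Tencent's actual interfaces.

```python
import json

# The model proposes a tool call, the result is appended to the context, and the
# model then decides whether another call is needed or the answer is final.
TOOLS = {
    "web_search": lambda args: f"results for {args['query']}",
    "add_numbers": lambda args: str(sum(args["numbers"])),
}

def call_llm(messages):
    # Stand-in: a real model would return either a tool call or a final answer.
    return json.dumps({"action": "finish", "answer": "stub answer"})

def agent_loop(user_prompt: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(messages))
        if decision["action"] == "finish":              # reflection: no more tools needed
            return decision["answer"]
        output = TOOLS[decision["action"]](decision.get("args", {}))
        messages.append({"role": "tool", "content": output})
    return "gave up after max_steps tool calls"

print(agent_loop("2025 LPL final match ranking"))
```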
Think‑Free Reasoning Ranking
Two‑stage SFT uses:
Template A (thought + score) during training, so the model learns to reason before scoring.
Template B (score only) at inference, eliminating the costly thought‑generation step.
Experiments on the open‑source BRIGHT benchmark for reasoning‑intensive retrieval show that even small models benefit from distilled reasoning.
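The two templates might look roughly like the following; the wording is illustrative, and only template B is rendered at serving time.

```python
TEMPLATE_A = (   # training only: reasoning followed by the score
    "Query: {query}\nPassage: {passage}\n"
    "Think step by step about whether the passage answers the query, "
    "then output Relevance: <0-3>."
)
TEMPLATE_B = (   # inference: score only, so no chain-of-thought tokens are generated
    "Query: {query}\nPassage: {passage}\n"
    "Output Relevance: <0-3>."
)

def build_serving_prompt(query: str, passage: str) -> str:
    # Only template B runs online; the reasoning learned via template A stays implicit.
    return TEMPLATE_B.format(query=query, passage=passage)
```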
Future Directions
RL‑based Reward Modeling: use ranking‑model outputs as rewards for reinforcement learning of the LLM, creating a closed‑loop improvement cycle (sketched below).
Listwise Ranking: shift from pointwise to listwise modeling so the model can jointly consider multiple candidates, leveraging complementary information across documents.
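One way such a closed loop could be wired: a frozen ranking model scores the passages an LLM-generated answer is grounded on, and that score becomes the reward signal for RL fine-tuning. The policy, ranker, and reward shaping below are illustrative assumptions, not a proposed design.

```python
def ranking_reward(ranker, query, cited_passages):
    """Average reranker relevance of the passages the answer was grounded on."""
    if not cited_passages:
        return 0.0
    return sum(ranker(query, p) for p in cited_passages) / len(cited_passages)

# Toy usage: a stand-in ranker that rewards lexical overlap with the query.
toy_ranker = lambda q, p: len(set(q.split()) & set(p.split())) / max(len(q.split()), 1)
print(ranking_reward(toy_ranker, "2025 LPL final ranking",
                     ["LPL 2025 final standings", "unrelated passage"]))
```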
These research avenues aim to further reduce hallucinations, improve relevance, and enable more efficient large‑scale deployment of AI‑augmented search systems.