
Zhihu Search Text Relevance Evolution and BERT Knowledge Distillation Practices

This talk by Zhihu search algorithm engineer Shen Zhan details the evolution of text relevance models from TF‑IDF/BM25 to deep semantic matching and BERT, explains the challenges of deploying BERT at scale, and describes practical knowledge‑distillation techniques that improve both online latency and offline storage while maintaining search quality.

DataFunTalk

The presentation begins with an overview of Zhihu's search text relevance, defining relevance as the match between user query intent and retrieved document content, and distinguishing between literal matching and semantic relevance.

It then traces the evolution of relevance models in three stages: (1) early bag‑of‑words approaches using TF‑IDF/BM25, (2) deep semantic matching models such as dual‑tower encoders (e.g., DSSM) and interaction models (e.g., Match‑Pyramid, KNRM), and (3) the adoption of BERT, which provides powerful contextual representations for both representation and interaction models.
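The stage (1) scoring can be written down in a few lines. Below is an illustrative BM25 implementation; the parameters `k1` and `b` and the corpus statistics follow the standard Okapi conventions and are not values from the talk:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25 (illustrative)."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Inverse document frequency with the usual +0.5 smoothing.
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation plus document-length normalization.
        norm_tf = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm_tf
    return score
```

Because scoring is purely literal, a document that paraphrases the query without sharing terms scores zero, which is exactly the gap the later semantic models address.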

The speaker discusses the practical deployment of BERT in Zhihu's search pipeline, noting the high computational cost of interaction models and the trade-off made by representation models, which precompute document vectors offline so that only the query needs to be encoded online.
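A minimal sketch of that offline/online split, assuming the towers produce unit-norm vectors. The hash-seeded `encode` below is only a stand-in for a trained query/document encoder, used so the example is self-contained:

```python
import numpy as np

def encode(text, dim=8):
    """Placeholder for a trained tower: a deterministic pseudo-embedding.
    A real system would run the BERT/DSSM encoder here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Offline: precompute and store document vectors once at indexing time.
docs = ["how to learn nlp", "bert for search ranking", "cooking pasta"]
doc_matrix = np.stack([encode(d) for d in docs])

# Online: encode only the query, then score with a cheap matrix product.
def search(query, top_k=2):
    q = encode(query)
    scores = doc_matrix @ q  # cosine similarity, since vectors are unit-norm
    order = np.argsort(-scores)[:top_k]
    return [(docs[i], float(scores[i])) for i in order]
```

The heavy encoding cost is paid once per document offline; the online path is one query encoding plus a dot product over the index, which is what makes the representation approach serveable at scale.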

Knowledge distillation is introduced as a solution for reducing model size and latency. The talk explains the concepts of soft targets and temperature scaling, and reviews common distillation schemes, such as MiniLM, within the general teacher-student framework.
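The soft-target loss with temperature scaling can be sketched directly. The temperature `T`, the mixing weight `alpha`, and the `T**2` gradient-scale correction below are the standard Hinton-style conventions, not Zhihu-specific settings:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T yields a flatter distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.5):
    """Soft cross-entropy against the teacher at temperature T,
    mixed with ordinary cross-entropy on the ground-truth label."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # The T**2 factor keeps the soft-target gradient scale comparable.
    soft = -float(np.sum(p_teacher * np.log(p_student + 1e-12))) * T**2
    hard = -float(np.log(softmax(student_logits)[hard_label] + 1e-12))
    return alpha * soft + (1 - alpha) * hard
```

Raising the temperature exposes the teacher's relative confidence over wrong classes ("dark knowledge"), which is the extra signal the student learns from beyond the one-hot label.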

Specific distillation experiments are described: using larger teacher models (e.g., RoBERTa‑large) to train a 6‑layer student model, applying Patient KD with combined cross‑entropy and normalized MSE losses, and compressing vector dimensions from 768 to 64 while preserving retrieval performance.
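The combined Patient-KD objective can be sketched as cross-entropy on the teacher's soft predictions plus normalized MSE over matched hidden layers. The every-second-layer mapping (a 6-layer student mimicking layers 2, 4, ..., 12 of a 12-layer teacher) and the weight `beta` are illustrative assumptions, not the exact configuration reported in the talk:

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float) - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def normalized_mse(s, t):
    """MSE between L2-normalized vectors, so only direction is matched."""
    s = s / (np.linalg.norm(s) + 1e-12)
    t = t / (np.linalg.norm(t) + 1e-12)
    return float(np.mean((s - t) ** 2))

def patient_kd_loss(student_logits, teacher_logits,
                    student_layers, teacher_layers, beta=1.0):
    """Soft cross-entropy on teacher predictions plus a 'patient' term
    that aligns student layer i with teacher layer 2i+1 (0-indexed)."""
    ce = -float(np.sum(softmax(teacher_logits) *
                       np.log(softmax(student_logits) + 1e-12)))
    hidden = sum(normalized_mse(student_layers[i], teacher_layers[2 * i + 1])
                 for i in range(len(student_layers)))
    return ce + beta * hidden
```

The same normalization idea carries over to compressing output vectors from 768 to 64 dimensions: matching normalized directions rather than raw magnitudes lets a much smaller embedding preserve the ranking behavior of the teacher.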

Results show significant gains: online latency reduced by ~40 ms, GPU usage halved, storage for semantic indexes cut by up to 75 %, and offline indexing time reduced to one‑quarter, all with minimal loss in relevance metrics (nDCG comparable to or better than the original BERT‑base).

The session concludes with a summary of the benefits of BERT distillation for both online serving and offline indexing, and thanks the audience.

machine learning · model compression · knowledge distillation · BERT · search relevance · semantic retrieval
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
