
Advances in Query Understanding and Semantic Retrieval at Zhihu Search

This article details Zhihu Search's engineering solutions for long‑tail query challenges, covering historical development, term weighting, synonym expansion, query rewriting with reinforcement learning, and semantic recall using BERT‑based models, while also outlining future research directions such as GAN‑based rewriting and lightweight pre‑training.

DataFunSummit

Zhihu Search has evolved since 2016, replacing external systems in 2017 and forming an algorithm team in early 2018 that rapidly iterated on query understanding techniques to address the growing long‑tail query problem inherent to a Q&A platform.

Key challenges of long‑tail queries include input errors, redundant expressions, and semantic gaps; solutions involve automatic correction, term weight calculation using IDF and click‑through statistics, and n‑gram based dynamic weighting.
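The IDF-based starting point described above can be sketched as follows. This is a minimal illustration, not Zhihu's production code: the corpus statistics (`doc_freq`, `num_docs`) and the smoothing constants are hypothetical, and the real system layers click-through statistics and n-gram aggregation on top of this base signal.

```python
import math

def idf_term_weights(query_terms, doc_freq, num_docs):
    """Weight each query term by smoothed inverse document frequency,
    then normalize so the weights over the query sum to 1."""
    raw = {t: math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1)) + 1
           for t in query_terms}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

# Toy corpus statistics: "how" appears in most documents, "transformer"
# in few, so "transformer" should dominate the query's weight mass.
doc_freq = {"how": 9000, "train": 3000, "transformer": 50}
weights = idf_term_weights(["how", "train", "transformer"], doc_freq, 10000)
```

Rare, information-carrying terms receive the largest share of the weight, which is exactly why pure IDF is a reasonable first baseline before behavioral signals are added.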

The query processing pipeline now comprises multiple sub‑tasks—error correction, tokenization, term weighting, synonym expansion, entity and intent recognition—producing a structured query for downstream retrieval.

Recall combines three queues: two inverted‑index based (original and rewritten queries) and one embedding‑based vector search, merging results before a final ranking stage.
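A merge of the three recall queues might look like the sketch below. The queue contents and scoring are illustrative assumptions; the article does not specify how duplicates across queues are resolved, so this version simply keeps the highest score per document before handing the merged list to ranking.

```python
def merge_recall_queues(*queues, limit=10):
    """Merge candidate lists from multiple recall queues (original query,
    rewritten query, vector search). Each queue yields (doc_id, score)
    pairs; duplicates keep their highest score, and the merged list is
    sorted by score for the downstream ranking stage."""
    best = {}
    for queue in queues:
        for doc_id, score in queue:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: -kv[1])[:limit]

# Hypothetical candidates from the three queues; d2 appears twice.
original  = [("d1", 0.9), ("d2", 0.5)]
rewritten = [("d2", 0.7), ("d3", 0.6)]
vector    = [("d4", 0.8)]
merged = merge_recall_queues(original, rewritten, vector)
```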

Term weighting started with IDF, then incorporated click data and n‑gram aggregation, eventually moving to embedding‑driven dynamic weights predicted by an MLP using query‑contextualized vectors.
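The final, embedding-driven stage can be illustrated with a tiny forward pass: an MLP maps each term's query-contextualized vector to a scalar, and a softmax over the query's terms yields dynamic weights. The dimensions and parameters below are random placeholders; in the described system the vectors come from the query encoder and the MLP is trained, not sampled.

```python
import numpy as np

def mlp_term_weights(term_vecs, w1, b1, w2, b2):
    """Predict a dynamic weight per query term from its contextualized
    vector: one hidden ReLU layer, a scalar head, then a softmax over
    the query's terms so the weights sum to 1."""
    hidden = np.maximum(0.0, term_vecs @ w1 + b1)   # (n_terms, hidden_dim)
    logits = (hidden @ w2 + b2).ravel()             # (n_terms,)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy contextualized vectors for a 3-term query (dim 8) and random
# illustrative parameters.
rng = np.random.default_rng(0)
term_vecs = rng.normal(size=(3, 8))
w1, b1 = rng.normal(size=(8, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
weights = mlp_term_weights(term_vecs, w1, b1, w2, b2)
```

Because the input vectors are contextualized, the same surface term can receive different weights in different queries, which static IDF cannot do.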

Synonym expansion relies on internal logs (session and click data) and pre‑trained word embeddings, supplemented by external resources and rule‑based extraction from encyclopedic patterns.
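The embedding side of that pipeline reduces to nearest-neighbor lookup in the word-vector space. The three-dimensional toy vectors below are fabricated for illustration; in the described setup candidates found this way would be further validated against session and click co-occurrence before entering the synonym dictionary.

```python
import numpy as np

def nearest_synonyms(word, embeddings, top_k=2):
    """Rank candidate synonyms for `word` by cosine similarity in a
    pre-trained word-embedding space."""
    query = embeddings[word]
    scores = {}
    for other, vec in embeddings.items():
        if other == word:
            continue
        scores[other] = float(
            query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy embeddings: "laptop" is deliberately close to "notebook" and far
# from "apple", so "notebook" should rank first.
embeddings = {
    "laptop":   np.array([0.90, 0.10, 0.00]),
    "notebook": np.array([0.85, 0.15, 0.05]),
    "apple":    np.array([0.10, 0.20, 0.90]),
}
candidates = nearest_synonyms("laptop", embeddings, top_k=1)
```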

Query rewriting treats the task as a monolingual translation problem using a Google‑style NMT model, refined with parallel query‑query data mined from user behavior; reinforcement learning (policy gradient) with a value network further improves rewrite quality and online coverage.
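The policy-gradient idea can be shown in miniature with a single REINFORCE update over a handful of candidate rewrites. This is a schematic of the update rule only, under assumed toy numbers: the real system operates over sequence decoders, and its reward and value-network baseline are learned rather than given constants.

```python
import numpy as np

def reinforce_step(logits, action, reward, baseline, lr=0.1):
    """One REINFORCE (policy-gradient) step for a categorical policy over
    candidate rewrites: nudge the logits so the sampled rewrite's
    log-probability rises in proportion to its advantage
    (reward minus the value-network baseline)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -probs                    # gradient of log pi(action) w.r.t. logits
    grad[action] += 1.0
    return logits + lr * (reward - baseline) * grad

# Start uniform over 3 candidate rewrites; rewrite 2 earns a reward above
# the baseline, so its probability should increase after the update.
logits = np.zeros(3)
updated = reinforce_step(logits, action=2, reward=1.0, baseline=0.4)
```

The baseline term is what the value network supplies: it reduces the variance of the gradient estimate without changing its expectation.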

Semantic recall represents queries and documents as vectors and performs nearest‑neighbor search; representation‑based models (e.g., BERT encoders with pooling) are used due to scalability, with additional optimizations such as mask‑token pre‑training on Zhihu data and hard‑negative mining via clustering.
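The representation-based recall path amounts to "pool encoder outputs, then search by cosine similarity," as in the sketch below. The random "token" vectors stand in for real BERT outputs, and the brute-force scan stands in for the approximate-nearest-neighbor index a production system would use over millions of documents.

```python
import numpy as np

def mean_pool(token_vecs):
    """Collapse per-token encoder outputs into one fixed-size vector."""
    return token_vecs.mean(axis=0)

def nearest_docs(query_vec, doc_vecs, top_k=2):
    """Brute-force cosine nearest-neighbor search over document vectors;
    at scale this would be an approximate index built on the same vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:top_k]

rng = np.random.default_rng(1)
query_vec = mean_pool(rng.normal(size=(5, 16)))  # 5 "tokens", dim 16
doc_vecs = rng.normal(size=(100, 16))            # 100 candidate documents
top = nearest_docs(query_vec, doc_vecs)
```

The scalability argument in the text follows directly from this structure: document vectors are computed once offline, so only the query must be encoded at serving time.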

Future work includes applying GANs for faster reward estimation in query rewriting, developing smaller pre‑trained models for broader deployment, and extending reinforcement‑learning techniques to other components like term weighting.

Tags: reinforcement learning, semantic search, query understanding, BERT, query rewriting, embedding retrieval, term weighting
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
