
Improving Zhihu Search: Query Understanding, Term Weighting, Synonym Expansion, Query Rewriting, and Semantic Retrieval

This article details Zhihu's search engineering advances over the past year, covering long‑tail query challenges, term‑weight calculation, synonym expansion, query rewriting with translation models and reinforcement learning, and semantic retrieval using BERT‑based embeddings, while outlining future research directions.

DataFunTalk

As Zhihu’s user base and product offerings grew, the search system faced increasing long‑tail query challenges, making query understanding essential for improving recall quality.

Long‑tail queries exhibit input errors, redundant expressions, and semantic gaps; examples include misspellings like "塞尔维雅" for "塞尔维亚" and ambiguous phrases such as "高跟鞋消音".

To address these issues, Zhihu employs three sub‑tasks: automatic spelling correction, term‑weight calculation, and synonym expansion. Term weights are first estimated with inverse document frequency (IDF) and then refined with click‑through statistics at the n‑gram level, allowing the weights to adapt dynamically to query context.
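The IDF stage of this pipeline can be sketched in a few lines. The corpus statistics below are hypothetical, and the click‑through refinement step is omitted; this only shows how rare terms receive higher weights than common ones.

```python
import math

def idf_term_weights(query_terms, doc_freq, n_docs):
    """Assign each query term a weight from its inverse document frequency.

    Rare terms (low document frequency) get higher raw weights than
    common function-like terms; weights are normalized to sum to 1.
    """
    raw = {t: math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1)) + 1.0
           for t in query_terms}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

# Hypothetical statistics: "塞尔维亚" is rare, "的" is extremely common.
doc_freq = {"塞尔维亚": 120, "旅游": 45000, "的": 980000}
weights = idf_term_weights(["塞尔维亚", "旅游", "的"], doc_freq, 1_000_000)
```

In a production system these static weights would then be blended with n‑gram click statistics so that the same term can be weighted differently in different queries.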

Synonym expansion relies on internal logs (user sessions, query clicks) and pre‑trained word embeddings to discover parallel terms; external resources are also mined using rule‑based patterns.
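The embedding side of synonym mining can be approximated by a cosine‑similarity lookup over the vocabulary. The 3‑dimensional vectors below are toy values; real candidates would come from embeddings trained on session and click logs, followed by the filtering the article describes.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def synonym_candidates(term, embeddings, threshold=0.8):
    """Rank other vocabulary terms by cosine similarity to `term`,
    keeping those above the threshold as synonym candidates."""
    base = embeddings[term]
    scored = [(other, cosine(base, vec))
              for other, vec in embeddings.items() if other != term]
    return sorted([p for p in scored if p[1] >= threshold],
                  key=lambda p: -p[1])

# Toy embeddings: "laptop" should surface as a candidate for "notebook".
emb = {
    "notebook": [0.90, 0.10, 0.20],
    "laptop":   [0.88, 0.12, 0.25],
    "banana":   [0.10, 0.90, 0.30],
}
cands = synonym_candidates("notebook", emb)
```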

Query rewriting treats the problem as a monolingual translation task, using a Google‑style NMT model trained on parallel query → query data extracted from user behavior. The pipeline includes n‑gram language‑model filtering, relevance filtering, and BPE tokenization.
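The n‑gram language‑model filter in that pipeline can be illustrated with a tiny add‑one‑smoothed bigram model: candidate rewrites whose length‑normalized log‑probability is low are likely disfluent and get dropped. The query‑log corpus here is made up for illustration.

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over a tokenized corpus
    (add-one smoothing is applied at scoring time)."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def lm_logprob(sent, uni, bi, vocab_size):
    """Length-normalized log-probability under the smoothed bigram LM."""
    toks = ["<s>"] + sent
    lp = 0.0
    for a, b in zip(toks, toks[1:]):
        lp += math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
    return lp / len(sent)

# Hypothetical query log used as LM training data.
corpus = [["how", "to", "learn", "python"],
          ["how", "to", "learn", "java"],
          ["learn", "python", "fast"]]
uni, bi = train_bigram_lm(corpus)
V = len(uni)

fluent = lm_logprob(["how", "to", "learn", "python"], uni, bi, V)
garbled = lm_logprob(["python", "how", "learn", "to"], uni, bi, V)
```

A rewrite pair would be kept only if the rewritten query scores above some fluency threshold, in addition to passing the relevance filter.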

Reinforcement learning (policy gradient) fine‑tunes the rewrite model by defining the search system as an environment and the rewrite model as an agent; a value network predicts expected reward to mitigate sparse‑reward issues, resulting in a >50% increase in rewrite‑derived rewards.
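The core of such a policy‑gradient update can be shown on a toy softmax policy over candidate rewrites. The baseline value here stands in for the value network's reward prediction; learning rate, rewards, and the two‑action setup are all illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, action, reward, baseline, lr=0.5):
    """One REINFORCE update with a value baseline.

    advantage = reward - baseline reduces gradient variance, which is
    the role the value network plays against the sparse search reward.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [l + lr * advantage * ((1.0 if i == action else 0.0) - probs[i])
            for i, l in enumerate(logits)]

# Two candidate rewrites; rewrite 1 earns a reward above the predicted
# baseline, so its selection probability should rise after the update.
logits = [0.0, 0.0]
before = softmax(logits)[1]
logits = reinforce_step(logits, action=1, reward=1.0, baseline=0.3)
after = softmax(logits)[1]
```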

Semantic retrieval represents both queries and documents as vectors and performs nearest‑neighbor search. Representation‑based models (e.g., BERT encoders with cosine similarity) are preferred for large‑scale indexing, supplemented by pre‑training enhancements such as mask‑token training on Zhihu data and hard‑negative mining via clustering.
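Once queries and documents share a vector space, retrieval reduces to nearest‑neighbor search under cosine similarity. The brute‑force scan below makes the idea concrete; the vectors are placeholders for encoder outputs, and a real index would use an approximate nearest‑neighbor structure rather than scanning every document.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query_vec, doc_vecs, k=2):
    """Brute-force cosine nearest neighbors over unit-normalized vectors.
    Production systems replace this linear scan with an ANN index."""
    q = normalize(query_vec)
    scored = [(doc_id, sum(a * b for a, b in zip(q, normalize(v))))
              for doc_id, v in doc_vecs.items()]
    return sorted(scored, key=lambda p: -p[1])[:k]

# Hypothetical 2-d encoder outputs for three indexed documents.
docs = {"d1": [0.9, 0.10], "d2": [0.2, 0.95], "d3": [0.7, 0.30]}
hits = top_k([1.0, 0.05], docs, k=2)
```

Because only the query needs encoding at serving time, the representation‑based design keeps document vectors precomputed in the index, which is what makes it practical at Zhihu's scale.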

Future work includes applying GANs for faster reward estimation in query rewriting, compressing pre‑trained models for broader deployment, and extending reinforcement‑learning techniques to other components like term‑weight estimation.

Tags: NLP, reinforcement learning, query understanding, semantic retrieval, search, query rewriting, term weighting, synonym expansion
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
