Massive Multi-Label Text Classification via Semantic Retrieval and Large AI Model
This article presents a method for massive multi-label text classification on Zhihu content by combining a semantic retrieval model with a proprietary large AI model, detailing the challenges of large label spaces, model architecture, loss optimization, and experimental results showing significant accuracy gains.
Abstract

Zhihu, a high‑quality Chinese Q&A community, accumulates massive heterogeneous content. To improve user experience, it is essential to understand and tag this content precisely, enabling high‑quality distribution to interested users.
Content understanding involves entity extraction, keyword detection, semantic representation, and especially content tagging, which provides fine‑grained semantics for both operations and user interest modeling.
Traditional pre‑trained model‑based tagging works for a small, fixed label set, but scaling to hundreds of thousands or millions of dynamic labels degrades performance and raises training and annotation costs.
This paper first analyzes the limitations of conventional tag classification models and then introduces a large‑model‑driven approach for massive tag classification.
1. Mainstream Tag Classification Methods

Multi‑label text classification typically fine‑tunes a pre‑trained language model (e.g., BERT) and applies a sigmoid‑activated fully‑connected layer to produce a K‑dimensional probability vector, where K is the number of tags. A threshold selects the final tags.
Figure 1: BERT pre‑training and fine‑tuning workflow for multi‑label classification.
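The baseline head described above can be sketched as follows. This is a minimal illustration in NumPy, not the production model: the weights are random stand-ins for a fine-tuned BERT, and `cls` stands in for the [CLS] embeddings a real encoder would produce.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_probs(cls_embedding, W, b):
    # Sigmoid gives an independent probability per tag,
    # unlike a softmax, which would force tags to compete.
    return sigmoid(cls_embedding @ W + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 1000)) * 0.01  # stand-in for a trained FC layer
b = np.zeros(1000)
cls = rng.normal(size=(4, 768))          # stand-in for BERT [CLS] embeddings
probs = multilabel_probs(cls, W, b)      # shape (4, 1000): 4 texts, K=1000 tags
predicted = probs > 0.5                  # threshold selects the final tags
```

Note how the output dimension is tied to K: this is exactly why the approach strains when K grows to hundreds of thousands of tags.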
2. Challenges of Mainstream Methods on Massive Tag Sets

When the tag set reaches tens of thousands or more, several problems arise:
Long training cycles: each tag requires many labeled samples, so total training data scales with the tag count.
Poor generalization on tail tags with few or no labeled examples.
Slow iteration of the tag taxonomy, as adding new tags demands extensive re‑annotation and retraining.
One‑hot encoding fails to exploit semantic relationships among tags.
3. Large‑Model‑Based Massive Tag Classification

Recent large models enable a new solution. We adopt a retrieval‑augmented multi‑label method that combines a semantic retrieval model with Zhihu’s proprietary ZhiHaiTu AI model.
Step 1: Convert the classification task into a retrieval problem. Fine‑tune a pre‑trained semantic retrieval model on annotated data to align texts and tags.
Step 2: Use the fine‑tuned retrieval model to recall candidate tags from the massive tag pool.
Step 3: Feed the original text and the candidate tags as a prompt to the ZhiHaiTu AI model, which selects the correct tags.
Figure 2: Retrieval model + ZhiHaiTu AI model workflow.
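The three steps above can be sketched as a small orchestration loop. The prompt template, `stub_retriever`, and `stub_llm` below are hypothetical placeholders; the real ZhiHaiTu prompt format and model interfaces are not public.

```python
def build_prompt(text: str, candidates: list[str]) -> str:
    # Illustrative template only; the actual ZhiHaiTu prompt is not public.
    tag_list = ", ".join(candidates)
    return (f"Text: {text}\n"
            f"Candidate tags: {tag_list}\n"
            "Select every tag that correctly describes the text.")

def classify(text, retrieve, select, k=10):
    candidates = retrieve(text, k)                 # Step 2: recall from the tag pool
    return select(build_prompt(text, candidates))  # Step 3: LLM picks the tags

# Hypothetical stand-ins for the fine-tuned retriever and the ZhiHaiTu model:
stub_retriever = lambda text, k: ["machine learning", "NLP", "cooking"][:k]
stub_llm = lambda prompt: ["NLP"]
tags = classify("What is BERT fine-tuning?", stub_retriever, stub_llm)
```

The key design point is that the generative model never sees the full tag pool, only the small candidate set the retriever recalls.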
3.1 Model Architecture
The system consists of a semantic retrieval component (based on BGE) and a generative large model. The retrieval model quickly recalls semantically related tags, while the generative model refines the selection.
3.1.1 Retrieval Model

We fine‑tune the BGE (BAAI General Embedding) model on Zhihu data, aligning text and tag embeddings. BGE, pre‑trained with RetroMAE, benefits from its asymmetric encoder‑decoder structure and from hard‑negative mining, improving dense retrieval performance.
Figure 3: RetroMAE training framework.
Hard‑negative mining further enhances the model’s ability to distinguish subtle semantic differences.
3.1.2 Generative Model

After retrieving K candidate tags, we construct a prompt that includes the input text and the candidates, feeding it to the ZhiHaiTu AI model. The model outputs probabilities for each token of a tag; averaging token probabilities yields a tag‑level score.
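The token-averaging step is simple but worth making concrete. The per-token probabilities below are made-up values standing in for what the generative model would emit; only the averaging itself follows the description above.

```python
import numpy as np

def tag_score(token_probs):
    # Average the model's per-token probabilities for one tag.
    # Averaging (rather than multiplying) keeps scores comparable
    # across tags whose names tokenize to different lengths.
    return float(np.mean(token_probs))

# Illustrative: two candidate tags with hypothetical per-token probabilities.
scores = {
    "machine learning": tag_score([0.9, 0.8]),
    "cooking": tag_score([0.1, 0.2, 0.3]),
}
best = max(scores, key=scores.get)  # highest tag-level score wins
```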
3.1.3 Overall Pipeline
Figure 4: End‑to‑end system architecture.
3.2 Experiment Design and Optimizations
We improved the loss function by masking intra‑batch false negatives and enriched tag semantics with explanatory information (e.g., “Apple – Apple Inc., founder Steve Jobs, products iPhone, iPad, Mac, Apple Watch”).
Figure 5: Loss masking process.
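The false-negative masking can be sketched as follows. This is an assumed InfoNCE-style formulation, not Zhihu's exact loss: in-batch columns whose tag is also correct for the current text are excluded from the denominator rather than pushed away as negatives.

```python
import numpy as np

def masked_inbatch_loss(sim, pair_tag, gold):
    """sim: (B, B) similarities between B texts (rows) and the B tags
    paired with each text in the batch (columns). pair_tag[j] is the tag
    id in column j; gold[i] is the set of all correct tag ids for text i.
    Columns whose tag is ALSO correct for text i are masked out instead
    of being treated as in-batch negatives."""
    B = sim.shape[0]
    logits = sim.copy().astype(float)
    for i in range(B):
        for j in range(B):
            if i != j and pair_tag[j] in gold[i]:
                logits[i, j] = -np.inf  # false negative: drop from denominator
    # Cross-entropy against the diagonal positive pair.
    logz = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(logz - np.diag(logits)))

# Toy batch: text 0 is also correctly described by text 1's tag,
# so cell (0, 1) must not be penalized as a negative.
sim = np.array([[5.0, 5.0],
                [0.0, 5.0]])
pair_tag = [0, 1]
gold = [{0, 1}, {1}]
loss = masked_inbatch_loss(sim, pair_tag, gold)
```

Without the mask, the high similarity at cell (0, 1) would be punished even though the tag is genuinely correct, which is exactly the failure mode the masking removes.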
During inference, we encode the query, compute similarity with all tag embeddings, and retrieve the top‑K tags (K≈10 balances recall and downstream generation difficulty).
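The inference-time retrieval step can be sketched with plain NumPy (in production an ANN index would replace the brute-force scan; the function below is an illustrative assumption, not Zhihu's serving code):

```python
import numpy as np

def top_k_tags(query_vec, tag_matrix, k=10):
    # With L2-normalized embeddings, the dot product equals cosine
    # similarity. k ~ 10 balances recall against how many candidates
    # the downstream generative model must sift through.
    q = query_vec / np.linalg.norm(query_vec)
    tags = tag_matrix / np.linalg.norm(tag_matrix, axis=1, keepdims=True)
    scores = tags @ q
    # argpartition avoids a full sort over a massive tag pool.
    top = np.argpartition(-scores, min(k, len(scores) - 1))[:k]
    return top[np.argsort(-scores[top])]  # indices, best first
```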
3.3 Results

On Zhihu’s self‑labeled dataset, the proposed method outperforms a baseline BERT‑fine‑tune approach, achieving ~15‑20% absolute gains in accuracy and coverage for both short (questions) and long (answers/articles) texts.
Table 1: Model performance comparison.
4. Conclusion

Combining a fine‑tuned semantic retrieval model with a large generative AI model yields high accuracy, strong scalability, low annotation cost, and rapid tag‑taxonomy iteration for massive multi‑label classification tasks.
Zhihu Tech Column
Sharing Zhihu tech posts and exploring community technology innovations.