Sentiment Classification and Topic Clustering for NetEase Cloud Music Comments
To improve comment handling at NetEase Cloud Music, the authors combine active‑learning‑driven relabeling, domain‑specific MLM pretraining, contrastive‑learning‑based sample expansion, and a shared BERT encoder for multi‑task learning. Together these raise sentiment‑classification precision and recall above 90% and lift the moderation clean‑rate by more than 50%. For short‑text topic clustering, they use prompt‑generated story themes, IP‑focused classifiers, and hotword aggregation to support scalable, theme‑aware distribution.
NetEase Cloud Music treats user comments as a core asset and seeks to understand their content to improve distribution and user experience. The core challenges in comment sentiment classification stem from the short, ambiguous nature of the text, difficulty in aligning manual labels, and the scarcity of positive samples, which together make accurate labeling time‑consuming and noisy.
To address these issues, the authors employ four strategies: (1) active learning to identify and re‑label hard or contradictory samples; (2) domain‑specific continued pretraining with a masked language model (MLM) objective on a 2‑billion‑token corpus of community texts (comments, video titles, square posts, lyrics); (3) sample expansion and denoising via contrastive learning with RoFormer to generate high‑quality synthetic positives; and (4) multi‑task learning that shares a BERT encoder across label‑specific classifiers, which regularizes training and mitigates overfitting.
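The active‑learning step can be illustrated with a minimal sketch. The summary does not say how hard or contradictory samples are scored, so the selection rule here is an assumption: a sample is "hard" when the model's probability sits near 0.5, and "contradictory" when the model confidently disagrees with the human label. The `Sample` class, the function name, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    human_label: int    # 1 = positive sentiment, 0 = not positive
    model_prob: float   # model's predicted probability of label 1


def select_for_relabeling(samples, uncertain_band=0.15, conflict_conf=0.9):
    """Return (sample, reason) pairs worth sending back to annotators.

    Both thresholds are illustrative assumptions, not values from the article.
    """
    picked = []
    for s in samples:
        predicted = 1 if s.model_prob >= 0.5 else 0
        confidence = max(s.model_prob, 1.0 - s.model_prob)
        # "hard": the model is close to undecided
        is_hard = abs(s.model_prob - 0.5) <= uncertain_band
        # "contradictory": the model is confident and disagrees with the label
        is_conflict = predicted != s.human_label and confidence >= conflict_conf
        if is_hard or is_conflict:
            picked.append((s, "hard" if is_hard else "contradictory"))
    return picked


if __name__ == "__main__":
    batch = [
        Sample("this song saved my summer", 1, 0.97),
        Sample("well, that was something", 1, 0.55),
        Sample("crying on the subway again", 0, 0.95),
    ]
    for sample, reason in select_for_relabeling(batch):
        print(reason, "->", sample.text)
```

In a loop of this kind, the selected samples would be relabeled, merged back into the training set, and the model retrained before the next selection round.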
These optimizations push precision and recall for both the positive and negative comment categories above 90%, an absolute gain of 2 to 3 percentage points. Applying the negative‑label model to comment moderation raises the online comment clean‑rate by over 50%, and the positive‑label model accumulates a large pool of high‑quality comments for downstream distribution.
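For reference, per‑category precision and recall can be computed as in the short sketch below. The label names (`"pos"`, `"neg"`, `"other"`) and the toy data are assumptions for illustration only and are not drawn from the authors' evaluation set.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one category, treating it as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy labels, invented for this example
    y_true = ["pos", "pos", "neg", "other", "neg"]
    y_pred = ["pos", "other", "neg", "pos", "neg"]
    for label in ("pos", "neg"):
        p, r = precision_recall(y_true, y_pred, label)
        print(f"{label}: precision={p:.2f} recall={r:.2f}")
```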
For comment topic clustering, the short length and lack of external context render standard clustering ineffective. The solution is decomposed into subtasks: (1) automatic generation of story‑type themes via prompt‑learning (PromptCLUE) with ~90% human‑evaluated accuracy; (2) IP‑related topics built by collecting IP‑linked comments, applying rule‑based rewriting for generalization, and fine‑tuning a text‑pair classifier; (3) hot‑topic detection through a hotword aggregation method that flags sudden spikes in comment volume.
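The hot‑topic subtask (3) could be sketched roughly as follows. The summary does not describe the actual aggregation method, so this is only one plausible reading: count how many comments mention each word per time window, flag words whose count in the latest window jumps well above their historical average, and group the latest comments under each flagged word. The function name, thresholds, and whitespace tokenization are all assumptions; Chinese comments would need a proper word segmenter.

```python
from collections import Counter


def detect_hotwords(windows, min_count=3, spike_ratio=2.0):
    """Group current-window comments under words whose frequency spiked.

    windows: list of comment lists, oldest first; the last one is "current".
    min_count and spike_ratio are illustrative assumptions.
    """
    *history, current = windows

    def count_words(comments):
        counts = Counter()
        for comment in comments:
            # Count each word once per comment (comment-level frequency)
            counts.update(set(comment.lower().split()))
        return counts

    history_counts = [count_words(w) for w in history]
    current_counts = count_words(current)

    hot = {}
    for word, n in current_counts.items():
        if history_counts:
            baseline = sum(h[word] for h in history_counts) / len(history_counts)
        else:
            baseline = 0.0
        # +1 smoothing keeps brand-new words from dividing by zero
        if n >= min_count and n / (baseline + 1.0) >= spike_ratio:
            hot[word] = [c for c in current if word in c.lower().split()]
    return hot


if __name__ == "__main__":
    windows = [
        ["love this song", "this melody is great"],
        ["this song again", "nice voice"],
        ["concert tonight was amazing", "the concert made me cry",
         "best concert ever", "this song is nice"],
    ]
    for word, comments in detect_hotwords(windows).items():
        print(word, len(comments))
```

A production version would likely also filter stop words and merge overlapping hotwords into a single topic, which this sketch leaves out.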
The combined pipeline enables scalable, theme‑aware comment distribution, supporting fresh user engagement and future explorations in comment relevance and generation.
NetEase Cloud Music Tech Team
Official account of NetEase Cloud Music Tech Team