
Data‑Centric AI Practices for Content Moderation at NetEase Yidun

This article presents NetEase Yidun’s data‑centric AI approach to content moderation: the background of Data‑Centric AI, the business and data challenges specific to content safety, the full data pipeline (collection, labeling, augmentation, selection, cleaning, iteration, and testing), and the role of self‑, semi‑, and weakly supervised learning in improving algorithm performance.

DataFunSummit

With the rapid development of AI, NetEase Yidun leverages Data‑Centric AI to improve the efficiency of harmful content interception, emphasizing the need to understand data scenarios, define and manage data, and maximize its value for algorithmic innovation.

The concept of Data‑Centric AI, originally proposed by Andrew Ng, highlights that data contributes the most to AI performance; however, both academia and industry often under‑invest in data, focusing instead on models and solutions.

In the content moderation domain, massive internet data streams contain a tiny fraction of violations, leading to extreme long‑tail distributions, highly similar non‑violating features, tiny target objects, and open‑domain categories that constantly evolve.

Yidun’s data workflow includes:

Data definition and characterization to reduce model and labeling difficulty.

Data collection and generation to support cold‑start and continuous expansion.

Multi‑level labeling with refined documentation, model‑assisted pre‑labeling, and selective human annotation.

Data augmentation, selection, and cleaning to ensure high‑quality training samples.

Iterative data cycles that integrate semi‑supervised, self‑supervised, and weakly supervised methods for stability, domain adaptation, and fine‑grained supervision.
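The iterative data cycle described above is commonly driven by confidence‑thresholded pseudo‑labeling: the current model scores unlabeled traffic, and only high‑confidence predictions are promoted to training labels for the next round. A minimal sketch of that selection step (the function name, data layout, and threshold are illustrative assumptions, not Yidun’s actual pipeline):

```python
def select_pseudo_labels(predictions, threshold=0.95):
    """Keep only unlabeled samples the model is confident about.

    `predictions` is a list of (sample_id, predicted_label, confidence)
    tuples from the current model. Samples at or above the confidence
    threshold become pseudo-labeled training data for the next round;
    the rest wait for a stronger model or human annotation.
    """
    return [
        (sample_id, label)
        for sample_id, label, confidence in predictions
        if confidence >= threshold
    ]
```

In practice the threshold trades label quantity against label noise: lowering it grows the pseudo‑labeled set but risks reinforcing the model’s own mistakes, which is why such loops are usually paired with the cleaning and validation stages listed above.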

Testing and validation align data and metrics with real‑world deployment goals, ensuring that offline performance translates to online effectiveness.
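One concrete way to align offline metrics with deployment, hinted at above, is to weight per‑class test performance by the class frequencies actually observed in online traffic rather than by the (often balanced) offline test set. A hedged sketch of that re‑weighting (names and numbers are illustrative, not taken from the talk):

```python
def deployment_weighted_accuracy(per_class_accuracy, online_class_freq):
    """Aggregate offline per-class accuracy using online traffic shares.

    `per_class_accuracy` maps class name -> accuracy on the offline test
    set; `online_class_freq` maps class name -> observed online volume.
    Weighting by online volume makes the headline number reflect the
    distribution the model will actually face in production.
    """
    total = sum(online_class_freq.values())
    return sum(
        per_class_accuracy[cls] * freq / total
        for cls, freq in online_class_freq.items()
    )
```

Because violations are a tiny fraction of traffic, an unweighted offline average can look strong while the deployed system still misses the rare classes that matter; this kind of re‑weighting makes that gap visible before launch.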

The Q&A section discusses effective integration of model pre‑labeling with human annotation and strategies for handling ambiguous class boundaries, including hierarchical model design and retrieval‑based augmentation.
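A common way to integrate model pre‑labeling with human annotation, as discussed in the Q&A, is a three‑way confidence band: confident violations are handled automatically, confident clean content passes, and only the ambiguous middle band is routed to human reviewers. A minimal sketch (the band thresholds and action names are illustrative assumptions):

```python
def route_for_review(confidence, low=0.3, high=0.9):
    """Route a moderation decision based on model confidence.

    Scores at or above `high` are treated as confident violations,
    scores at or below `low` as confident clean content, and everything
    in between is escalated to human annotators -- concentrating scarce
    labeling effort exactly where the class boundary is ambiguous.
    """
    if confidence >= high:
        return "auto_block"
    if confidence <= low:
        return "auto_pass"
    return "human_review"
```

Tightening the band shrinks the human workload at the cost of more automated errors near the boundary, so the thresholds are typically tuned against annotation capacity and the cost of missed violations.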

Overall, Yidun’s Data‑Centric AI framework aims to minimize resource consumption while maximizing data value across the entire AI lifecycle for content safety.

machine learning · data management · self-supervised learning · content moderation · semi-supervised learning · Data-Centric AI · algorithm innovation
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
