
Unlocking Data Potential: Automatic Data Augmentation, Denoising, Active Learning, and Data Splitting

The talk explains how to maximize the value of training data, covering background on model generalization, automatic data augmentation techniques, denoising strategies, active learning for selecting unlabeled samples, and robust data-splitting methods, and offering practical guidelines for AI practitioners.


The presentation begins with a background on the core problem of machine learning—reducing generalization error—emphasizing that both model simplicity and sufficient data volume are essential for achieving low error.

It then discusses the two fundamental questions about data quantity: whether more data is always better (considering feature width and depth) and how to estimate the required data size for a task, introducing the 10‑samples‑per‑parameter rule of thumb.
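The 10-samples-per-parameter heuristic can be written as a one-liner. The function name and default below are illustrative, not from the talk:

```python
def required_samples(num_parameters: int, samples_per_param: int = 10) -> int:
    """Rough training-set size estimate from the 10-samples-per-parameter rule of thumb."""
    return num_parameters * samples_per_param

# A small classifier with 50,000 trainable parameters:
print(required_samples(50_000))  # prints 500000
```

Like any rule of thumb, this only gives an order-of-magnitude starting point; the true requirement depends on task difficulty and data redundancy.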

Next, the speaker outlines the four key steps for extracting the maximum value from data: automatic data augmentation, automatic denoising, active‑learning‑based selection of unlabeled data, and intelligent data splitting.

Automatic Data Augmentation – Augmentation improves model robustness by introducing invariant variations; online augmentation is preferred over offline because it provides stochastic perturbations that help escape local minima. State‑of‑the‑art methods such as Fast AutoAugment and AutoAugment are highlighted, along with Tencent’s internal augmentation library that combines multiple techniques with searchable probability and intensity parameters. Specific NLP schemes (Tree EDA, CBERT) and CV schemes (label‑aware augmentations, Bayesian‑guided search) are described.
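The idea of a policy with searchable probability and intensity parameters, applied stochastically at training time (online), can be sketched as follows. The ops and policy values here are illustrative assumptions, not Tencent's actual library:

```python
import random
import numpy as np

# Hypothetical ops: each takes an image array in [0, 1] and a magnitude in [0, 1].
def brightness(x, m):
    return np.clip(x + 0.5 * m, 0.0, 1.0)

def flip_horizontal(x, m):   # magnitude unused for a flip
    return x[:, ::-1]

def add_noise(x, m):
    return np.clip(x + np.random.normal(0.0, 0.1 * m, x.shape), 0.0, 1.0)

# A searchable policy: (op, probability, magnitude). AutoAugment-style methods
# search over exactly these two scalars per operation.
policy = [(brightness, 0.5, 0.4), (flip_horizontal, 0.5, 0.0), (add_noise, 0.3, 0.5)]

def augment_online(x, policy):
    """Apply each op stochastically, so every epoch sees a fresh perturbation."""
    for op, prob, mag in policy:
        if random.random() < prob:
            x = op(x, mag)
    return x
```

Because the perturbation is re-sampled on every pass, the model never sees an identical augmented copy twice, which is the stochasticity the talk credits with helping escape local minima.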

Automatic Denoising – Denoising is framed as quality estimation. Three approaches are compared: (1) training a binary classifier on manually labeled good/bad samples, (2) training a ranking model based on human‑assigned scores, and (3) direct regression using objective metrics (e.g., HTER for MT). The preferred Predictor‑Estimator framework is introduced, where a predictor extracts embeddings and mismatch features, and an estimator scores data quality. Applications to ImageNet denoising, including label errors, multi‑object noise, and fine‑grained label confusion, are presented along with a custom loss (MixupCrossEntropy) to handle multi‑label cases.
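The talk does not spell out MixupCrossEntropy, but a plausible minimal form is cross-entropy against a convex mix of two one-hot targets, so a multi-object image can carry weight on both plausible labels. This sketch is an assumption about its shape, not the speaker's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mixup_cross_entropy(logits, target_a, target_b, lam):
    """Cross-entropy against lam * target_a + (1 - lam) * target_b.

    For an image containing two objects, lam splits the supervision
    between the two plausible class labels instead of forcing one."""
    log_p = np.log(softmax(logits) + 1e-12)
    soft_target = lam * target_a + (1.0 - lam) * target_b
    return -(soft_target * log_p).sum(axis=-1).mean()
```

With `lam = 1.0` this reduces to ordinary cross-entropy against `target_a`, so the loss degrades gracefully for clean single-label samples.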

Active Learning for Unlabeled Data Selection – Active learning selects the most informative samples for labeling, achieving near‑full‑dataset performance with far fewer examples. The speaker cites experiments showing that 15k actively selected samples can approximate the results of 30k random samples, and notes the workflow's suitability for fast‑iteration pipelines where labeling and training occur simultaneously.
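The talk does not name a specific acquisition function; a common baseline is uncertainty sampling by predictive entropy, sketched here as an illustrative example:

```python
import numpy as np

def entropy(probs):
    """Predictive entropy per sample; high entropy means the model is unsure."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def select_for_labeling(probs, budget):
    """Return indices of the `budget` most uncertain unlabeled samples.

    probs: (n_samples, n_classes) predicted class probabilities
    from the current model on the unlabeled pool."""
    scores = entropy(probs)
    return np.argsort(-scores)[:budget]
```

Each round, the current model scores the unlabeled pool, the top-`budget` samples go to annotators, and the model is retrained, which is exactly the label-while-training loop the speaker describes.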

Data Splitting and Adversarial Validation – To address the gap between validation and test performance, adversarial validation is proposed: a binary classifier distinguishes training from test data after stripping labels. A high classification accuracy indicates distribution drift, prompting the use of the classifier to select training samples that better match the test distribution.
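Adversarial validation as described can be sketched end to end. The tiny hand-rolled logistic regression below stands in for whatever binary classifier one actually uses; all names are illustrative:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=500):
    """Minimal logistic regression via gradient descent (stand-in classifier)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def adversarial_validation(X_train, X_test):
    """Label training rows 0 and test rows 1, then try to tell them apart.

    Returns (accuracy, P(test-like) for each training row). Accuracy near 0.5
    means the two sets are indistinguishable; high accuracy signals
    distribution drift, and the per-row scores can then be used to pick
    training samples that better resemble the test distribution."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    w, b = fit_logistic(X, y)
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    acc = ((p > 0.5) == y).mean()
    return acc, p[: len(X_train)]
```

Sorting training rows by their "test-likeness" score gives both a drift diagnostic and a principled way to carve out a validation split that mirrors the test set.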

The talk concludes by summarizing that these four steps—augmentation, denoising, active selection, and robust splitting—constitute a comprehensive strategy for pushing data to its performance limits in AI projects.

Machine Learning · data augmentation · AI · data quality · active learning · automatic denoising
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
