How a Simple Learning‑Rate Trick Detects 90% of Noisy Labels in Image Data
Deep neural networks trained on large‑scale, weakly labeled image data suffer from noisy annotations that degrade performance. A simple algorithm that adjusts the learning rate during training can automatically identify up to 90% of the noisy samples, improving dataset cleanliness and model accuracy without manual intervention.
Background
Obtaining high‑confidence annotations for massive datasets is a major challenge for supervised learning; noisy labels in the training set can severely reduce model accuracy. A simple, efficient noisy‑label detection algorithm proposed by the Alibaba Taobao technology team can reveal about 90% of noisy labels simply by adjusting the learning rate during training.
Solution Approach
We surveyed state‑of‑the‑art papers on noisy‑sample detection and robust training, including Influence Functions, CurriculumNet, and MentorNet. These works inspire a strategy that leverages the loss distribution of samples across different training phases to identify likely noisy instances.
Algorithm Design
The algorithm consists of three stages:
Stage 1: Train a model to convergence with a fixed learning rate, allowing the model to overfit.
Stage 2: Apply a cyclic learning‑rate schedule that repeatedly moves the model between under‑fitting and over‑fitting. While the model under‑fits, noisy samples show high loss and clean samples show low loss; as it over‑fits, it memorizes the noisy samples and their loss drops as well. Aggregating the mean and variance of each sample's loss across cycles therefore separates the two groups: samples whose loss statistics are large are flagged as noisy.
Stage 3: Remove the identified noisy samples and retrain the model on the cleaned dataset.
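The three stages can be illustrated with a small, framework‑free Python sketch. It is not the team's implementation: the linear decay inside each cycle, the function names (cyclic_lr, loss_statistics, flag_noisy), the choice to rank by mean loss, and the fixed noise_fraction are all assumptions made for illustration, and the loss history is toy data standing in for losses recorded during Stage 2.

```python
from statistics import mean, pvariance


def cyclic_lr(step, cycle_len, lr_max, lr_min):
    """Learning rate that restarts at lr_max each cycle and decays linearly to lr_min.

    The linear shape is an assumption; the source only says the schedule is cyclic.
    """
    position = (step % cycle_len) / cycle_len  # in [0, 1) within the current cycle
    return lr_max - (lr_max - lr_min) * position


def loss_statistics(loss_history):
    """Per-sample (mean, variance) of the loss over the recorded epochs.

    loss_history[e][i] is the loss of sample i at epoch e of the cyclic stage.
    """
    n_samples = len(loss_history[0])
    stats = []
    for i in range(n_samples):
        losses = [epoch[i] for epoch in loss_history]
        stats.append((mean(losses), pvariance(losses)))
    return stats


def flag_noisy(loss_history, noise_fraction):
    """Return the indices of the samples with the largest mean loss.

    Ranking by mean loss and flagging a fixed fraction are assumptions.
    """
    stats = loss_statistics(loss_history)
    n_flag = round(noise_fraction * len(stats))
    ranked = sorted(range(len(stats)), key=lambda i: stats[i][0], reverse=True)
    return sorted(ranked[:n_flag])


# Toy loss history (4 epochs x 5 samples); samples 1 and 3 stay hard to fit.
history = [
    [0.9, 2.5, 0.8, 2.9, 1.0],    # high learning rate: under-fitting
    [0.4, 2.1, 0.3, 2.4, 0.5],
    [0.1, 1.2, 0.1, 1.5, 0.2],
    [0.05, 0.3, 0.05, 0.4, 0.1],  # low learning rate: over-fitting
]

# Stage 2: flag the samples whose loss stays high across the cycle.
noisy = flag_noisy(history, noise_fraction=0.4)

# Stage 3: keep only the remaining samples for retraining.
clean_indices = [i for i in range(len(history[0])) if i not in noisy]
```

In a real training loop, cyclic_lr would set the optimizer's learning rate at every step and history would be filled with per‑sample losses logged at each epoch. Variance is computed alongside the mean because the text names both statistics; how the two are combined into a final score is left open here.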
Algorithm Performance
Extensive experiments on datasets built from noisy‑label collections show that our method outperforms several recent approaches (e.g., Influence Functions, CurriculumNet, MentorNet) in both noisy‑label detection precision and downstream model accuracy. The original article illustrates these results with loss curves under cyclic learning rates and comparative performance tables.
Application Scenario – Image Quality Service Platform (Waterdrop)
The noisy‑sample detection algorithm dramatically reduces manual labeling effort and improves the quality of image‑based services such as content‑library cover‑image moderation, multi‑object detection, and inappropriate content filtering. Deployed on the Waterdrop platform, it processes over 4 billion images weekly with >90% filtering precision, supporting Alibaba’s e‑commerce visual assets.
References
Classification in the Presence of Label Noise: A Survey
Co‑teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels
MentorNet: Learning Data‑Driven Curriculum for Very Deep Neural Networks on Corrupted Labels
Understanding Black‑box Predictions via Influence Functions
Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach
Learning from Massive Noisy Labeled Data for Image Classification
A Closer Look at Memorization in Deep Networks
Training Deep Neural Networks Using a Noise Adaptation Layer
CurriculumNet: Weakly Supervised Learning from Large‑Scale Web Images
CleanNet: Transfer Learning for Scalable Image Classifier Training With Label Noise
Alibaba Cloud Developer