How Active Learning Can Cut Labeling Costs and Boost Model Performance

This article explains active learning techniques that let models select valuable training samples, reducing annotation costs and improving performance, and describes business‑specific adaptations, experiments, and results that demonstrate its effectiveness in content‑safety applications.

Alibaba Cloud Developer

Background

In supervised machine learning, a model learns from labeled historical data so that it generalizes to unseen data. Traditionally, the training data are prepared in advance, and the model passively consumes whatever it is given.

This article introduces active learning, an approach in which the model takes part in choosing its own training samples: it ranks the samples in a candidate pool according to a selection strategy, and only the most useful ones are added to the training set.

What is Active Learning?

Active Learning (AL) is a machine-learning approach in which the learner actively looks for the most valuable samples to add to the training set. If a selected sample is unlabeled, it is sent for manual annotation before it is used for training. In short, the goal is to reach high performance with fewer training samples.
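The select, annotate, retrain cycle described above can be sketched as a small pool-based loop. This is only an illustration under stated assumptions, not the article's implementation: the 1-D threshold classifier, the `oracle` function standing in for the human annotator, and all names are hypothetical.

```python
def fit_threshold(labeled):
    """Toy 1-D classifier: midpoint between the largest negative and the
    smallest positive example. Needs at least one example of each class."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2.0


def active_learning_loop(pool, oracle, seed, rounds):
    """Pool-based active learning with one query per round.

    pool   -- candidate samples (numbers in this toy setting)
    oracle -- hypothetical stand-in for the annotator: sample -> 0 or 1
    seed   -- initially labeled samples, covering both classes
    rounds -- number of samples to send for annotation
    """
    labeled = [(x, oracle(x)) for x in seed]
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(rounds):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)  # retrain on current labels
        # The sample closest to the boundary is the one the model is least sure of.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # "manual annotation"
    return fit_threshold(labeled), len(labeled)


if __name__ == "__main__":
    pool = list(range(1001))
    true_boundary = 370  # unknown to the learner, known only to the oracle
    threshold, n_labels = active_learning_loop(
        pool, lambda x: 1 if x >= true_boundary else 0, seed=[0, 1000], rounds=10
    )
    print(threshold, n_labels)
```

In this toy setting the learner pins down the boundary between 369 and 370 with 12 labels out of 1,001 candidates, because every query lands where the current model is least certain.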

Problems Addressed

High cost of data annotation, especially in domains requiring expert knowledge.

Huge data volume makes full‑scale training impractical or time‑consuming.

Value of Active Learning

Active learning reduces both annotation cost and training-resource cost, and it can improve model performance for the same amount of training data. For example, 2 million well-chosen samples from a pool of 10 million can yield a model comparable to one trained on the full set.

Business Challenges

Improving the performance of the CRO content-safety risk-control model.

Making use of massive online feedback data, which is costly both to label and to train on.

Handling dirty data caused by inconsistent labeling standards.

Active‑Learning Algorithms Overview

Uncertainty‑Based Methods

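These methods select the samples about which the current model is most uncertain. Representative strategies are Least-Confidence (LC), which picks samples whose highest predicted class probability is lowest; Smallest-Margin (SM), which picks samples with the smallest gap between the two most probable classes; and Entropy (ENT), which picks samples whose predicted class distribution has the highest entropy. Images in the original illustrate the scoring formulas.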
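The three uncertainty scores can be written down in a few lines. The sketch below uses their standard textbook definitions (as in Settles' survey [1]); the function names and the `select_most_uncertain` helper are illustrative assumptions, not code from the article.

```python
import math


def least_confidence(probs):
    """LC score: 1 minus the top class probability (larger = more uncertain)."""
    return 1.0 - max(probs)


def smallest_margin(probs):
    """SM score: gap between the two most probable classes (smaller = more uncertain)."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def entropy(probs):
    """ENT score: Shannon entropy of the predicted distribution (larger = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs, k, strategy="lc"):
    """Return the indices of the k most uncertain samples in the pool.

    pool_probs -- one predicted class distribution per sample
    strategy   -- "lc", "sm" or "ent"
    """
    if strategy == "lc":
        score = least_confidence
    elif strategy == "sm":
        # Negate the margin so that "larger score = more uncertain" holds for all three.
        score = lambda probs: -smallest_margin(probs)
    elif strategy == "ent":
        score = entropy
    else:
        raise ValueError("unknown strategy: %r" % strategy)
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pool = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.50, 0.49, 0.01]]
    for name in ("lc", "sm", "ent"):
        print(name, select_most_uncertain(pool, 1, name))
```

The strategies do not always agree: in the demo pool, LC and ENT prefer the second sample (flat distribution), while SM prefers the third (two classes nearly tied).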

Query‑By‑Committee (QBC)

QBC trains a committee of several models with the same architecture, lets them vote on each sample, and selects the samples on which they disagree most. Labeling such samples shrinks the version space (the set of hypotheses consistent with the labeled data) and so narrows down the decision boundary. Representative works include Seung et al. 1992 [2], DAS 2019 [3], and Active-Decorate [4].
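Committee disagreement needs a concrete score. One common choice is vote entropy, described in Settles' survey [1]; the article does not say which measure it used, so the sketch below is an assumption, and the function names are illustrative.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes for one sample.

    0.0 when all members agree; largest when the votes are split evenly.
    """
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_by_committee(committee_votes, k):
    """Return the indices of the k samples with the highest vote entropy.

    committee_votes[i] holds the labels predicted for sample i, one per member.
    """
    ranked = sorted(range(len(committee_votes)),
                    key=lambda i: vote_entropy(committee_votes[i]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    votes = [
        [1, 1, 1, 1],  # unanimous: nothing to learn from a label
        [1, 0, 1, 0],  # evenly split: maximal disagreement
        [1, 1, 1, 0],  # one dissenting member
    ]
    print(select_by_committee(votes, 2))
```

Hard votes ignore how confident each member is; measures built on the members' predicted probabilities are an alternative when that information matters.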

Business‑Specific Design

Our scenario has five requirements: selecting tens of thousands of samples in a single batch, making efficient use of existing labeled data, controlling class balance, removing dirty data, and choosing samples of appropriate difficulty. To satisfy them, we extend the Least-Confidence method and assign each sample a value score based on its predicted probability and on whether the model finds it hard or classifies it wrongly. This improved method is the HW method referred to in the rest of the article.
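The article does not give the exact value formula, so the following is only one possible reading of the requirements above, not the authors' algorithm. It assumes binary labels, scores each already-labeled sample by how little probability the model gives to its existing label, drops samples the model contradicts almost completely as suspected dirty data, and fills per-class quotas. The `dirty_cutoff` and `positive_share` parameters and their defaults are hypothetical.

```python
def select_batch(samples, batch_size, dirty_cutoff=0.02, positive_share=0.5):
    """Pick a class-balanced batch of high-value labeled samples.

    samples        -- list of (sample_id, label, p_positive), with label in {0, 1}
    batch_size     -- number of samples to select
    dirty_cutoff   -- hypothetical: below this probability for the given label,
                      the label is treated as suspected noise and skipped
    positive_share -- hypothetical: fraction of the batch reserved for label 1
    """
    scored = []
    for sample_id, label, p_positive in samples:
        # Probability the model assigns to the label the sample already has.
        p_label = p_positive if label == 1 else 1.0 - p_positive
        if p_label < dirty_cutoff:
            continue  # model strongly contradicts the label: suspected dirty data
        # Least-confidence-style value: "wrong" samples (p_label < 0.5) score
        # highest, "hard" ones (just above 0.5) next, easy ones lowest.
        scored.append((1.0 - p_label, sample_id, label))
    scored.sort(reverse=True)

    n_positive = int(batch_size * positive_share)
    quota = {1: n_positive, 0: batch_size - n_positive}
    chosen = []
    for _value, sample_id, label in scored:
        if quota[label] > 0:  # enforce the class balance
            chosen.append(sample_id)
            quota[label] -= 1
    return chosen


if __name__ == "__main__":
    samples = [
        ("a", 1, 0.95), ("b", 1, 0.60), ("c", 1, 0.30), ("d", 1, 0.01),
        ("e", 0, 0.10), ("f", 0, 0.45), ("g", 0, 0.80), ("h", 0, 0.25),
    ]
    print(select_batch(samples, batch_size=4))
```

Because the whole pool is scored once and then sorted, a batch of tens of thousands costs a single pass plus a sort. The trade-off in the dirty-data cutoff is that it can also discard genuinely informative samples the model gets badly wrong.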

Experiments & Results

Small‑Dataset Validation

Starting from a training set of 300k samples, we used each active-learning strategy to select 100k samples and compared the resulting models by their true-positive rate (TPR, i.e. recall) at a false-positive rate (FPR) of 1%. The HW method and QBC outperformed the other strategies.
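For reference, TPR at a fixed FPR can be computed by sweeping the decision threshold from the highest score downward. This is a minimal sketch of the metric itself, not the evaluation code used in the experiment; the function name and signature are assumptions.

```python
def tpr_at_fpr(labels, scores, max_fpr=0.01):
    """Highest true-positive rate reachable while FPR stays <= max_fpr.

    labels -- ground-truth labels, 1 for positive and 0 for negative
    scores -- model scores, higher meaning more likely positive
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Samples with the same score fall on the same side of any threshold,
        # so they are counted together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # lowering the threshold further only adds false positives
        best_tpr = tp / n_pos
        i = j
    return best_tpr


if __name__ == "__main__":
    # 4 positives and 100 negatives; two negatives receive high scores.
    labels = [1, 1, 1, 1] + [0] * 100
    scores = [0.99, 0.97, 0.90, 0.40] + [0.95, 0.92] + [0.10] * 98
    print(tpr_at_fpr(labels, scores, max_fpr=0.01))
```

With 100 negatives, a 1% budget allows exactly one false positive, so the sweep in the demo stops before the second high-scoring negative and reports the recall reached up to that point.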

Figures show recall comparison and ROC curves.

Production Deployment

Applying the HW method to a content‑moderation model increased recall at the same false‑positive rate, outperforming baseline methods.

Conclusion

Active learning can boost model performance by selecting the most valuable training samples. Our HW active‑learning algorithm extracts tens of thousands of high‑value samples in one pass, reduces labeling cost, maintains data balance, and improves business models.

Open Questions & Outlook

Future work includes evaluating how well the dirty-data-ratio strategy generalizes, exploring techniques for learning with noisy labels, and investigating multi-round active learning for further performance gains.

References

[1] B. Settles, “Active Learning Literature Survey,” Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.
[2] H.S. Seung, M. Opper, and H. Sompolinsky, “Query by Committee,” 1992.
[3] J. Phan, M. Ruocco, and F. Scibilia, “Dual Active Sampling on Batch-Incremental Active Learning,” 2019.
[4] P. Melville and R.J. Mooney, “Diverse Ensembles for Active Learning,” 2004.
[5] S.J. Huang, R. Jin, and Z.H. Zhou, “Active Learning by Querying Informative and Representative Examples,” 2014.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
