
Active Learning: Concepts, Query Strategies, and Applications

Active Learning is a machine learning approach that reduces labeling costs by iteratively selecting the most informative samples for human annotation. Common query strategies include uncertainty sampling, query-by-committee, expected model change, and density-weighted methods, with applications in domains such as image classification, security risk control, and anomaly detection.

DataFunTalk

Active Learning Background Introduction

Machine learning includes supervised, unsupervised, semi‑supervised, and reinforcement learning. Supervised and semi‑supervised learning require labeled data, which can be costly in real‑world scenarios, prompting the need for methods that obtain high‑value annotations with lower expense.

In industrial image annotation, despite datasets like ImageNet, many specialized business scenarios still require costly manual labeling. Examples include security risk control, where malicious users are scarce, and operational monitoring, where failures are rare, leading to imbalanced samples and high labeling effort.

Academics refer to this problem as Active Learning. The process integrates human labeling into the machine-learning pipeline: difficult samples are selected for annotation, and supervised or semi-supervised models are then retrained iteratively to improve performance.

Without Active Learning, samples are typically chosen randomly or by simple heuristics, which still incurs relatively high labeling costs.

An analogy: a student uses a mistake notebook to focus on frequently wrong questions; similarly, Active Learning selects hard‑to‑classify samples for labeling to efficiently boost model accuracy.

The overall Active Learning workflow adds two steps—candidate set extraction and human annotation—to the usual steps of model training, prediction, and update.

Machine learning model: training and prediction.

Candidate set extraction: relies on a query function.

Human annotation: expert or business knowledge.

Obtaining labeled candidates: acquiring valuable samples.

Model update: incremental or full retraining with newly labeled data.

Iterating this loop enables rapid model improvement. Typical application domains include personalized spam/SMS classification and various anomaly‑detection tasks.
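The loop described above can be sketched in a few lines. The following is a minimal pool-based example, using a toy nearest-centroid classifier and least-confident sampling; all function and variable names are illustrative, and the `y_oracle` array stands in for the human annotator:

```python
import numpy as np

def fit_centroids(X, y):
    """Toy 'model': one mean vector per class (nearest-centroid classifier)."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict_proba(X, centroids):
    """Softmax over negative distances as a rough class-probability estimate."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    e = np.exp(-(d - d.min(axis=1, keepdims=True)))  # shift for stability
    return e / e.sum(axis=1, keepdims=True)

def active_learning_loop(X_pool, y_oracle, n_init=10, n_rounds=5, batch=5, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a small random labeled set
    labeled = list(rng.choice(len(X_pool), n_init, replace=False))
    for _ in range(n_rounds):
        # 1) Model training
        _, centroids = fit_centroids(X_pool[labeled], y_oracle[labeled])
        # 2) Candidate set extraction via the query function (least confident)
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        probs = predict_proba(X_pool[unlabeled], centroids)
        scores = 1.0 - probs.max(axis=1)          # higher = harder to classify
        query = unlabeled[np.argsort(scores)[-batch:]]
        # 3) Human annotation (simulated here by reading the oracle labels)
        labeled.extend(query.tolist())
        # 4) Model update happens at the top of the next iteration
    return labeled
```

Each pass through the loop grows the labeled set by `batch` samples chosen from the hardest region of the pool, rather than at random.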

Active Learning methods are categorized into two types: sequential (stream-based) Active Learning, which decides for each arriving sample whether to request a label, and pool-based (offline batch) Active Learning, which selects batches from a static pool of unlabeled data. Practitioners can choose whichever approach suits their scenario.
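In contrast to the pool-based batch setting, the stream-based variant reduces to a single per-sample decision. A minimal sketch (the function name and threshold value are illustrative assumptions):

```python
import numpy as np

def stream_query(prob_dist, threshold=0.4):
    """Stream-based selective sampling: as each sample arrives, request a
    human label only if the model's least-confident uncertainty is high."""
    uncertainty = 1.0 - float(np.max(prob_dist))
    return uncertainty > threshold  # True -> route the sample to an annotator
```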

Query Strategies (Core of Active Learning)

Common query strategies include:

Uncertainty Sampling

Query‑By‑Committee (QBC)

Expected Model Change

Expected Error Reduction

Variance Reduction

Density‑Weighted Methods

Uncertainty Sampling

Uncertainty sampling selects samples that the model finds difficult to classify. Three typical methods are:

Least Confident

Margin Sampling

Entropy

Least Confident

For binary or multi-class models, each sample receives a predicted probability distribution over classes. The sample whose most probable class has the smallest probability (i.e., the lowest confidence) is chosen for labeling:

$$x^{*}_{LC} = \underset{x}{\operatorname{argmax}} \left( 1 - P_{\theta}(\hat{y} \mid x) \right), \qquad \hat{y} = \underset{y}{\operatorname{argmax}}\, P_{\theta}(y \mid x)$$

Here $\theta$ denotes the trained model parameters, and $\hat{y}$ is the class with the highest predicted probability.

Margin Sampling

Margin sampling selects samples whose top two predicted probabilities are closest, i.e., the difference between the highest and second-highest probabilities is minimal:

$$x^{*}_{M} = \underset{x}{\operatorname{argmin}} \left( P_{\theta}(\hat{y}_{1} \mid x) - P_{\theta}(\hat{y}_{2} \mid x) \right)$$

where $\hat{y}_{1}$ and $\hat{y}_{2}$ are the first and second most probable classes under the model.

In binary classification, Least Confident and Margin Sampling are equivalent.

Entropy

Entropy measures the uncertainty of the entire probability distribution; samples with high entropy are selected:

$$x^{*}_{H} = \underset{x}{\operatorname{argmax}} \left( -\sum_{i} P_{\theta}(y_{i} \mid x) \log P_{\theta}(y_{i} \mid x) \right)$$

Compared with Least Confident and Margin Sampling, entropy considers all class probabilities rather than only the top one or two.
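The three measures are easy to compare side by side. A small numpy sketch (function names are my own, and the probability rows are hypothetical values):

```python
import numpy as np

def least_confident(probs):
    """1 - max probability; larger = more uncertain."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Gap between the top two probabilities; SMALLER = more uncertain."""
    s = np.sort(probs, axis=1)
    return s[:, -1] - s[:, -2]

def entropy(probs):
    """Shannon entropy of the full distribution; larger = more uncertain."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Three hypothetical predicted distributions over three classes
P = np.array([[0.90, 0.09, 0.01],   # confident
              [0.40, 0.35, 0.25],   # top two nearly tied
              [0.34, 0.33, 0.33]])  # almost uniform
```

All three measures agree that the near-uniform third row is the hardest sample, but they can rank samples differently when only the tail probabilities differ, since entropy is the only one that uses the whole distribution.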

Query‑By‑Committee (QBC)

QBC extends uncertainty sampling to multiple models. A committee of models, each trained on the same labeled data, votes on unlabeled samples; disagreement among the members guides candidate selection.

Vote Entropy: selects samples where committee votes are most uncertain.

Average Kullback‑Leibler (KL) Divergence: selects samples with large average KL divergence among model predictions.

Vote Entropy

Entropy over committee votes quantifies disagreement:

$$x^{*}_{VE} = \underset{x}{\operatorname{argmax}} \left( -\sum_{i} \frac{V(y_{i})}{C} \log \frac{V(y_{i})}{C} \right)$$

where $V(y_{i})$ is the number of committee members voting for class $y_{i}$ and $C$ is the committee size.

Average KL Divergence

KL divergence measures the distance between two probability distributions; averaging it across the committee highlights samples on which the members' predictions diverge:

$$x^{*}_{KL} = \underset{x}{\operatorname{argmax}}\; \frac{1}{C} \sum_{c=1}^{C} D_{\mathrm{KL}}\!\left( P_{\theta^{(c)}} \,\Big\|\, P_{\mathcal{C}} \right)$$

where $P_{\mathcal{C}}(y_{i} \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y_{i} \mid x)$ is the consensus distribution of the committee.
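Both disagreement measures are straightforward to compute from the committee's predictions. A sketch under assumed array shapes (hard votes as `(n_models, n_samples)`, soft predictions as `(n_models, n_samples, n_classes)`):

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Entropy of the committee's hard-vote distribution for each sample."""
    C = votes.shape[0]
    counts = np.stack([(votes == k).sum(axis=0) for k in range(n_classes)], axis=1)
    p = counts / C
    return -np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)

def avg_kl_divergence(probs):
    """Average KL divergence of each member from the consensus distribution."""
    consensus = probs.mean(axis=0)                      # P_C(y_i | x)
    kl = (probs * np.log((probs + 1e-12) / (consensus + 1e-12))).sum(axis=2)
    return kl.mean(axis=0)                              # average over the committee
```

A sample on which the committee splits its votes, or spreads its soft predictions, scores higher under both measures than one the members agree on.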

Expected Model Change

Selects the samples that, once labeled and added to the training set, would be expected to change the model the most, typically measured by the length of the training gradient they would induce.
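For models trained by gradient descent, the "expected gradient length" variant of this idea can be sketched for binary logistic regression. Since the true label is unknown, each possible label's gradient norm is weighted by its predicted probability (illustrative code, not from the original article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_gradient_length(w, X):
    """Expected norm of the log-loss gradient if each sample were labeled."""
    p1 = sigmoid(X @ w)                         # P(y = 1 | x)
    # gradient of the log-loss wrt w for label y is (sigmoid(w.x) - y) * x
    g0 = np.abs(p1 - 0.0)[:, None] * X          # gradient if the label were 0
    g1 = np.abs(p1 - 1.0)[:, None] * X          # gradient if the label were 1
    return (1 - p1) * np.linalg.norm(g0, axis=1) + p1 * np.linalg.norm(g1, axis=1)
```

A sample near the decision boundary, where either label would pull the weights noticeably, scores higher than one the model already classifies confidently.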

Expected Error Reduction

Chooses the samples expected to most reduce the model's future error once they are labeled and incorporated into training.

Variance Reduction

Selects samples that would most decrease the model’s variance.

Density‑Weighted Methods

Incorporates sample density to avoid selecting outliers; samples that are both uncertain and located in dense regions of the input space are preferred:

$$x^{*}_{ID} = \underset{x}{\operatorname{argmax}}\; \phi_{A}(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}(x, x^{(u)}) \right)^{\beta}$$

Here $\phi_{A}(x)$ denotes the informativeness of $x$ under a base query strategy (e.g., an uncertainty-sampling score), $\beta$ is an exponent controlling the weight of the density term, and $U$ is the size of the unlabeled pool. The weighting favors samples that are similar to many others in the distribution.
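A minimal sketch of the weighting, using cosine similarity as the (assumed) similarity function:

```python
import numpy as np

def information_density(X_pool, base_utility, beta=1.0):
    """Multiply a base informativeness score by each sample's average
    similarity to the rest of the unlabeled pool, raised to the power beta."""
    Xn = X_pool / (np.linalg.norm(X_pool, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T                    # pairwise cosine similarities
    avg_sim = sim.mean(axis=1)         # how 'dense' each sample's region is
    return base_utility * avg_sim ** beta
```

With equal base utility, an outlier far from the rest of the pool is down-weighted and no longer selected first.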

(Figure omitted: points near region B contain more information than points near region A.)

Summary

Active Learning focuses on selecting informative samples for human labeling using various query strategies—either based on a single model or a committee of models—to reduce labeling costs and rapidly improve model performance across many fields such as image recognition, natural language processing, security risk control, and time‑series anomaly detection.

Original link: https://zhuanlan.zhihu.com/p/239756522


Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
