Active Learning: Concepts, Query Strategies, and Applications
Active Learning is a machine learning approach that reduces labeling costs by iteratively selecting the most informative samples for human annotation. Common query strategies include uncertainty sampling, query-by-committee, expected model change, and density-weighted methods, and the approach applies to domains such as image classification, security risk control, and anomaly detection.
Background on Active Learning
Machine learning includes supervised, unsupervised, semi‑supervised, and reinforcement learning. Supervised and semi‑supervised learning require labeled data, which can be costly in real‑world scenarios, prompting the need for methods that obtain high‑value annotations with lower expense.
In industrial image annotation, despite datasets like ImageNet, many specialized business scenarios still require costly manual labeling. Examples include security risk control, where malicious users are scarce, and operational monitoring, where failures are rare, leading to imbalanced samples and high labeling effort.
Academics refer to this problem as Active Learning. The process integrates human labeling into the machine-learning pipeline: difficult samples are selected for annotation, and supervised or semi-supervised models are then retrained iteratively to improve performance.
Without Active Learning, samples are typically chosen randomly or by simple heuristics, which still incurs relatively high labeling costs.
An analogy: a student uses a mistake notebook to focus on frequently wrong questions; similarly, Active Learning selects hard‑to‑classify samples for labeling to efficiently boost model accuracy.
The overall Active Learning workflow adds two steps, candidate set extraction and human annotation, to the usual steps of model training, prediction, and update:
Machine learning model: training and prediction.
Candidate set extraction: relies on a query function.
Human annotation: expert or business knowledge.
Obtaining labeled candidates: acquiring valuable samples.
Model update: incremental or full retraining with newly labeled data.
Iterating this loop enables rapid model improvement. Typical application domains include personalized spam/SMS classification and various anomaly‑detection tasks.
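The loop above can be sketched in a few lines of numpy. This is a minimal illustration, not a production recipe: the "model" (class-conditional means with a softmax over distances), the two-blob data, and the query budget are all made up for the example, and the least-confident rule previews the uncertainty-sampling strategy described below.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y):
    """Toy 'model': class-conditional means; a stand-in for any classifier."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_proba(model, X):
    """Softmax over negative distances to each class mean."""
    d = np.stack([-np.linalg.norm(X - mu, axis=1) for mu in model.values()], axis=1)
    e = np.exp(d - d.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# hypothetical data: two Gaussian blobs, a tiny labeled seed set, a large pool
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y_true = np.array([0] * 200 + [1] * 200)
labeled = list(rng.choice(len(X), 4, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):                          # five query rounds
    model = train(X[labeled], y_true[labeled])           # model update
    proba = predict_proba(model, X[pool])                # prediction
    query = pool[int(np.argmin(proba.max(axis=1)))]      # candidate extraction
    labeled.append(query)                   # "human annotation" via the oracle
    pool.remove(query)

print(len(labeled))  # 4 seed samples + 5 queried samples
```

Swapping the `argmin` line for any of the query strategies below changes which candidate is sent to the annotator; the rest of the loop stays the same.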
Active Learning models are categorized into two types: sequential (stream-based) Active Learning, which examines one incoming sample at a time and decides whether to query its label, and pool-based (offline batch) Active Learning, which scores a static pool of unlabeled data and queries the best candidates in batches. Practitioners can choose whichever variant suits their scenario.
Query Strategies (Core of Active Learning)
Common query strategies include:
Uncertainty Sampling
Query‑By‑Committee (QBC)
Expected Model Change
Expected Error Reduction
Variance Reduction
Density‑Weighted Methods
Uncertainty Sampling
Uncertainty sampling selects samples that the model finds difficult to classify. Three typical methods are:
Least Confident
Margin Sampling
Entropy
Least Confident
For binary or multi‑class models, each sample receives a probability distribution over classes. The sample with the smallest maximum probability (i.e., the lowest confidence) is chosen for labeling. Formula:

$$x^{*}_{LC} = \underset{x}{\arg\max}\ \left(1 - P_\theta(\hat{y} \mid x)\right)$$

Here $\theta$ denotes the trained model parameters, and $\hat{y} = \arg\max_y P_\theta(y \mid x)$ is the class with the highest predicted probability.
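As a quick sketch, least confident reduces to one `argmin` over the per-sample maximum probability. The probability matrix below is hypothetical:

```python
import numpy as np

def least_confident(proba):
    """Index of the sample whose top predicted probability is lowest.

    proba: (n_samples, n_classes) array of P_theta(y | x) per sample.
    """
    return int(np.argmin(proba.max(axis=1)))

# hypothetical predictions for three unlabeled samples
proba = np.array([[0.9, 0.1],
                  [0.6, 0.4],   # least confident: its best guess is only 0.6
                  [0.8, 0.2]])
print(least_confident(proba))  # → 1
```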
Margin Sampling
Margin sampling selects samples whose top two predicted probabilities are closest, i.e., the difference between the highest and second‑highest probabilities is minimal. Formula:

$$x^{*}_{M} = \underset{x}{\arg\min}\ \left(P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)\right)$$

where $\hat{y}_1$ and $\hat{y}_2$ are the most and second‑most probable classes under the model.
In binary classification, Least Confident and Margin Sampling are equivalent.
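A minimal sketch of margin sampling, again with made-up probabilities; sorting each row makes the top-two gap easy to read off:

```python
import numpy as np

def margin_query(proba):
    """Index of the sample with the smallest gap between its two top classes."""
    srt = np.sort(proba, axis=1)         # ascending per row
    margins = srt[:, -1] - srt[:, -2]    # best minus second best
    return int(np.argmin(margins))

proba = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25],    # margin 0.05: hardest to separate
                  [0.50, 0.30, 0.20]])
print(margin_query(proba))  # → 1
```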
Entropy
Entropy measures the uncertainty of the entire probability distribution; samples with high entropy are selected. Formula:

$$x^{*}_{H} = \underset{x}{\arg\max}\ \left(-\sum_{y} P_\theta(y \mid x)\,\log P_\theta(y \mid x)\right)$$
Compared with Least Confident and Margin Sampling, entropy considers all class probabilities rather than only the top one or two.
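The difference shows up directly in code: the entropy score sums over every class probability, not just the top one or two. The example distributions are hypothetical:

```python
import numpy as np

def entropy_query(proba, eps=1e-12):
    """Index of the sample whose full predictive distribution has maximal entropy.

    eps guards against log(0) for hard zero probabilities.
    """
    H = -(proba * np.log(proba + eps)).sum(axis=1)
    return int(np.argmax(H))

proba = np.array([[0.8, 0.1, 0.1],
                  [0.4, 0.3, 0.3],   # flattest distribution → highest entropy
                  [0.6, 0.3, 0.1]])
print(entropy_query(proba))  # → 1
```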
Query‑By‑Committee (QBC)
QBC extends uncertainty sampling from a single model to several. A committee of models, each trained on the labeled data, votes on the unlabeled samples, and the degree of disagreement among the members guides candidate selection.
Vote Entropy: selects samples where committee votes are most uncertain.
Average Kullback‑Leibler (KL) Divergence: selects samples with large average KL divergence among model predictions.
Vote Entropy
Entropy over committee votes quantifies disagreement. Formula:

$$x^{*}_{VE} = \underset{x}{\arg\max}\ \left(-\sum_{y} \frac{V(y)}{C}\,\log \frac{V(y)}{C}\right)$$

where $V(y)$ is the number of committee members voting for class $y$ and $C$ is the committee size.
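A small sketch with a hypothetical committee of four models and hard votes; the vote histogram per sample is turned into an entropy score:

```python
import numpy as np

def vote_entropy_query(votes, n_classes):
    """Index of the sample whose committee vote histogram has maximal entropy.

    votes: (n_models, n_samples) array of hard class predictions.
    """
    C = votes.shape[0]
    H = []
    for col in votes.T:                                   # one sample at a time
        freq = np.bincount(col, minlength=n_classes) / C  # vote fractions V(y)/C
        nz = freq[freq > 0]                               # skip empty classes
        H.append(-(nz * np.log(nz)).sum())
    return int(np.argmax(H))

# hypothetical committee of 4 models voting on 3 samples
votes = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [0, 1, 1],
                  [0, 0, 1]])
print(vote_entropy_query(votes, n_classes=2))  # → 1: votes split 2-2
```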
Average KL Divergence
KL divergence measures the distance between two probability distributions; averaging it across the committee highlights samples whose members disagree most with the consensus. Formula:

$$x^{*}_{KL} = \underset{x}{\arg\max}\ \frac{1}{C}\sum_{c=1}^{C} D_{KL}\!\left(P_{\theta^{(c)}} \,\Big\|\, P_{\mathcal{C}}\right)$$

where $P_{\mathcal{C}}(y \mid x) = \frac{1}{C}\sum_{c=1}^{C} P_{\theta^{(c)}}(y \mid x)$ is the consensus distribution of the committee.
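The same idea with soft votes, as a sketch: each member's full distribution is compared against the committee average. The two-member committee and its probabilities are invented for illustration:

```python
import numpy as np

def avg_kl_query(proba, eps=1e-12):
    """Index of the sample with the largest mean KL divergence from the consensus.

    proba: (n_models, n_samples, n_classes) committee predictions.
    """
    consensus = proba.mean(axis=0)                                 # P_C(y | x)
    kl = (proba * np.log((proba + eps) / (consensus + eps))).sum(axis=2)
    return int(np.argmax(kl.mean(axis=0)))                         # avg over members

# two hypothetical committee members, three samples, two classes
proba = np.array([[[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]],
                  [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]])
print(avg_kl_query(proba))  # → 1: the members disagree most there
```

Note that samples 0 and 2 get zero divergence: the members agree exactly, even though sample 2 is individually very uncertain.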
Expected Model Change
Selects samples that would cause the largest change in the model’s gradient when added to the training set.
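For binary logistic regression this "expected gradient length" has a simple closed form, which makes a compact sketch possible. Since the log-loss gradient at $(x, y)$ is $(p - y)x$ with $p = P(y{=}1 \mid x)$, the expectation of its norm over $y \sim P(y \mid x)$ works out to $2p(1-p)\lVert x \rVert$. The candidate features and probabilities below are made up:

```python
import numpy as np

def expected_gradient_length(X, p):
    """Expected log-loss gradient norm per candidate, binary logistic regression.

    Gradient at (x, y) is (p - y) * x, so E_y[||grad||] =
    p*(1-p)*||x|| + (1-p)*p*||x|| = 2*p*(1-p)*||x||.
    X: (n, d) candidate features; p: (n,) predicted P(y=1 | x).
    """
    return 2 * p * (1 - p) * np.linalg.norm(X, axis=1)

X = np.array([[1.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
p = np.array([0.5, 0.9, 0.5])
egl = expected_gradient_length(X, p)
print(int(np.argmax(egl)))  # → 2: both uncertain and large-norm
```

The example also shows a known caveat of this strategy: it favors inputs with large feature norms, not only uncertain ones.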
Expected Error Reduction
Chooses samples that would most reduce the loss function after being labeled and incorporated.
Variance Reduction
Selects samples that would most decrease the model’s variance.
Density‑Weighted Methods
Incorporates sample density to avoid selecting outliers; dense, uncertain samples are preferred. Formula:

$$x^{*}_{ID} = \underset{x}{\arg\max}\ \phi_A(x) \times \left(\frac{1}{U}\sum_{u=1}^{U} \mathrm{sim}\!\left(x, x^{(u)}\right)\right)^{\beta}$$

Here $\phi_A(x)$ denotes the informativeness of $x$ under a base query strategy (e.g., uncertainty sampling), $\beta$ is an exponent controlling the weight of the density term, $U$ is the size of the unlabeled pool, and $\mathrm{sim}$ is a similarity function. The weighting favors uncertain samples that are representative of dense regions rather than isolated outliers.
(The original post includes a figure here.) Its point: samples near region B contain more information than samples near region A.
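A sketch of the density-weighted score, assuming entropy as the base strategy $\phi_A$ and cosine similarity as $\mathrm{sim}$; the pool and probabilities are invented so that the most uncertain point is an outlier:

```python
import numpy as np

def density_weighted(proba, X, beta=1.0, eps=1e-12):
    """Information-density query: base uncertainty times average similarity^beta.

    proba: (n, k) predicted class probabilities for the unlabeled pool X: (n, d).
    """
    phi = -(proba * np.log(proba + eps)).sum(axis=1)       # entropy as phi_A(x)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    sim = Xn @ Xn.T                                        # pairwise cosine sim
    density = sim.mean(axis=1) ** beta                     # avg similarity to pool
    return int(np.argmax(phi * density))

# hypothetical pool: three clustered points and one outlier (index 3)
X = np.array([[1.0, 0.0], [1.0, 0.1], [0.9, 0.1], [-1.0, 5.0]])
proba = np.array([[0.6, 0.4],   # uncertain AND inside the dense cluster
                  [0.9, 0.1],
                  [0.9, 0.1],
                  [0.5, 0.5]])  # most uncertain, but an outlier
print(density_weighted(proba, X))  # → 0
```

Plain entropy sampling would pick the outlier (index 3); the density term steers the query toward the uncertain point inside the cluster instead.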
Summary
Active Learning focuses on selecting informative samples for human labeling using various query strategies—either based on a single model or a committee of models—to reduce labeling costs and rapidly improve model performance across many fields such as image recognition, natural language processing, security risk control, and time‑series anomaly detection.
Original link: https://zhuanlan.zhihu.com/p/239756522
References
Settles, Burr. "Active learning literature survey." University of Wisconsin‑Madison, 2009.
Aggarwal, Charu C., et al. "Active learning: A survey." Data Classification: Algorithms and Applications, CRC Press, 2014, pp. 571‑605.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.