Training Neural Networks with Minimal Labeled Data Using Active Learning
This article explains how active learning can dramatically reduce the amount of labeled data required for training deep neural networks by selecting the most informative and representative samples, and provides a complete Python implementation of a hybrid query strategy (DBAL) with ResNet‑18.
Introduction
In supervised deep learning, obtaining large amounts of labeled data is often costly and time‑consuming. Active learning addresses this problem by selecting only the most valuable samples for annotation, thereby minimizing the total labeling effort while maintaining model performance.
What Is Active Learning?
Active learning (also called deep active learning when applied to modern deep models) aims to achieve the highest possible performance with as few labeled examples as possible. The process typically starts with a large pool of unlabeled data U and a small labeled set L₀. An initial model is trained on L₀ (or a pretrained model is used), and the model’s predictions on U then guide the selection of new samples for labeling.
Active Learning Pipeline
The pipeline consists of three main steps (a minimal loop sketch follows the list):
Feature extraction: feed each unlabeled sample into the current model to obtain either class probabilities or a feature vector.
Query strategy: rank the unlabeled samples according to a chosen criterion and select the top k samples.
Oracle labeling: send the selected samples to a human annotator (the “oracle”) and add the newly labeled data to the training set for the next iteration.
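The loop below is a minimal sketch of this pipeline. The functions train, score_informativeness, and oracle_label are hypothetical placeholders standing in for your own model training, query strategy, and annotation interface.

# Minimal sketch of the active-learning loop; train(), score_informativeness(),
# and oracle_label() are hypothetical placeholders.
import numpy as np

def active_learning_loop(L, U, k=20, rounds=5):
    model = train(L)                                  # initial model on the seed set L0
    for _ in range(rounds):
        scores = score_informativeness(model, U)      # e.g. margin or entropy per sample
        query_idx = set(np.argsort(scores)[-k:])      # top-k most informative samples
        L.extend(oracle_label([U[i] for i in query_idx]))  # oracle labels the queries
        U = [u for i, u in enumerate(U) if i not in query_idx]  # shrink the pool
        model = train(L)                              # retrain for the next round
    return model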
Informativeness‑Based Query Strategies
These strategies select the samples the model is most uncertain about; a sketch of all three criteria appears after the example below.
Lowest‑confidence method: compute the maximum class probability P(y*|x) for each sample and rank samples by increasing confidence.
Maximum‑entropy sampling: calculate the entropy H(x) = -∑ₖ P(yₖ|x) log P(yₖ|x); higher entropy indicates higher uncertainty.
Margin sampling: compute the difference between the two most probable classes, M(x) = 1 - (P(y*|x) - P(y**|x)), where y* and y** denote the most and second‑most probable labels. Larger margin values correspond to more uncertain samples.
An illustrative example shows two cat images, one of which wears glasses; all three criteria would select the second image because the classifier’s probability distribution is most ambiguous for that sample.
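The snippet below sketches all three criteria with NumPy, assuming probs is an (N, C) array of softmax outputs for N unlabeled samples.

# Sketch of the three uncertainty criteria; `probs` is assumed to be an
# (N, C) NumPy array of softmax probabilities for N unlabeled samples.
import numpy as np

def least_confidence(probs):
    return 1.0 - probs.max(axis=1)                 # low top probability = uncertain

def entropy(probs, eps=1e-12):
    return -(probs * np.log(probs + eps)).sum(axis=1)   # H(x) per sample

def margin(probs):
    top2 = np.sort(probs, axis=1)[:, -2:]          # two largest class probabilities
    return 1.0 - (top2[:, 1] - top2[:, 0])         # M(x); larger = more uncertain

# Under any criterion, query the k highest-scoring samples:
# query_idx = np.argsort(margin(probs))[-k:]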
Representativeness‑Based Query Strategies
These strategies aim to choose samples that best represent the entire dataset, avoiding redundancy.
K‑means clustering: cluster the feature space into k groups and select a few samples from each cluster.
Core‑set (k‑center) selection: pick a subset that minimizes the maximum distance from any data point to its nearest selected sample; a greedy sketch follows this list.
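A common greedy 2‑approximation for the k‑center problem repeatedly adds the point farthest from the current centers. The sketch below assumes X is an (N, d) feature matrix extracted by the current model.

# Greedy farthest-first traversal for the k-center (core-set) problem,
# assuming `X` is an (N, d) feature matrix from the current model.
import numpy as np

def greedy_k_center(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]                # arbitrary first center
    dist = np.linalg.norm(X - X[centers[0]], axis=1)     # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                       # farthest point so far
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers   # indices of the selected representative samples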
Hybrid Strategy – DBAL (Diverse Mini‑batch Active Learning)
DBAL combines uncertainty filtering with weighted K‑means clustering. The algorithm proceeds as follows:
Compute an uncertainty score (margin) for every unlabeled sample.
Pre‑filter: keep only the β·k samples with the highest scores.
Run weighted K‑means on the filtered set, using the uncertainty scores as sample weights.
Select the sample closest to each cluster centroid as the final query set.
Reference: https://arxiv.org/abs/2009.00236
Code Implementation (Python 3.9, ResNet‑18)
Below is a self‑contained implementation of DBAL. It imports the necessary libraries, defines the DBAL class, and provides a query method that returns the selected images.
#!/usr/bin/python3
import os
import torch
import numpy as np
from torchvision import transforms
from torchvision.models import resnet18, ResNet18_Weights
from PIL import Image
from sklearn.cluster import KMeans
class DBAL:
    """Active learning query strategy based on DBAL (Diverse Mini-batch Active Learning)."""

    def __init__(self, k: int = 20, beta: int = 10, image_size: int = 224,
                 use_weighted_kmeans: bool = True):
        self.k = int(k)                  # number of queried samples
        self.beta = beta                 # pre-filter factor
        self.image_size = image_size
        self.use_weighted_kmeans = use_weighted_kmeans
        # load pretrained ResNet-18 for classification
        self.resnet = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.resnet.eval()
        # load pretrained ResNet-18 as feature extractor (replace final FC with identity)
        self.resnet_features = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.resnet_features.fc = torch.nn.Identity()
        self.resnet_features.eval()
        # image preprocessing
        self.transform = transforms.Compose([
            transforms.Resize(self.image_size),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])

    def query(self, image_path: str) -> list[Image.Image]:
        """Run the DBAL query strategy and return the selected images from image_path."""
        files = os.listdir(image_path)
        margins = []
        X = np.zeros((len(files), 512), dtype=np.float32)  # ResNet-18 feature matrix
        for i, file in enumerate(files):
            # iterate through the unlabeled pool
            img = Image.open(os.path.join(image_path, file)).convert("RGB")
            img = torch.unsqueeze(self.transform(img), 0)
            with torch.no_grad():  # inference only, no gradients needed
                predictions = self.resnet(img)        # class logits
                features = self.resnet_features(img)  # 512-d feature vector
            X[i, :] = features.numpy()
            probabilities = torch.nn.functional.softmax(predictions, dim=1)[0]
            P = sorted(probabilities.tolist())
            # margin score M(x) = 1 - (top prob - second-top prob); larger = more uncertain
            margins.append(1 - (P[-1] - P[-2]))
        # ---- Pre-filter to the top beta*k most informative examples ----
        n_keep = min(self.k * self.beta, len(files))    # guard against small pools
        keep_idx = np.argsort(margins)[-n_keep:][::-1]  # most uncertain first
        X = X[keep_idx, :]
        margins = [margins[i] for i in keep_idx]
        files = [files[i] for i in keep_idx]
        # ---- Weighted K-means clustering (uncertainty scores as sample weights) ----
        sample_weight = margins if self.use_weighted_kmeans else None
        kmeans = KMeans(n_clusters=self.k, random_state=0, n_init=1,
                        tol=1e-4).fit(X, sample_weight=sample_weight)
        distances = kmeans.transform(X)  # distance of every sample to every centroid
        # ---- Select one sample closest to each centroid ----
        images = []
        for column in range(self.k):
            idx = np.argmin(distances[:, column])
            im = Image.open(os.path.join(image_path, files[idx]))
            im.filename = str(files[idx])  # tag the image with its file name for the caller
            images.append(im)
        return images

Experimental Results
The DBAL implementation was run on the Oxford‑IIIT Pet dataset using a pretrained ResNet‑18. With k=20 and beta=10, the unlabeled pool was first reduced to 200 images, then 20 images were selected for expert labeling. The margin scores of the selected images ranged from 0.97 to 0.99, indicating high uncertainty, and the chosen samples covered diverse cat and dog breeds.
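For reference, a hypothetical invocation of the DBAL class looks like the following; the ./unlabeled_pool directory is an assumption standing in for your own pool of unlabeled images.

# Hypothetical invocation of the DBAL class above; "./unlabeled_pool" is an
# assumed directory of unlabeled images, not part of the original article.
if __name__ == "__main__":
    strategy = DBAL(k=20, beta=10)
    selected = strategy.query("./unlabeled_pool")
    for im in selected:
        print(im.filename)   # file name of each queried image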
Conclusion
Active learning reduces the labeling burden by focusing on the most informative or representative samples. Query strategies based on informativeness (uncertainty), representativeness (clustering), or a hybrid of both (e.g., DBAL) can be implemented straightforwardly in Python using pretrained deep models and standard clustering libraries.