Training Neural Networks with Minimal Labeled Data Using Active Learning

This article explains how active learning can dramatically reduce the amount of labeled data required for training deep neural networks by selecting the most informative and representative samples, and provides a complete Python implementation of a hybrid query strategy (DBAL) with ResNet‑18.


Introduction

In supervised deep learning, obtaining large amounts of labeled data is often costly and time‑consuming. Active learning addresses this problem by selecting only the most valuable samples for annotation, thereby minimizing the total labeling effort while maintaining model performance.

What Is Active Learning?

Active learning (also called deep active learning when applied to modern deep models) aims to achieve the highest possible performance with as few labeled examples as possible. The process typically starts with a large pool of unlabeled data U and a small labeled set L₀. An initial model is trained on L₀ (or a pretrained model), and then the model’s predictions on U guide the selection of new samples for labeling.

Active Learning Pipeline

The pipeline consists of three main steps:

Feature extraction: feed each unlabeled sample into the current model to obtain either class probabilities or a feature vector.

Query strategy: rank the unlabeled samples according to a chosen criterion and select the top k samples.

Oracle labeling: send the selected samples to a human annotator (the “oracle”) and add the newly labeled data to the training set for the next iteration.
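The loop formed by these three steps can be sketched end to end with scikit-learn on synthetic data. The model, pool size, and batch size below are illustrative choices, with lowest-confidence sampling standing in for the query strategy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# small labeled seed L0 (5 samples per class); the rest is the unlabeled pool U
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for _ in range(5):                                 # five query rounds
    model.fit(X[labeled], y[labeled])              # retrain on current labels
    probs = model.predict_proba(X[pool])           # step 1: model outputs on U
    uncertainty = 1 - probs.max(axis=1)            # step 2: rank by confidence
    query = np.argsort(uncertainty)[-10:]          # top k=10 most uncertain
    labeled += [pool[i] for i in query]            # step 3: "oracle" labels them
    pool = [p for j, p in enumerate(pool) if j not in set(query)]

print(len(labeled), len(pool))  # 60 labeled, 440 still unlabeled
```

Each round moves the ten least-confident samples from the pool into the training set, so labeling effort concentrates where the model is weakest.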

Informativeness‑Based Query Strategies

These strategies select samples that the model is most uncertain about.

Lowest‑confidence sampling: compute the maximum class probability P(y*|x) for each sample and rank samples by increasing confidence, so those with the lowest top probability are queried first.

Maximum‑entropy sampling: calculate the entropy H(x) = -∑ₖ P(yₖ|x) log P(yₖ|x); higher entropy indicates higher uncertainty.

Margin sampling: compute M(x) = 1 - (P(y*|x) - P(y**|x)), where y* and y** are the two most probable classes. Larger values of M(x), i.e. smaller gaps between the top two probabilities, correspond to more uncertain samples.

An illustrative example compares two cat images, one of which wears glasses. All three criteria would select the image with glasses, because the classifier's probability distribution is most ambiguous for that sample.
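All three scores can be computed directly from a matrix of predicted class probabilities; the toy probability rows below are made up for illustration:

```python
import numpy as np

# predicted class probabilities for three samples (rows sum to 1)
P = np.array([
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # ambiguous prediction
    [0.70, 0.20, 0.10],
])

least_confidence = 1 - P.max(axis=1)              # 1 - P(y*|x)
entropy = -(P * np.log(P)).sum(axis=1)            # H(x)
top2 = np.sort(P, axis=1)[:, -2:]                 # two largest probabilities
margin = 1 - (top2[:, 1] - top2[:, 0])            # M(x)

# all three criteria rank the ambiguous sample (row 1) as most uncertain
print(least_confidence.argmax(), entropy.argmax(), margin.argmax())  # 1 1 1
```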

Representativeness‑Based Query Strategies

These strategies aim to choose samples that best represent the entire dataset, avoiding redundancy.

K‑means clustering: cluster the feature space into k groups and select a few samples from each cluster.

Core‑set (k‑center) selection: solve the k‑center problem to pick a subset of samples whose maximum distance to any point in the dataset is minimized.
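As a sketch of the core-set idea, the standard greedy 2-approximation to the k-center problem fits in a few lines of NumPy; the random features below stand in for real model embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))               # stand-in for deep feature vectors

def kcenter_greedy(X, k):
    """Greedy 2-approximation: repeatedly add the point farthest from all centers."""
    centers = [0]                            # start from an arbitrary point
    dist = np.linalg.norm(X - X[0], axis=1)  # distance to the nearest chosen center
    for _ in range(k - 1):
        nxt = int(dist.argmax())             # farthest point becomes a new center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers

picked = kcenter_greedy(X, 10)
print(len(picked))  # 10 core-set samples spread across the feature space
```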

Hybrid Strategy – DBAL (Diverse Mini‑batch Active Learning)

DBAL combines uncertainty filtering with weighted K‑means clustering. The algorithm proceeds as follows:

Compute an uncertainty score (margin) for every unlabeled sample.

Pre‑filter the top β·k samples with the highest scores.

Run weighted K‑means on the filtered set, using the uncertainty scores as sample weights.

Select the sample closest to each cluster centroid as the final query set.

Reference: https://arxiv.org/abs/2009.00236
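Steps 3 and 4 can be sketched with scikit-learn alone, since `KMeans.fit` accepts per-sample weights; the feature matrix and margin scores below are random placeholders for the real pre-filtered pool:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 8))          # features of the β·k filtered samples
scores = rng.uniform(0.5, 1.0, size=200)   # their margin (uncertainty) scores

km = KMeans(n_clusters=20, n_init=10, random_state=0)
km.fit(feats, sample_weight=scores)        # step 3: uncertainty-weighted K-means
dists = km.transform(feats)                # distance of every sample to every centroid
query = dists.argmin(axis=0)               # step 4: nearest sample per centroid
print(query.shape)                         # one queried index per cluster
```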

Code Implementation (Python 3.9, ResNet‑18)

Below is a self‑contained implementation of DBAL. It imports the necessary libraries, defines the DBAL class, and provides a query method that returns the selected images.

#!/usr/bin/python3
import os
import torch
import numpy as np
from torchvision import transforms
from torchvision.models import resnet18, ResNet18_Weights
from PIL import Image
from sklearn.cluster import KMeans

class DBAL:
    """Active learning query strategy based on DBAL"""
    def __init__(self, k: int = 20, beta: int = 10, image_size: int = 224, use_weighted_kmeans: bool = True):
        self.k = int(k)          # number of queried samples
        self.beta = beta         # pre‑filter factor
        self.image_size = image_size
        self.use_weighted_kmeans = use_weighted_kmeans
        # load pretrained ResNet‑18 for classification
        self.resnet = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.resnet.eval()
        # load pretrained ResNet‑18 as feature extractor (remove final FC)
        self.resnet_features = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.resnet_features.fc = torch.nn.Identity()
        self.resnet_features.eval()
        # image preprocessing
        self.transform = transforms.Compose([
            transforms.Resize(self.image_size),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
        ])

    def query(self, image_path: str) -> list[Image.Image]:
        """Run the DBAL query strategy. Returns the selected PIL images from image_path."""
        files = os.listdir(image_path)
        margins = []
        X = np.zeros((len(files), 512), dtype=np.float32)  # feature matrix
        for i, file in enumerate(files):
            # iterate through the unlabeled pool
            img = Image.open(os.path.join(image_path, file)).convert("RGB")
            img = self.transform(img)
            img = torch.unsqueeze(img, 0)
            with torch.no_grad():
                predictions = self.resnet(img)        # class logits
                features = self.resnet_features(img)  # 512-dim feature vector
            X[i, :] = features.numpy()
            probabilities = torch.nn.functional.softmax(predictions, dim=1)[0]
            P = sorted(probabilities.tolist())
            # margin = 1 - (top‑prob – second‑top‑prob)
            margins.append(1 - (P[-1] - P[-2]))
        # ---- Pre‑filter to the top β·k most uncertain examples ----
        n_keep = min(self.k * self.beta, len(files))
        keep_idx = np.argsort(margins)[-n_keep:]
        X = X[keep_idx, :]
        margins = [margins[i] for i in keep_idx]
        files = [files[i] for i in keep_idx]
        # ---- Weighted K‑means clustering ----
        kmeans = KMeans(n_clusters=self.k, random_state=0, n_init=1,
                        tol=1e-4).fit(X, sample_weight=margins)
        distances = kmeans.transform(X)
        # ---- Select one sample closest to each centroid ----
        images = []
        for column in range(self.k):
            idx = np.argmin(distances[:, column])
            im = Image.open(os.path.join(image_path, files[idx]))
            im.filename = str(files[idx])
            images.append(im)
        return images

Experimental Results

The DBAL implementation was run on the Oxford‑IIIT Pet dataset using a pretrained ResNet‑18. With k=20 and beta=10, the unlabeled pool was first reduced to 200 images, then 20 images were selected for expert labeling. The margin scores of the selected images ranged from 0.97 to 0.99, indicating high uncertainty, and the chosen samples covered diverse cat and dog breeds.

Conclusion

Active learning reduces the labeling burden by focusing on the most informative or representative samples. Query strategies based on informativeness (uncertainty), representativeness (clustering), or a hybrid of both (e.g., DBAL) can be implemented straightforwardly in Python using pretrained deep models and standard clustering libraries.

Tags: Python, deep learning, active learning, uncertainty sampling, ResNet18, k-means clustering, DBAL
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
