Boost Data Annotation Efficiency with VAPAL: Active Learning Meets Virtual Adversarial Perturbation
This article explains how a pool‑based active learning framework that combines uncertainty sampling (via BADGE, ALPS, or virtual adversarial perturbations) with diversity‑driven clustering can dramatically cut labeling costs for Transformer‑based NLP models. It also presents experimental results showing VAPAL's competitive performance and its advantage in the early, low‑resource stages of annotation.
Business Background
Transformer‑based natural language processing models achieve strong results in industry, but they require large amounts of labeled data, which is costly and time‑consuming to produce. Reducing annotation cost is therefore a critical problem, and active learning (AL) has been adopted as a key technique for improving labeling efficiency.
Solution Overview
2.1 Common Considerations: Uncertainty and Diversity
In our pool‑based AL scenario we select a batch of unlabeled samples from an existing data pool, have them annotated by an oracle, and add them to the training set iteratively.
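The loop described above can be sketched in a few lines. Here `train_model`, `query_strategy`, and `oracle_label` are hypothetical stand‑ins for the model trainer, batch‑selection strategy, and human annotator; they are not part of any specific library:

```python
def active_learning_loop(labeled, unlabeled, rounds, batch_size,
                         train_model, query_strategy, oracle_label):
    """Iteratively move the most informative samples from the
    unlabeled pool into the training set."""
    model = train_model(labeled)
    for _ in range(rounds):
        # 1. Select a batch from the unlabeled pool.
        batch = query_strategy(model, unlabeled, batch_size)
        # 2. Send the batch to an oracle (human annotator).
        labeled += [(x, oracle_label(x)) for x in batch]
        # 3. Remove the batch from the pool and retrain.
        unlabeled = [x for x in unlabeled if x not in batch]
        model = train_model(labeled)
    return model, labeled
```

Any of the query strategies below (uncertainty, diversity, or a hybrid like VAPAL) plugs into the `query_strategy` slot.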
The two most common query strategies are:
Uncertainty Sampling – selecting samples the model is least confident about (least‑confident, smallest‑margin, entropy).
Diversity Sampling – selecting samples that best represent the overall data distribution, often via clustering.
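As an illustration, the three classic uncertainty scores can be computed directly from softmax probabilities (a minimal sketch; input shape is `[n_samples, n_classes]`, and higher scores mean more uncertain samples):

```python
import numpy as np

def least_confident(probs):
    """1 minus the probability of the predicted class."""
    return 1.0 - probs.max(axis=1)

def smallest_margin(probs):
    """Negated gap between the top two classes, so a smaller
    margin yields a larger (more uncertain) score."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return -(top2[:, 1] - top2[:, 0])

def entropy(probs, eps=1e-12):
    """Shannon entropy of the predictive distribution."""
    return -(probs * np.log(probs + eps)).sum(axis=1)
```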
2.2 Can We Have Both?
Recent work combines uncertainty and diversity by first converting uncertainty into a computable representation (e.g., BADGE or ALPS) and then clustering these representations to ensure diverse selection.
BADGE uses gradient‑based uncertainty representations.
ALPS leverages the loss of a masked language model as an uncertainty signal.
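The shared "represent, then cluster" step can be sketched with k‑means++ seeding, which BADGE uses to pick a batch that is both high‑magnitude (uncertain) and spread out (diverse). The `embeddings` here stand in for whatever uncertainty representation the strategy produces (gradient embeddings for BADGE, MLM‑loss signals for ALPS):

```python
import numpy as np

def kmeans_pp_select(embeddings, batch_size, seed=0):
    """k-means++ seeding over uncertainty embeddings: the first point
    is picked uniformly at random; each subsequent point is sampled
    with probability proportional to its squared distance from the
    nearest point already chosen, which favors a diverse batch."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    chosen = [int(rng.integers(n))]
    # Squared distance from every point to its nearest chosen point.
    d2 = ((embeddings - embeddings[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < batch_size:
        nxt = int(rng.choice(n, p=d2 / d2.sum()))
        chosen.append(nxt)
        d2 = np.minimum(d2, ((embeddings - embeddings[nxt]) ** 2).sum(axis=1))
    return chosen
```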
2.3 Leveraging Virtual Adversarial Perturbation for Uncertainty
To improve robustness and handle noisy samples, we adopt Virtual Adversarial Perturbation (VAP), a technique originally developed for image tasks. We generate VAPs for BERT's hidden states, define a new uncertainty measure from them, and propose the VAPAL algorithm, which follows the same uncertainty‑plus‑clustering pipeline.
VAPAL algorithm flow (each query round):
1. Encode every unlabeled sample with BERT and compute a virtual adversarial perturbation of its hidden states.
2. Use the output change induced by the perturbation as the sample's uncertainty representation.
3. Cluster these representations and select a diverse batch for annotation, as in BADGE and ALPS.
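A toy sketch of the VAP‑style uncertainty measure: perturb a hidden representation within a small ball and score the sample by the largest KL divergence the perturbation can induce in the model's output. For simplicity this sketch uses random‑direction search in place of the power‑iteration approximation used in virtual adversarial training, and a generic `classifier` function stands in for BERT's classification head:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete probability distributions."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

def vap_uncertainty(classifier, h, epsilon=0.1, n_dirs=64, seed=0):
    """Approximate max over ||r|| = epsilon of KL(p(h) || p(h + r)).

    Random search is a crude stand-in for the power-iteration step;
    a higher value means the prediction is easier to flip, i.e. the
    sample sits closer to the decision boundary and is more uncertain.
    """
    p = classifier(h)
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        d = rng.normal(size=h.shape)
        r = epsilon * d / np.linalg.norm(d)  # point on the epsilon-sphere
        best = max(best, kl_divergence(p, classifier(h + r)))
    return best
```

In VAPAL this score (and the perturbed outputs behind it) would be computed on BERT's hidden states and fed into the clustering step above rather than used as a standalone ranking.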
Experimental Validation
We evaluated VAPAL on the English datasets PUBMED and SST‑2.
Results show:
On public benchmarks VAPAL achieves performance comparable to state‑of‑the‑art BADGE and ALPS, making it a competitive AL candidate.
VAPAL performs better in the early stages, indicating superior efficiency when annotation resources are extremely limited.
The VAPAL algorithm has been integrated into our internal human‑in‑the‑loop annotation platform, delivering up to 10× speed‑up and significant gains in label quality and topic granularity.
Conclusion
By introducing active learning and a virtual adversarial perturbation‑based uncertainty measure, we substantially improved data‑annotation efficiency for our NLP services. VAPAL matches or exceeds current best methods, especially in low‑resource phases, and future work will address seed‑sensitivity of selection strategies.
References
Scheffer, T., Decomain, C., & Wrobel, S. (2001). Active hidden Markov models for information extraction. LNCS, 2189, 309–318.
Dagan, I., & Engelson, S. P. (1995). Committee‑Based Sampling For Training Probabilistic Classifiers. ICML.
Culotta, A., & McCallum, A. (2005). Reducing labeling effort for structured prediction tasks. AAAI.
Monarch, R. (2021). Human‑in‑the‑Loop Machine Learning. Manning Publications.
Settles, B. (2009). Active Learning Literature Survey.
Miyato, T., Maeda, S. I., Koyama, M., & Ishii, S. (2019). Virtual Adversarial Training: A Regularization Method for Supervised and Semi‑Supervised Learning. IEEE TPAMI.
Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., & Agarwal, A. (2020). Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. ICLR.
Yuan, M., Lin, H.-T., & Boyd‑Graber, J. (2020). Cold‑start Active Learning through Self‑supervised Language Modeling. EMNLP.
Zhang, H., Zhang, Z., Jiang, H., & Song, Y. (2022). Uncertainty Sentence Sampling by Virtual Adversarial Perturbation.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang