Boost Data Annotation Efficiency with VAPAL: Active Learning Meets Virtual Adversarial Perturbation
This article explains how a pool‑based active learning framework that combines uncertainty sampling (via BADGE, ALPS, or virtual adversarial perturbations) with diversity‑driven clustering can dramatically cut labeling costs for Transformer‑based NLP models. It also presents experimental results showing VAPAL's competitive performance and its advantage in the early, low‑resource stages of annotation.
Business Background
Transformer‑based natural language processing models achieve strong results in industry, but they require large amounts of labeled data, which is costly and time‑consuming to produce. Reducing annotation cost is therefore a critical problem, and active learning (AL) has been adopted as a key technique for improving labeling efficiency.
Solution Overview
2.1 Common Considerations: Uncertainty and Diversity
In our pool‑based AL scenario we select a batch of unlabeled samples from an existing data pool, have them annotated by an oracle, and add them to the training set iteratively.
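The loop described above can be sketched in a few lines. Here `train_model`, `query_strategy`, and `oracle_label` are hypothetical stand‑ins for the model trainer, batch‑selection strategy, and human annotator; they are not part of any specific library:

```python
def active_learning_loop(labeled, unlabeled, rounds, batch_size,
                         train_model, query_strategy, oracle_label):
    """Iteratively move the most informative samples from the
    unlabeled pool into the training set."""
    model = train_model(labeled)
    for _ in range(rounds):
        # 1. Select a batch from the unlabeled pool.
        batch = query_strategy(model, unlabeled, batch_size)
        # 2. Send the batch to an oracle (human annotator).
        labeled += [(x, oracle_label(x)) for x in batch]
        # 3. Remove the batch from the pool and retrain.
        unlabeled = [x for x in unlabeled if x not in batch]
        model = train_model(labeled)
    return model, labeled
```

Any of the query strategies below (uncertainty, diversity, or a hybrid like VAPAL) plugs into the `query_strategy` slot.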
The two most common query strategies are:
Uncertainty Sampling – selecting samples the model is least confident about (least‑confident, smallest‑margin, entropy).
Diversity Sampling – selecting samples that best represent the overall data distribution, often via clustering.
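As an illustration, the three classic uncertainty scores can be computed directly from softmax probabilities (a minimal sketch; input shape is `[n_samples, n_classes]`, and higher scores mean more uncertain samples):

```python
import numpy as np

def least_confident(probs):
    """1 minus the probability of the predicted class."""
    return 1.0 - probs.max(axis=1)

def smallest_margin(probs):
    """Negated gap between the top two classes, so a smaller
    margin yields a larger (more uncertain) score."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return -(top2[:, 1] - top2[:, 0])

def entropy(probs, eps=1e-12):
    """Shannon entropy of the predictive distribution."""
    return -(probs * np.log(probs + eps)).sum(axis=1)
```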
2.2 Can We Have Both?
Recent work combines uncertainty and diversity by first converting uncertainty into a computable representation (e.g., BADGE or ALPS) and then clustering these representations to ensure diverse selection.
BADGE uses gradient‑based uncertainty representations.
ALPS leverages the loss of a masked language model as an uncertainty signal.
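The shared "represent, then cluster" step can be sketched with k‑means++ seeding, which BADGE uses to pick a batch that is both high‑magnitude (uncertain) and spread out (diverse). The `embeddings` here stand in for whatever uncertainty representation the strategy produces (gradient embeddings for BADGE, MLM‑loss signals for ALPS):

```python
import numpy as np

def kmeans_pp_select(embeddings, batch_size, seed=0):
    """k-means++ seeding over uncertainty embeddings: the first point
    is picked uniformly at random; each subsequent point is sampled
    with probability proportional to its squared distance from the
    nearest point already chosen, which favors a diverse batch."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    chosen = [int(rng.integers(n))]
    # Squared distance from every point to its nearest chosen point.
    d2 = ((embeddings - embeddings[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < batch_size:
        nxt = int(rng.choice(n, p=d2 / d2.sum()))
        chosen.append(nxt)
        d2 = np.minimum(d2, ((embeddings - embeddings[nxt]) ** 2).sum(axis=1))
    return chosen
```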
2.3 Leveraging Virtual Adversarial Perturbation for Uncertainty
To improve robustness and handle noisy samples, we adopt Virtual Adversarial Perturbation (VAP), a technique originally developed for image tasks. We generate VAPs for BERT's hidden states, define a new uncertainty measure from them, and propose the VAPAL algorithm, which follows the same uncertainty‑plus‑clustering pipeline.
VAPAL algorithm flow (each query round):
1. Encode every unlabeled sample with BERT and compute a virtual adversarial perturbation of its hidden states.
2. Use the output change induced by the perturbation as the sample's uncertainty representation.
3. Cluster these representations and select a diverse batch for annotation, as in BADGE and ALPS.
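A toy sketch of the VAP‑style uncertainty measure: perturb a hidden representation within a small ball and score the sample by the largest KL divergence the perturbation can induce in the model's output. For simplicity this sketch uses random‑direction search in place of the power‑iteration approximation used in virtual adversarial training, and a generic `classifier` function stands in for BERT's classification head:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete probability distributions."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

def vap_uncertainty(classifier, h, epsilon=0.1, n_dirs=64, seed=0):
    """Approximate max over ||r|| = epsilon of KL(p(h) || p(h + r)).

    Random search is a crude stand-in for the power-iteration step;
    a higher value means the prediction is easier to flip, i.e. the
    sample sits closer to the decision boundary and is more uncertain.
    """
    p = classifier(h)
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_dirs):
        d = rng.normal(size=h.shape)
        r = epsilon * d / np.linalg.norm(d)  # point on the epsilon-sphere
        best = max(best, kl_divergence(p, classifier(h + r)))
    return best
```

In VAPAL this score (and the perturbed outputs behind it) would be computed on BERT's hidden states and fed into the clustering step above rather than used as a standalone ranking.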
Experimental Validation
We evaluated VAPAL on the English datasets PUBMED and SST‑2.
Results show:
On public benchmarks VAPAL achieves performance comparable to state‑of‑the‑art BADGE and ALPS, making it a competitive AL candidate.
VAPAL performs better in the early stages, indicating superior efficiency when annotation resources are extremely limited.
The VAPAL algorithm has been integrated into our internal human‑in‑the‑loop annotation platform, delivering up to 10× speed‑up and significant gains in label quality and topic granularity.
Conclusion
By introducing active learning and a virtual adversarial perturbation‑based uncertainty measure, we substantially improved data‑annotation efficiency for our NLP services. VAPAL matches or exceeds current best methods, especially in low‑resource phases, and future work will address seed‑sensitivity of selection strategies.
References
Scheffer, T., Decomain, C., & Wrobel, S. (2001). Active hidden Markov models for information extraction. LNCS, 2189, 309–318.
Dagan, I., & Engelson, S. P. (1995). Committee‑Based Sampling For Training Probabilistic Classifiers. ICML.
Culotta, A., & McCallum, A. (2005). Reducing labeling effort for structured prediction tasks. AAAI.
Monarch, R. (2021). Human‑in‑the‑Loop Machine Learning. Manning Publications.
Settles, B. (2009). Active Learning Literature Survey.
Miyato, T., Maeda, S. I., Koyama, M., & Ishii, S. (2019). Virtual Adversarial Training: A Regularization Method for Supervised and Semi‑Supervised Learning. IEEE TPAMI.
Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., & Agarwal, A. (2020). Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. ICLR.
Yuan, M., Lin, H.-T., & Boyd‑Graber, J. (2020). Cold‑start Active Learning through Self‑supervised Language Modeling. EMNLP.
Zhang, H., Zhang, Z., Jiang, H., & Song, Y. (2022). Uncertainty Sentence Sampling by Virtual Adversarial Perturbation.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang