Low‑Resource NLP Pretraining: Methodology, Experiments, and Zero‑Shot Applications
This article presents a low‑resource NLP pretraining approach that combines transformer‑based language modeling with contrastive vector learning. It details the unsupervised construction of positive sample pairs, introduces a camel‑shaped masking distribution, and demonstrates through extensive experiments that the resulting model achieves strong zero‑shot NLU, NLG, and retrieval performance while requiring minimal compute and data.
The talk, delivered by Baidu NLP algorithm engineer Xue Changshang, introduces recent explorations in NLP pretraining and practice on downstream tasks, focusing on a low‑resource pretraining method that can be launched with a single GPU and a modest amount of data.
Pretraining is justified by the high cost of maintaining many task‑specific models; a unified pretraining stage can reduce model proliferation, improve zero‑shot NLG, NLU, and vector‑based inference, and provide better recall than traditional BM25.
The proposed architecture extends a Transformer encoder‑decoder with an additional vector‑representation head. Training jointly optimizes a language‑modeling objective and a contrastive‑learning objective via Total Loss = LM Loss + α · CL Loss, where α balances the two.
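The joint objective can be sketched as follows. This is a minimal NumPy illustration, not the talk's actual implementation: it assumes an InfoNCE‑style contrastive loss with in‑batch negatives and a temperature of 0.05, both common conventions that the source does not specify.

```python
import numpy as np

def contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss: each anchor's positive is the same-index row of
    `positives`; every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on the diagonal

def total_loss(lm_loss, cl_loss, alpha=1.0):
    """Total Loss = LM Loss + alpha * CL Loss, as in the talk."""
    return lm_loss + alpha * cl_loss
```

When every anchor matches its positive exactly and the pairs are mutually orthogonal, the contrastive term is close to zero, so the total loss reduces to the language‑modeling term.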
Positive sample pairs are mined without supervision by splitting documents into sentences, enumerating sentence pairs, and keeping those whose longest common substring (LCS) exceeds a threshold, emphasizing relevance rather than strict semantic equivalence; frequency caps and reverse‑cloze augmentation further enrich the pair set.
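The mining step above can be sketched in a few lines. The LCS threshold and the `max_uses` frequency cap below are illustrative placeholders, not the talk's actual values, and `difflib` stands in for whatever substring matcher was used in practice.

```python
from difflib import SequenceMatcher
from itertools import combinations

def lcs_len(a, b):
    """Length of the longest common (contiguous) substring of a and b."""
    m = SequenceMatcher(None, a, b, autojunk=False)
    return m.find_longest_match(0, len(a), 0, len(b)).size

def mine_pairs(sentences, threshold=6, max_uses=2):
    """Enumerate sentence pairs and keep those whose LCS length clears the
    threshold; `max_uses` caps how often a sentence may appear, a stand-in
    for the frequency caps mentioned in the talk."""
    uses = {s: 0 for s in sentences}
    pairs = []
    for a, b in combinations(sentences, 2):
        if uses[a] >= max_uses or uses[b] >= max_uses:
            continue
        if lcs_len(a, b) >= threshold:
            pairs.append((a, b))
            uses[a] += 1
            uses[b] += 1
    return pairs
```

Note that an LCS criterion captures surface overlap, not paraphrase: two sentences sharing a long literal span are taken as related, which matches the talk's stated emphasis on relevance over semantic equivalence.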
Masking is adapted for short, fragmented sentences by replacing the standard geometric distribution with a camel‑shaped distribution that gives higher probability to the most suitable mask length, improving robustness on short‑sentence corpora.
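A minimal sketch of such a masking scheme follows. The probability table is purely illustrative (the talk does not publish the distribution's actual shape or values); the key contrast with the standard geometric distribution is that the probability mass is humped at a preferred span length rather than decreasing monotonically from length 1.

```python
import random

# Hypothetical "camel-shaped" span-length distribution: a hump at the
# length deemed most suitable for short, fragmented sentences.
# These probabilities are illustrative, not the talk's actual values.
CAMEL_PMF = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20, 5: 0.12, 6: 0.08}

def sample_span_length(rng=random):
    """Draw a mask-span length by inverse-CDF sampling from CAMEL_PMF."""
    r = rng.random()
    acc = 0.0
    for length, p in CAMEL_PMF.items():
        acc += p
        if r < acc:
            return length
    return max(CAMEL_PMF)  # fallback for floating-point edge cases

def mask_sentence(tokens, mask_ratio=0.15, mask_token="[MASK]", rng=random):
    """Mask contiguous spans until roughly mask_ratio of tokens are covered."""
    out = list(tokens)
    budget = max(1, int(len(tokens) * mask_ratio))
    while budget > 0:
        span = min(sample_span_length(rng), budget, len(out))
        start = rng.randrange(0, len(out) - span + 1)
        out[start:start + span] = [mask_token] * span
        budget -= span
    return out
```

On a short sentence, the hump keeps masked spans from being dominated by length‑1 masks (as a geometric distribution would favor) while still rarely consuming most of the sentence at once.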
Experiments compare several variants (GUR‑FULL, UR‑LCS, UR‑CL, GUR‑LM, NLPC) using a T5‑small backbone continually pretrained on Wikipedia, WikiBooks, CSL, and Baidu's own noisy corpus. Results show that models with contrastive vector learning consistently outperform BM25 in retrieval and achieve strong zero‑shot and few‑shot NLU/NLG performance, with the GUR model excelling in low‑sample regimes.
The conclusion highlights that the joint training paradigm does not cause objective conflict, enables zero‑shot inference after a single pretraining run, and is suitable for business units seeking cost‑effective NLP solutions, while suggesting future work on larger models and broader applications.
DataFunTalk