Summary of Methods and Findings from the NLP Chinese Pre‑training Model Generalization Challenge
The article reviews the Chinese NLP pre‑training model generalization competition, detailing data preprocessing, augmentation, external data usage, model scaling and architecture tweaks, loss functions, learning‑rate and adversarial training strategies, regularization techniques, post‑processing optimizations, and ineffective methods, highlighting their impact on performance metrics.
Introduction
The NLP Chinese Pre‑training Model Generalization Challenge, jointly organized by CLUE, Alibaba Cloud, and Leyan Technology, invited participants to develop models with strong generalization ability rather than simple task‑specific fine‑tuning, aiming to deepen understanding of pre‑training mechanisms.
Data
Data quality and quantity are crucial. Simple preprocessing (emoji replacement, punctuation conversion) contributed about 0.2% improvement. Data augmentation methods such as EDA (random word/character insertions, deletions, swaps) and back‑translation added diversity, also yielding ~0.2% gains. External labeled data and large amounts of unlabeled data used for self‑training raised performance by roughly 1.5%, while semi‑supervised learning with test data added another 0.8%.
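The EDA-style operations mentioned above (random insertions, deletions, swaps) can be sketched as a minimal character-level augmenter; the function name and parameters below are illustrative, not from the article:

```python
import random

def eda_augment(text, p_delete=0.1, n_swaps=1, seed=None):
    """Minimal EDA-style augmentation: random character deletion
    followed by random position swaps. Whitespace is never deleted."""
    rng = random.Random(seed)
    # keep each non-whitespace character with probability 1 - p_delete
    chars = [c for c in text if not c.strip() or rng.random() > p_delete]
    # swap n_swaps random pairs of positions to add word-order noise
    for _ in range(n_swaps):
        if len(chars) >= 2:
            i, j = rng.sample(range(len(chars)), 2)
            chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```

With `p_delete=0` the output is a permutation of the input, which makes the behavior easy to verify before enabling deletion.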
Model
Model size proved most impactful: larger pre‑training models (e.g., BERT‑large vs. BERT‑base) improved metrics by ~4.5%, and even larger models (BERT‑xlarge) added another ~3%.
Various architectural tweaks were explored, including using the last four BERT layers plus the pooler, shared‑parameter multi‑task structures with task‑specific queries, task‑specific LSTM/Transformer heads, sentence‑level average pooling instead of CLS, staged training with parameter fusion, and treating all tasks as a single 25‑class problem with task‑type embeddings.
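The "last four BERT layers" idea above can be sketched as a small head that concatenates the [CLS] vector from the final four encoder layers before classifying; the class name, sizes, and the assumption that per-layer hidden states are available (e.g. `output_hidden_states=True` in HuggingFace BERT) are illustrative:

```python
import torch
import torch.nn as nn

class LastFourLayersHead(nn.Module):
    """Concatenate the [CLS] vector from the last four encoder layers,
    then project to the task's label space."""
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(4 * hidden_size, num_labels)

    def forward(self, all_hidden_states):
        # all_hidden_states: sequence of (batch, seq_len, hidden) tensors,
        # one per layer, ordered from the embedding layer to the top layer
        cls_vecs = [h[:, 0] for h in all_hidden_states[-4:]]
        return self.classifier(torch.cat(cls_vecs, dim=-1))
```

Pooling over several top layers instead of only the final [CLS] is a common way to expose both surface and semantic features to the task head.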
Loss Functions
Different tasks required different losses (cross‑entropy for classification, MSE for regression). Techniques such as label smoothing combined with embedding mixup raised scores by about 0.5%.
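A minimal sketch of the label-smoothing part of that technique, assuming a standard smoothing factor `eps` (the value is an assumption, not from the article):

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, eps=0.1):
    """Cross-entropy with label smoothing: the target distribution puts
    1 - eps on the true class and spreads eps over the other classes."""
    n = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    true_dist = torch.full_like(log_probs, eps / (n - 1))
    true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return -(true_dist * log_probs).sum(dim=-1).mean()
```

With `eps=0` this reduces exactly to ordinary cross-entropy, which is a convenient sanity check.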
Optimization Methods
Learning‑rate scheduling, adversarial training (FGM, PGD), EMA (exponential moving average), and SWA (stochastic weight averaging) were applied, delivering improvements ranging from 0.5% to 1.8%.
Regularization
Regularization strategies—including word mixup, dropout, and early stopping—helped mitigate overfitting, each contributing roughly 0.5% to 1.8% gains.
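Of these, early stopping is the most mechanical; a minimal tracker looks like the following (the `patience` and `min_delta` values are assumptions):

```python
class EarlyStopping:
    """Stop training when the validation metric (higher is better)
    fails to improve for `patience` consecutive epochs."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0
        self.should_stop = False

    def step(self, metric):
        if self.best is None or metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.should_stop = True
        return self.should_stop
```

The tracker is called once per epoch with the validation score; training loops break as soon as `step` returns `True`.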
Post‑processing
Threshold optimization for classification decisions added about 0.4% improvement.
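Threshold optimization is typically a grid search over validation predictions; a sketch, where the grid range and step are assumptions:

```python
import numpy as np

def best_threshold(probs, labels, metric):
    """Grid-search the binary decision threshold that maximizes `metric`
    (a callable taking true labels and hard predictions) on validation data."""
    best_t, best_score = 0.5, -1.0
    for t in np.arange(0.1, 0.9, 0.01):
        preds = (probs >= t).astype(int)
        score = metric(labels, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

The tuned threshold is then applied unchanged to test-set probabilities, replacing the default 0.5 cutoff.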
Other Attempts
Several explored methods, such as additional pre‑training on task data, focal loss, soft‑F1 loss, and dynamic loss weighting, showed little or no effect. Training tricks like gradient accumulation and staged training were used to handle large models and speed up convergence.
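Gradient accumulation, mentioned as a trick for fitting large models, can be sketched as follows (the loop structure and `accum_steps` value are illustrative):

```python
import torch

def train_with_accumulation(model, optimizer, loss_fn, batches, accum_steps=4):
    """Accumulate gradients over `accum_steps` micro-batches before each
    optimizer step, simulating a batch accum_steps times larger."""
    model.train()
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        # divide by accum_steps so accumulated grads average, not sum
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

This trades wall-clock time for memory: the effective batch size grows without increasing peak GPU usage per forward pass.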
References
The article lists numerous papers and resources on data preprocessing, augmentation, self‑training, multi‑task learning, loss functions, learning‑rate effects, adversarial training, EMA, SWA, mixup, dropout, early stopping, and self‑knowledge distillation.
DataFunTalk
DataFunTalk is dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. It regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.