
Summary of Methods and Findings from the NLP Chinese Pre‑training Model Generalization Challenge

The article reviews the Chinese NLP pre‑training model generalization competition, detailing data preprocessing, augmentation, external data usage, model scaling and architecture tweaks, loss functions, learning‑rate and adversarial training strategies, regularization techniques, post‑processing optimizations, and ineffective methods, highlighting their impact on performance metrics.

DataFunTalk

Introduction

The NLP Chinese Pre‑training Model Generalization Challenge, jointly organized by CLUE, Alibaba Cloud, and Leyan Technology, invited participants to develop models with strong generalization ability rather than simple task‑specific fine‑tuning, aiming to deepen understanding of pre‑training mechanisms.

Data

Data quality and quantity are crucial. Simple preprocessing (emoji replacement, punctuation conversion) contributed about 0.2% improvement. Data augmentation methods such as EDA (random word/character insertions, deletions, swaps) and back‑translation added diversity, also yielding ~0.2% gains. External labeled data and large amounts of unlabeled data used for self‑training raised performance by roughly 1.5%, while semi‑supervised learning with test data added another 0.8%.
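The EDA-style operations mentioned above (random swaps and deletions over tokens) can be sketched as follows. This is a minimal illustration, not the competition's actual augmentation code; the function name, probabilities, and seeding are assumptions:

```python
import random

def eda_augment(tokens, p_swap=0.1, p_delete=0.1, seed=None):
    """Lightweight EDA-style augmentation: random swaps and deletions.

    `tokens` is a list of characters or words (for Chinese text,
    character-level lists are common). Probabilities are illustrative.
    """
    rng = random.Random(seed)
    out = list(tokens)
    # Random swap: exchange two positions, roughly p_swap * len times.
    for _ in range(max(1, int(len(out) * p_swap))):
        if len(out) >= 2:
            i, j = rng.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
    # Random deletion: drop each token with probability p_delete,
    # but never return an empty sequence.
    kept = [t for t in out if rng.random() > p_delete]
    return kept if kept else out[:1]
```

Insertion (synonym replacement) and back-translation would require an external lexicon or translation model, so they are omitted here.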

Model

Model size proved most impactful: larger pre‑training models (e.g., BERT‑large vs. BERT‑base) improved metrics by ~4.5%, and even larger models (BERT‑xlarge) added another ~3%.

Various architectural tweaks were explored, including:

- using the last four BERT layers plus the pooler output;
- shared-parameter multi-task structures with task-specific queries;
- task-specific LSTM/Transformer heads;
- sentence-level average pooling instead of the CLS vector;
- staged training with parameter fusion;
- treating all tasks as a single 25-class problem with task-type embeddings.
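Two of these tweaks, taking the last four encoder layers and replacing CLS with masked average pooling, can be combined in one small function. A NumPy sketch under assumed shapes (a real implementation would operate on the hidden-state tuple returned by a BERT model):

```python
import numpy as np

def pool_last_layers(hidden_states, attention_mask):
    """Average-pool the concatenation of the last four layers.

    hidden_states: list of (seq_len, dim) arrays, one per layer
    (a stand-in for a BERT model's per-layer outputs).
    attention_mask: (seq_len,) array of 0/1, so padding positions
    do not contribute to the sentence vector.
    """
    last4 = np.concatenate(hidden_states[-4:], axis=-1)  # (seq_len, 4*dim)
    mask = attention_mask[:, None]                        # (seq_len, 1)
    summed = (last4 * mask).sum(axis=0)                   # mask out padding
    return summed / mask.sum()                            # mean over real tokens
```

The resulting vector (dimension 4*dim) would then feed a classification head in place of the pooler output.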

Loss Functions

Different tasks required different losses (cross‑entropy for classification, MSE for regression). Techniques such as label smoothing combined with embedding mixup raised scores by about 0.5%.
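Label smoothing replaces the one-hot target with a softened distribution: the true class gets probability 1 - eps and the remainder is spread over the other classes. A pure-Python sketch for a single example (eps=0.1 is illustrative, not the competition's setting):

```python
import math

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution."""
    n = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Smoothed target: 1 - eps on the true class, eps shared elsewhere.
    smooth = [eps / (n - 1)] * n
    smooth[target] = 1.0 - eps
    return -sum(q * lp for q, lp in zip(smooth, log_probs))
```

With eps=0 this reduces to ordinary cross-entropy; the embedding-mixup half of the combination is sketched under Regularization below.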

Optimization Methods

Learning‑rate scheduling, adversarial training (FGM, PGD), exponential moving average (EMA) of model weights, and stochastic weight averaging (SWA) were applied, delivering improvements ranging from 0.5% to 1.8%.
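The core of FGM is a single perturbation of the embedding matrix along the normalized gradient direction. A pure-Python sketch of that step (in a real training loop the loss is recomputed on the perturbed embeddings and the perturbation is then restored):

```python
import math

def fgm_perturb(embedding, grad, epsilon=1.0):
    """FGM step: move the embedding by epsilon along the unit gradient.

    `embedding` and `grad` are flat lists standing in for the
    embedding parameters and their gradient; epsilon is illustrative.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0:
        return list(embedding)  # no gradient signal, nothing to perturb
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]
```

PGD differs by taking several smaller such steps, projecting back into an epsilon-ball after each; EMA and SWA instead average model weights across training steps or checkpoints.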

Regularization

Regularization strategies—including word mixup, dropout, and early stopping—helped mitigate overfitting, each contributing roughly 0.5% to 1.8% gains.
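Mixup interpolates both inputs (here, word or sentence embeddings) and labels between two training examples, with the interpolation weight drawn from a Beta distribution. A minimal sketch; alpha=0.2 and the function shape are assumptions:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two (embedding, one-hot label) pairs with a Beta(alpha, alpha) weight."""
    rng = rng or random
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

The mixed label is a proper distribution (it still sums to 1), which is why mixup pairs naturally with soft-target losses such as the label-smoothed cross-entropy above.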

Post‑processing

Threshold optimization for classification decisions added about 0.4% improvement.
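Threshold optimization simply searches, on a validation set, for the decision cutoff that maximizes the target metric instead of defaulting to 0.5. A minimal sketch using accuracy (the competition entry may have optimized F1 or another metric):

```python
def best_threshold(probs, labels, grid=None):
    """Grid-search the binary decision threshold that maximizes accuracy.

    probs: predicted positive-class probabilities; labels: 0/1 truths.
    """
    grid = grid or [i / 100 for i in range(1, 100)]

    def acc(t):
        return sum((p >= t) == bool(y) for p, y in zip(probs, labels)) / len(labels)

    return max(grid, key=acc)
```

For multi-class problems the same idea applies per class, or to the margin between the top two predicted classes.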

Other Attempts

Several explored methods, such as additional pre‑training on task data, focal loss, soft‑F1 loss, and dynamic loss weighting, showed little or no effect. Training tricks like gradient accumulation and staged training were used to handle large models and speed up convergence.
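Gradient accumulation lets a large model train with an effectively larger batch: gradients from several micro-batches are averaged before a single optimizer step. A list-based sketch of the arithmetic (in a framework like PyTorch this corresponds to calling the backward pass per micro-batch and stepping the optimizer every `accum_steps` batches):

```python
def accumulate_gradients(micro_grads, accum_steps):
    """Average per-micro-batch gradients into one effective-batch gradient.

    micro_grads: list of flat gradient vectors, one per micro-batch.
    """
    assert len(micro_grads) == accum_steps
    dim = len(micro_grads[0])
    return [sum(g[i] for g in micro_grads) / accum_steps for i in range(dim)]
```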

References

The article lists numerous papers and resources on data preprocessing, augmentation, self‑training, multi‑task learning, loss functions, learning‑rate effects, adversarial training, EMA, SWA, mixup, dropout, early stopping, and self‑knowledge distillation.

Tags: data augmentation, model optimization, NLP, pretraining, loss functions, regularization
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
