Large‑Model and Small‑Model Interaction: Knowledge Distillation and Reverse Distillation Techniques
This article explains how large‑scale NLP models can be paired with smaller models through task‑related and task‑unrelated knowledge distillation, progressive multi‑stage distillation, and reverse distillation, thereby reducing training costs, accelerating inference, and even allowing small models to improve large‑model training via sample‑value assessment.
Recent advances in NLP have produced models with billions to trillions of parameters, making training and deployment prohibitively expensive. For example, scaling a baseline 40‑million‑parameter model to 1.5 billion parameters raises cost by 37×, and scaling it to tens of billions of parameters raises cost by 70 to 80×.
To address this, the speaker proposes a linked learning approach between large and small models: small models learn from large models via knowledge distillation, achieving comparable downstream performance while being lighter and faster, and small models can in turn feed back to improve large‑model training.
Knowledge Distillation Basics: Knowledge distillation is a teacher‑student framework in which a complex, high‑capacity teacher guides a simpler student model, transferring its learned representations to improve the student’s generalization.
Pre‑training Tasks: The standard pipeline consists of (1) pre‑training on massive unlabeled data to obtain a base model, and (2) fine‑tuning on task‑specific labeled data; this two‑stage process is essentially transfer learning.
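The teacher‑student idea above can be sketched as a loss function. This is a minimal illustration, not the speaker's implementation: it assumes the common soft‑target formulation (temperature‑scaled KL divergence blended with hard‑label cross‑entropy), and the names `distillation_loss`, `temperature`, and `alpha` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with a hard-label term (assumed formulation).

    alpha weights the teacher signal; (1 - alpha) weights the ground truth.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy against the ground-truth label at temperature 1
    ce = -math.log(softmax(student_logits)[label])
    # temperature**2 keeps the soft-term gradient scale comparable
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(distillation_loss([0.5, 1.0, 2.5], teacher, label=2))
```

A higher `temperature` flattens the teacher distribution so the student also sees the relative ranking of the wrong classes, which is where much of the transferred knowledge lives.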
Distillation can be task‑related (performed during fine‑tuning, using a task‑specific teacher) or task‑unrelated (performed during pre‑training with self‑supervised objectives). Task‑related distillation yields higher performance but incurs higher maintenance cost as many teachers are needed; task‑unrelated distillation is easier to apply but offers a lower performance ceiling.
The proposed progressive distillation scheme extends the traditional two‑stage process by inserting additional intermediate stages, each altering only one element (teacher, data, or objective). For example, stage 2 replaces a generic teacher with a fine‑tuned teacher, stage 3 swaps generic large‑scale data for task‑specific data, and the final stage completes full distillation.
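The "one element per stage" rule can be written down as a small schedule check. This is a sketch under stated assumptions: the stage names and values are hypothetical placeholders, and treating the final stage as the switch to a task objective is an interpretation of "completes full distillation", not something the talk specifies.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Stage:
    teacher: str
    data: str
    objective: str


# Hypothetical schedule: each stage changes exactly one element.
STAGES = [
    Stage("generic_teacher", "generic_corpus", "generic_objective"),
    Stage("finetuned_teacher", "generic_corpus", "generic_objective"),
    Stage("finetuned_teacher", "task_data", "generic_objective"),
    # Assumption: the last stage switches the objective to the task objective.
    Stage("finetuned_teacher", "task_data", "task_objective"),
]


def changed_elements(prev, curr):
    """Return the names of the fields that differ between two stages."""
    return [f.name for f in fields(Stage)
            if getattr(prev, f.name) != getattr(curr, f.name)]


def validate(stages):
    """Raise if any transition changes more or fewer than one element."""
    for i, (prev, curr) in enumerate(zip(stages, stages[1:]), start=2):
        diff = changed_elements(prev, curr)
        if len(diff) != 1:
            raise ValueError(f"stage {i} changes {diff}, expected one element")


if __name__ == "__main__":
    validate(STAGES)
    for prev, curr in zip(STAGES, STAGES[1:]):
        print(changed_elements(prev, curr))
```

Changing one element at a time keeps each step small, so the student never has to adapt to a new teacher, new data, and a new objective at once.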
Experimental results on a four‑layer model show a 9.4× speedup over BERT‑Base with comparable accuracy (78 vs 79.6). Multi‑student distillation further allows a single teacher to produce several student models of different sizes in one training pass.
Reverse Distillation: Small models act as teachers for large models during early training (the KIPT framework), which accelerates convergence. In a dual‑tower matching scenario, adding reverse distillation noticeably improves the evaluation metrics.
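One way to restrict the small teacher's influence to early training is to fade its loss weight over time. The sketch below is an assumption for illustration only: the talk does not state the schedule, so the linear decay, the blending formula, and the names `reverse_distill_weight` and `warmup_steps` are all hypothetical.

```python
def reverse_distill_weight(step, warmup_steps):
    """Weight on the small-teacher term: 1.0 at step 0, 0.0 after warmup_steps.

    Linear decay is an assumed schedule, not taken from the talk.
    """
    if warmup_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / warmup_steps)


def large_model_loss(self_supervised_loss, distill_loss, step, warmup_steps):
    """Blend the large model's own objective with the small-teacher signal."""
    w = reverse_distill_weight(step, warmup_steps)
    return (1.0 - w) * self_supervised_loss + w * distill_loss


if __name__ == "__main__":
    for step in (0, 250, 500, 1000, 2000):
        print(step, reverse_distill_weight(step, warmup_steps=1000))
```

Once the weight reaches zero the large model trains purely on its own objective, so the small teacher cannot cap its final quality.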
Sample‑Value Judgment: A small fine‑tuned model estimates the usefulness of each training sample by comparing its own loss on that sample with the large model's loss. High‑value samples (large‑model loss ≫ small‑model loss) are prioritized for gradient updates, which reduces the number of training steps while maintaining or improving final performance.
Overall, the talk demonstrates that coupling large and small models through both forward and reverse distillation, as well as sample‑value selection, can substantially lower computational costs and enhance model effectiveness.
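The selection rule above can be sketched as a ranking over per‑sample loss gaps. This is a minimal illustration under assumptions: the talk gives the criterion (large‑model loss much greater than small‑model loss) but not the exact scoring or cutoff, so the plain difference and the `keep_ratio` parameter are hypothetical.

```python
def sample_values(large_losses, small_losses):
    """Score each sample by how much worse the large model does on it."""
    return [large - small for large, small in zip(large_losses, small_losses)]


def select_high_value(large_losses, small_losses, keep_ratio=0.5):
    """Return indices of the highest-value samples, in original order.

    keep_ratio is an assumed knob for the fraction of the batch to keep.
    """
    values = sample_values(large_losses, small_losses)
    k = max(1, int(len(values) * keep_ratio))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    large = [2.0, 0.3, 1.5, 0.2]   # per-sample loss of the large model
    small = [0.5, 0.4, 0.4, 0.3]   # per-sample loss of the small model
    print(select_high_value(large, small))
```

Samples where the small model already does well but the large model does not are the ones the large model can still learn from cheaply, which is why the gap, rather than the raw loss, is used as the score here.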
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
