Unveiling the Law of Capacity Gap: Boosting Language Model Distillation Efficiency
At ACL 2025, a collaborative paper introduced the Law of Capacity Gap, which states that the optimal teacher in language model distillation is roughly 2.5× the size of the student. The rule sharply cuts the compute needed to choose a teacher, and the resulting MiniMA model demonstrates the Pareto-optimal efficiency it enables.
Paper Award and Introduction
At the ACL 2025 conference, the paper “Towards the Law of Capacity Gap in Distilling Language Models” by the Xiaohongshu AI Search team and Prof. Dawei Song’s group received the Outstanding Paper Award.
Law of Capacity Gap
The study defines the “Law of Capacity Gap”, showing that the optimal teacher model size is approximately 2.5 times the student model size, turning the previously observed “curse of capacity gap” into a predictable linear rule and reducing the computational cost of finding the best teacher.
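As a rough illustration of how such a rule can be applied (a minimal sketch; the function name, the standalone constant, and the example numbers are ours, not from the paper's code):

```python
# Minimal sketch of the Law of Capacity Gap as a rule of thumb:
# the teacher that distills best is roughly 2.5x the student's size.
OPTIMAL_CAPACITY_RATIO = 2.5  # teacher params / student params

def predicted_optimal_teacher_size(student_params: float) -> float:
    """Return the teacher parameter count expected to distill best."""
    return OPTIMAL_CAPACITY_RATIO * student_params

# Example: a 1.3B-parameter student points to a ~3.25B-parameter teacher.
print(predicted_optimal_teacher_size(1.3e9))  # 3.25e9
```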
Understanding the Curse
When the capacity gap between teacher and student is too large, distillation degrades: the best-performing student does not necessarily come from the largest teacher. Without a rule for the right gap, finding the best teacher previously required expensive enumeration over many teacher sizes.
Experimental Methodology
Select a pre‑trained language model of a specific scale as the teacher.
Prune the teacher to a target sparsity to create a student model.
Perform knowledge distillation from teacher to student.
Observe how student performance varies with teacher scale (a toy sketch of this loop follows the list).
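To make the loop above concrete, here is a hedged toy sketch of prune-then-distill in PyTorch. It is not the paper's implementation: the real experiments prune and distill pre-trained transformer LMs, whereas the modules, shapes, pruning method, and hyperparameters below are placeholders chosen only to keep the sketch self-contained and runnable.

```python
# Hedged toy sketch of the prune-then-distill loop described above.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

vocab, hidden = 1000, 256

# 1) Teacher: stands in for a pre-trained LM of a chosen scale.
teacher = nn.Sequential(
    nn.Embedding(vocab, hidden),
    nn.Linear(hidden, hidden),
    nn.ReLU(),
    nn.Linear(hidden, vocab),
)
teacher.eval()

# 2) Student: a copy of the teacher pruned to a target sparsity
#    (simple unstructured magnitude pruning as a stand-in).
student = copy.deepcopy(teacher)
for module in student.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
temperature = 2.0

# 3) Knowledge distillation: match student predictions to teacher logits.
tokens = torch.randint(0, vocab, (8, 32))  # dummy token batch
with torch.no_grad():
    teacher_logits = teacher(tokens)
student_logits = student(tokens)

loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature**2
loss.backward()
optimizer.step()

# 4) Repeating this with teachers of different scales and comparing the
#    resulting students is what exposes the ~2.5x sweet spot.
```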
Key Findings
Across many small-scale experiments, the optimal teacher size consistently comes out at roughly 2.5× the student size.
Validation on Large Models
To test whether the law scales, MiniMA-3B was distilled from LLaMA-7B, LLaMA-13B, and LLaMA-70B teachers. The best student came from LLaMA-7B, the candidate closest to the predicted 2.5 × 3B ≈ 7.5B, confirming the rule (a small check follows below).
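As a small arithmetic check of the claim above (illustrative only; the parameter counts are the nominal model sizes):

```python
# Applying the 2.5x rule to the MiniMA-3B setting: the predicted optimal
# teacher is 2.5 * 3B = 7.5B parameters, and among the candidates tried,
# LLaMA-7B is the closest to that target.
student_params = 3e9
candidates = {"LLaMA-7B": 7e9, "LLaMA-13B": 13e9, "LLaMA-70B": 70e9}

target = 2.5 * student_params  # 7.5e9
best = min(candidates, key=lambda name: abs(candidates[name] - target))
print(best)  # -> LLaMA-7B
```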
The law-guided distillation achieved a state-of-the-art trade-off between compute cost and performance, placing MiniMA on the Pareto frontier for small language models.
Resources
GitHub repository: https://github.com/GeneZC/MiniMA
HuggingFace model: https://huggingface.co/GeneZC/MiniMA-3B
ArXiv paper: https://arxiv.org/abs/2311.07052
Xiaohongshu Tech REDtech
The official account of the Xiaohongshu technology team, sharing technical innovations and engineering insights.