Unveiling the Law of Capacity Gap: Boosting Language Model Distillation Efficiency

At ACL 2025, a collaborative paper introduced the Law of Capacity Gap, which shows that the optimal teacher in language model distillation is roughly 2.5 times the student's size. This linear rule dramatically cuts the compute cost of teacher selection, and the MiniMA model distilled under its guidance reaches a Pareto-optimal efficiency-performance trade-off.

Xiaohongshu Tech REDtech

Paper Award and Introduction

At the ACL 2025 conference, the paper “Towards the Law of Capacity Gap in Distilling Language Models” by the Xiaohongshu AI Search team and Prof. Dawei Song’s group received the Outstanding Paper Award.

Law of Capacity Gap

The study formulates the "Law of Capacity Gap": the optimal teacher model size is approximately 2.5 times the student model size. This turns the previously observed "curse of capacity gap" into a predictable linear rule and sharply reduces the computational cost of finding the best teacher.

Understanding the Curse

When the size gap between teacher and student models is too large, the best‑performing student does not necessarily come from the largest teacher, leading to expensive enumeration of teacher models.

Experimental Methodology

1. Select a pre-trained language model of a specific scale as the teacher.
2. Prune the teacher to a target sparsity to create a student model.
3. Perform knowledge distillation from teacher to student.
4. Observe how student performance varies with teacher scale.
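The distillation step above can be sketched as a temperature-softened KL divergence between teacher and student output distributions. This is a minimal NumPy illustration of the standard technique, not the paper's actual training code; the function names and the temperature value are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the vocabulary axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by T^2 so gradient magnitudes stay comparable across
    temperatures (the usual correction in soft-label distillation).
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (temperature ** 2) * kl.mean()

# Identical logits give zero loss; diverging logits give a positive loss.
t = np.array([[2.0, 1.0, 0.1]])
print(distillation_loss(t, t))                             # ~0.0
print(distillation_loss(t, np.array([[0.1, 1.0, 2.0]])))   # > 0
```

In practice this loss is minimized with respect to the student's parameters while the teacher is frozen, often mixed with the ordinary cross-entropy on ground-truth tokens.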

Key Findings

Across many small‑scale experiments, the optimal teacher size consistently equals about 2.5 × the student size, as illustrated in the following diagram.

Validation on Large Models

To test whether the law scales, MiniMA-3B was distilled from LLaMA-7B, LLaMA-13B, and LLaMA-70B teachers. The best student came from LLaMA-7B, consistent with the 2.5× rule: the 7.5B target implied by a 3B student makes 7B (about 2.3×) the closest available candidate.
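Used as a selection heuristic, the law reduces teacher search from costly enumeration to a one-line computation. A hypothetical sketch (the function name is an assumption; the candidate sizes mirror the LLaMA teachers above):

```python
def pick_teacher(student_size_b, candidate_sizes_b, ratio=2.5):
    """Pick the candidate whose size (in billions of parameters)
    is closest to ratio * student size, per the Law of Capacity Gap."""
    target = ratio * student_size_b
    return min(candidate_sizes_b, key=lambda size: abs(size - target))

# For a 3B student among LLaMA-7B/13B/70B, the target is 7.5B,
# so the 7B teacher is selected, matching the paper's observed optimum.
print(pick_teacher(3, [7, 13, 70]))  # 7
```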

The law‑guided distillation achieved state‑of‑the‑art compute‑efficiency and performance trade‑offs, reaching the Pareto frontier for small language models.

Resources

GitHub repository: https://github.com/GeneZC/MiniMA

HuggingFace model: https://huggingface.co/GeneZC/MiniMA-3B

ArXiv paper: https://arxiv.org/abs/2311.07052

Tags: distillation, language models, capacity gap, MiniMA, artificial intelligence, scaling law
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
