Network Intelligence Research Center (NIRC)
Mar 10, 2025 · Artificial Intelligence
Revisiting Knowledge Distillation for Autoregressive Language Models
This article analyzes why larger teacher models can hurt student performance in autoregressive language model distillation, shows that different tokens call for distinct teaching modes, proposes an Adaptive Token-wise Knowledge Distillation (ATKD) method, and demonstrates through extensive experiments that ATKD consistently improves accuracy by about 3% and strengthens generalization across model sizes.
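The core idea of token-wise adaptive distillation can be illustrated with a minimal sketch. This is not the paper's exact formulation: the weighting rule here (blending the distillation term with ground-truth cross-entropy by the teacher's normalized entropy) and the function names are illustrative assumptions, chosen only to show how a per-token teaching mode might work.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a single token's logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(p || q) for two probability vectors of equal length.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tokenwise_adaptive_loss(teacher_logits, student_logits, gold_ids):
    """Illustrative token-wise adaptive distillation loss (NOT the exact
    ATKD objective). For each position, blend the distillation term
    (KL to the teacher) with ground-truth cross-entropy, weighted by the
    teacher's confidence: on "easy" tokens (low teacher entropy) lean on
    the teacher distribution; on "hard" tokens lean on the gold label.
    """
    total = 0.0
    vocab = len(teacher_logits[0])
    max_ent = math.log(vocab)  # entropy of the uniform distribution
    for t_log, s_log, gold in zip(teacher_logits, student_logits, gold_ids):
        t_p, s_p = softmax(t_log), softmax(s_log)
        w = 1.0 - entropy(t_p) / max_ent   # confidence weight in [0, 1]
        ce = -math.log(s_p[gold])          # cross-entropy on the gold token
        total += w * kl_div(t_p, s_p) + (1.0 - w) * ce
    return total / len(gold_ids)
```

A full implementation would operate on batched logit tensors and fold this per-token weighting into the training loop, but the sketch captures the key point the article makes: the teaching signal is chosen per token rather than applied uniformly across the sequence.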
adaptive teaching · autoregressive language models · knowledge distillation
