Network Intelligence Research Center (NIRC)
Mar 10, 2025 · Artificial Intelligence

Revisiting Knowledge Distillation for Autoregressive Language Models

The article analyzes why larger teacher models can hurt student performance in autoregressive language model distillation, shows that different tokens call for distinct teaching modes, and proposes Adaptive Token-wise Knowledge Distillation (ATKD). Extensive experiments show that ATKD consistently improves accuracy by about 3% and strengthens generalization across model sizes.
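To make the idea of token-wise adaptive teaching concrete, here is a minimal sketch, not the paper's actual ATKD formulation: each token blends the distillation (KL) term and the hard-label cross-entropy term according to the teacher's confidence at that position. The weighting rule `w = 1 - H(teacher)/log(V)` is an illustrative assumption, as are the function names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL divergence KL(p || q); softmax guarantees q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def adaptive_token_kd_loss(teacher_logits, student_logits, targets):
    """Per-token adaptive KD loss (illustrative sketch, not the ATKD paper's rule).

    For each token, w = 1 - H(teacher)/log(V) lies in [0, 1]: confident
    teacher tokens lean on distillation, uncertain ones on the hard label.
    """
    vocab_size = len(teacher_logits[0])
    total = 0.0
    for t_log, s_log, y in zip(teacher_logits, student_logits, targets):
        p = softmax(t_log)                        # teacher distribution
        q = softmax(s_log)                        # student distribution
        w = 1.0 - entropy(p) / math.log(vocab_size)  # teacher confidence
        ce = -math.log(q[y])                      # hard-label cross-entropy
        total += w * kl(p, q) + (1.0 - w) * ce
    return total / len(targets)
```

For a uniform (maximally uncertain) teacher distribution, `w` falls to 0 and the token is trained purely on the ground-truth label; for a near-one-hot teacher, `w` approaches 1 and distillation dominates.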

adaptive teaching · autoregressive language models · knowledge distillation