Network Intelligence Research Center (NIRC)
Mar 10, 2025 · Artificial Intelligence
Revisiting Knowledge Distillation for Autoregressive Language Models
This article analyzes why larger teacher models can hurt student performance in autoregressive language model distillation, shows that different tokens call for distinct teaching modes, proposes an Adaptive Token-wise Knowledge Distillation (ATKD) method, and demonstrates through extensive experiments that ATKD consistently improves accuracy by about 3% and strengthens generalization across model sizes.
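The core idea of token-wise adaptive distillation can be illustrated with a minimal sketch. This is not the paper's exact formulation: the weighting rule here (blending the distillation term with ground-truth cross-entropy by the teacher's normalized entropy) and the function names are illustrative assumptions, chosen only to show how a per-token teaching mode might work.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a single token's logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(p || q) for two probability vectors of equal length.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tokenwise_adaptive_loss(teacher_logits, student_logits, gold_ids):
    """Illustrative token-wise adaptive distillation loss (NOT the exact
    ATKD objective). For each position, blend the distillation term
    (KL to the teacher) with ground-truth cross-entropy, weighted by the
    teacher's confidence: on "easy" tokens (low teacher entropy) lean on
    the teacher distribution; on "hard" tokens lean on the gold label.
    """
    total = 0.0
    vocab = len(teacher_logits[0])
    max_ent = math.log(vocab)  # entropy of the uniform distribution
    for t_log, s_log, gold in zip(teacher_logits, student_logits, gold_ids):
        t_p, s_p = softmax(t_log), softmax(s_log)
        w = 1.0 - entropy(t_p) / max_ent   # confidence weight in [0, 1]
        ce = -math.log(s_p[gold])          # cross-entropy on the gold token
        total += w * kl_div(t_p, s_p) + (1.0 - w) * ce
    return total / len(gold_ids)
```

A full implementation would operate on batched logit tensors and fold this per-token weighting into the training loop, but the sketch captures the key point the article makes: the teaching signal is chosen per token rather than applied uniformly across the sequence.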
adaptive teaching · autoregressive language models · knowledge distillation
