Revisiting Knowledge Distillation for Autoregressive Language Models
The article analyzes why larger teacher models can hurt student performance in autoregressive language model distillation, reveals that different tokens call for distinct teaching modes, proposes an Adaptive Token-wise Knowledge Distillation (ATKD) method, and shows through extensive experiments that ATKD delivers consistent accuracy gains of up to +3.04% on average and improves generalization across model sizes.
Background
Autoregressive language models such as GPT and LLaMA achieve strong results but become costly to run as they scale. Knowledge distillation (KD) trains a smaller student to mimic a larger teacher, reducing inference cost. Recent works (e.g., f‑DISTILL, GKD) propose new KD algorithms for these models.
Analysis and Findings
Experiments reveal a counter-intuitive phenomenon: a larger teacher does not always yield a better student, and student performance can drop sharply when the teacher-student capacity gap grows too large. The authors attribute this to the fact that different tokens have different learning difficulties, so a single, uniform teaching mode applied to every token is sub-optimal.
The study decomposes the classic token‑level KD loss (forward KL) into two components:
Target-oriented KD (TKD): forces the student to learn information related to the target (correct) token.
Diversity-oriented KD (DKD): encourages the student to absorb diverse knowledge from the non-target tokens.
These components are coupled by a token-wise factor called the uncertainty coefficient (UNC), which reflects the teacher's uncertainty for each token, as formalized below.
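In the decoupled-KD style of reformulation the paper builds on, the decomposition can be sketched as follows (the symbols $b$, $\hat{p}$, and $y$ are my notation for illustration, not necessarily the paper's):

$$
\mathrm{KL}\big(p^{T} \,\|\, p^{S}\big)
= \underbrace{\mathrm{KL}\big(b^{T} \,\|\, b^{S}\big)}_{\text{TKD}}
+ \underbrace{\big(1 - p^{T}_{y}\big)}_{\text{UNC}}
\cdot \underbrace{\mathrm{KL}\big(\hat{p}^{T} \,\|\, \hat{p}^{S}\big)}_{\text{DKD}}
$$

where $p^{T}$ and $p^{S}$ are the teacher and student next-token distributions, $y$ is the target token, $b = (p_{y},\, 1 - p_{y})$ is the binary target/non-target split, and $\hat{p}$ renormalizes the probabilities over the non-target tokens.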
Theoretical Analysis
The authors formalize the classic KD objective as a sum of a binary loss for the target class and a KL loss for the non‑target distribution, linked by the UNC factor. They argue that UNC measures token learning difficulty: hard‑to‑learn tokens have higher UNC and should receive more attention.
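As a concrete illustration, here is a minimal PyTorch sketch of how per-token UNC could be computed from teacher outputs (the function name and tensor shapes are my assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def token_uncertainty(teacher_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """UNC = 1 - p_teacher(target), computed per token.

    teacher_logits: (seq_len, vocab_size) next-token logits from the teacher.
    targets:        (seq_len,) ground-truth token ids.
    Returns a (seq_len,) tensor; higher values flag harder-to-learn tokens.
    """
    probs = F.softmax(teacher_logits, dim=-1)                       # teacher distribution p^T
    p_target = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # p^T_y at each position
    return 1.0 - p_target
```

A token the teacher predicts confidently (p^T_y near 1) gets UNC near 0, while an ambiguous token gets UNC near 1.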
ATKD Method
To address the limitation of a uniform teaching mode, the paper proposes Adaptive Token-wise Knowledge Distillation (ATKD). ATKD ranks the tokens in each mini-batch by UNC and treats the top-k hardest tokens as "difficult" and the rest as "easy". For easy tokens, ATKD skips TKD and relies only on DKD; for difficult tokens, it applies both TKD and DKD. This decouples the two losses, so an over-confident teacher's low UNC no longer suppresses the diversity-oriented signal on the tokens that need it.
The overall ATKD objective can thus be expressed as a combination of TKD and DKD whose per-token contributions are determined by UNC-based token difficulty, as sketched below.
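The following PyTorch sketch puts the pieces together under the decomposition above; the hard_ratio hyperparameter, the unweighted sum of the two parts, and all names are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def atkd_loss(student_logits, teacher_logits, targets, hard_ratio=0.5, eps=1e-8):
    """Sketch of Adaptive Token-wise KD over one sequence.

    Splits token-level KD into TKD (binary target/non-target match) and DKD
    (KL over the renormalized non-target distribution), ranks tokens by
    UNC = 1 - p^T_y, then applies DKD alone to easy tokens and TKD + DKD to
    the top `hard_ratio` fraction of hardest tokens. teacher_logits are
    assumed detached (teacher runs under torch.no_grad()).
    """
    p_t = F.softmax(teacher_logits, dim=-1)            # (S, V) teacher probs
    p_s = F.softmax(student_logits, dim=-1)            # (S, V) student probs

    idx = targets.unsqueeze(-1)                        # (S, 1) target ids
    pt_y = p_t.gather(-1, idx).squeeze(-1)             # teacher prob of target
    ps_y = p_s.gather(-1, idx).squeeze(-1)             # student prob of target
    unc = 1.0 - pt_y                                   # uncertainty coefficient

    # TKD: KL between the binary (target, non-target) distributions.
    tkd = pt_y * torch.log((pt_y + eps) / (ps_y + eps)) \
        + unc * torch.log((unc + eps) / (1.0 - ps_y + eps))

    # DKD: KL between distributions renormalized over non-target tokens.
    mask = torch.ones_like(p_t).scatter_(-1, idx, 0.0)  # zero out target column
    q_t = p_t * mask / (unc.unsqueeze(-1) + eps)
    q_s = p_s * mask / ((1.0 - ps_y).unsqueeze(-1) + eps)
    dkd = (q_t * torch.log((q_t + eps) / (q_s + eps))).sum(dim=-1)

    # Rank tokens by UNC: only the top-k hardest also receive TKD.
    k = max(1, int(hard_ratio * unc.numel()))
    hard = torch.zeros_like(unc, dtype=torch.bool)
    hard[unc.topk(k).indices] = True

    per_token = torch.where(hard, tkd + dkd, dkd)
    return per_token.mean()
```

In training, teacher_logits would come from a frozen teacher forward pass, and padding positions would be masked out before averaging.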
ATKD Evaluation
ATKD mitigates the performance drop observed with larger teachers. For example, when distilling OPT, a 1.3B student taught by a 6.7B teacher reaches 40.00% accuracy with ATKD, versus 38.73% without it.
Across model sizes and architectures (OPT, Pythia, LLaMA), ATKD yields stable and significant gains, with an average improvement of up to +3.04%.
ATKD benefits all baseline KD methods; e.g., Reverse KD and ImitKD gain +1.80% and +1.36%, respectively, when combined with ATKD.
Conclusion
The work uncovers a limitation of applying the same teaching mode to all tokens during KD of large autoregressive teachers and shows that ignoring token‑wise difficulty leads to sub‑optimal performance. The proposed ATKD algorithm adaptively skips target‑oriented teaching for easy tokens and emphasizes diverse learning for hard tokens, resulting in consistent accuracy improvements and better generalization across diverse language model families.
Network Intelligence Research Center (NIRC)
NIRC is based at the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems), dedicated to solving real-world problems, creating top-tier systems, publishing high-impact papers, and contributing to the rapid advancement of China's network technology.