
An Introduction to Knowledge Distillation for Model Compression

This article explains the AI model‑compression technique of knowledge distillation, describing how a large teacher network transfers its soft predictions to a lightweight student network using temperature‑scaled softmax, enabling deployment on resource‑constrained devices.

Rare Earth Juejin Tech Community

During deep-learning model training and deployment, large model sizes often cause practical difficulties; this article introduces an important AI model-compression technique: knowledge distillation.

Knowledge distillation is a teacher‑student training method that transfers knowledge from a large teacher model to a lightweight student network, enabling model compression and deployment. It is widely used in computer vision, NLP, multimodal learning, and large pre‑trained models.

Just as distilling impure water yields pure water, knowledge distillation extracts and concentrates the teacher’s knowledge into a smaller student model.

1. Introduction to Knowledge Distillation

Understanding knowledge distillation requires basic machine‑learning concepts such as cross‑entropy loss, softmax, and gradient‑based training; a recommended bilingual video by Andrew Ng is linked.

The process can be viewed as extracting knowledge from a bulky, high‑performance teacher network and transferring it to a compact student network.

In other words, a large, high‑capacity teacher network teaches its knowledge to a small, lightweight student network.

The teacher network passes knowledge to the student network—a process called distillation or transfer. If the student learns well, it can replace the teacher in resource‑constrained environments.

Why make the network small? Large models are too heavy for deployment on limited‑capacity devices such as mobile phones, wearables, TVs, autonomous vehicles, and surveillance cameras.

Thus, knowledge distillation aims to shrink large models for deployment on edge devices.

2. Core Principles of Knowledge Distillation

Foundational paper: https://arxiv.org/pdf/1503.02531.pdf (highly recommended)

The seminal paper, authored by Hinton and others, not only created the field but also offers exemplary writing style.

It highlights the mismatch between training, where large models achieve the best accuracy, and deployment, which demands fast, lightweight models.

2.1 Knowledge Representation and Transfer

When feeding an image of a horse to a classifier, the network outputs probabilities for many classes (horse, cat, dog, etc.).

Training with only the correct class yields hard targets, which assign probability 1 to the correct class and 0 to all others, an oversimplified view of the data.

Soft targets, however, provide a distribution (e.g., horse 0.7, donkey 0.25, car 0.05), conveying richer relational information.

Therefore, we train the teacher with hard targets, then use its soft targets to train the student.

Soft targets can be softened further by introducing a temperature T; higher T makes the distribution flatter, exposing more information about non‑correct classes.
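To make the contrast concrete, here is a toy sketch; the class order and probabilities are made-up values matching the horse example above:

```python
# Toy illustration of hard vs. soft targets for classes [horse, donkey, car].
hard_target = [1.0, 0.0, 0.0]    # one-hot: all probability mass on the true class
soft_target = [0.7, 0.25, 0.05]  # teacher output: donkey is "more horse-like" than car

# The hard target says nothing about inter-class similarity; the soft target
# encodes that a horse resembles a donkey far more than it resembles a car.
similarity_ratio = soft_target[1] / soft_target[2]
print(similarity_ratio)  # 5.0: the teacher finds "donkey" 5x more plausible than "car"
```

It is exactly this relational information, absent from the one-hot label, that the student can learn from.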

2.2 Distillation Temperature T

Temperature T is applied by dividing the logits by T before the softmax.

When T = 1, the softmax is unchanged; larger T produces softer probabilities, while an excessively large T yields a uniform distribution.

Increasing T therefore transforms hard labels into softer ones, allowing the student to learn richer knowledge.
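A minimal sketch of the temperature-scaled softmax described above (the logits are made-up values for the three-class example):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax: divide the logits by T before exponentiating."""
    scaled = [z / T for z in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, -1.0]  # hypothetical logits for [horse, donkey, car]

# Higher T flattens the distribution, exposing more of the non-correct classes.
for T in (1.0, 4.0, 20.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

At T = 1 nearly all the mass sits on the top class; at a very large T the output approaches uniform, matching the behavior described above.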

2.3 Knowledge Distillation Process

The overall procedure:

1. Obtain a pretrained teacher network and feed it data to obtain its softmax output at temperature T.
2. Feed the same data to the student network (which may be untrained) and obtain its softmax output at the same temperature T.
3. Compute a loss between the teacher's and student's softened outputs (the distillation loss) to align them.
4. Compute a standard cross-entropy loss between the student's softmax at T = 1 and the ground-truth hard labels.
5. Combine the distillation loss and the hard-label loss as a weighted sum to form the final objective.

The goal is to fine‑tune the student’s weights via gradient descent and back‑propagation to minimize the combined loss.
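The steps above can be sketched as a dependency-free numeric toy. The logits, the temperature T = 4, and the weight alpha = 0.7 are made-up values, and real implementations typically use a framework such as PyTorch; the T² factor follows Hinton et al., who use it to keep gradient magnitudes comparable when T is large:

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    """H(target, probs) = -sum_i target_i * log(probs_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Assumed toy values: teacher/student logits and a one-hot label for 3 classes.
teacher_logits = [5.0, 2.0, -1.0]
student_logits = [3.0, 2.5, 0.0]
hard_label = [1.0, 0.0, 0.0]
T, alpha = 4.0, 0.7  # temperature and distillation weight (hyperparameters)

# Steps 1-3: distillation loss between the softened teacher and student outputs.
soft_teacher = softmax(teacher_logits, T)
soft_student = softmax(student_logits, T)
distill_loss = cross_entropy(soft_teacher, soft_student) * T * T  # T^2 rescaling

# Step 4: standard hard-label loss on the student's softmax at T = 1.
student_loss = cross_entropy(hard_label, softmax(student_logits, T=1.0))

# Step 5: weighted sum as the final objective to minimize by gradient descent.
total_loss = alpha * distill_loss + (1 - alpha) * student_loss
print(round(total_loss, 4))
```

In practice only the student's weights are updated; the teacher's outputs serve as fixed targets during distillation.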


With this understanding, the next article will present code examples and additional details.

If you found this helpful, please like, share, and follow.

Artificial Intelligence · deep learning · model compression · knowledge distillation · soft targets · teacher-student network
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
