Can Multi‑Teacher Distillation Overcome Catastrophic Forgetting in Continual Learning?

This paper proposes a multi‑teacher distillation framework for continual learning that combines active data rehearsal with feature‑decoupled distillation. It demonstrates superior performance on the PASCAL VOC and COCO benchmarks while mitigating catastrophic forgetting and balancing the stability‑plasticity trade‑off.


Abstract

Continual learning (also called lifelong learning) aims to enable deep neural networks to learn new tasks without catastrophically forgetting previously acquired knowledge. The authors propose Multi‑Teacher Distillation (MTD), which combines active data rehearsal with multi‑teacher knowledge distillation to improve both retention of old classes and acquisition of new ones.

1. Background

Continual‑learning methods are usually classified into three groups:

Sample‑replay: store a subset of old data and train jointly with new data.

Parameter‑regularization: constrain important weights (e.g., EWC, LwF).

Parameter‑isolation: allocate separate parameters per task.

Each approach trades off memory consumption, forgetting mitigation, and computational efficiency.

2. Proposed Method: Multi‑Teacher Distillation (MTD)

MTD consists of two core components.

Active Data Rehearsal (A): an active‑learning selector chooses a compact set of representative samples from the historical dataset. Selection is based on a minimal feature‑map structural‑similarity criterion, which reduces storage and computation while preserving the most informative old examples.

Multi‑Teacher Distillation (B): a student detector is trained under the guidance of two teachers, a base model trained on the old tasks and an expert model trained on the new task. Feature‑decoupled distillation computes three region‑wise losses (base‑target, new‑target, background) and adds them to the standard detection loss.

Training Pipeline

Select representative rehearsal samples via the active‑learning criterion.

Compute the multi‑teacher distillation loss using the base and expert models.

Compute the conventional detection loss between student predictions and ground‑truth labels.

Back‑propagate the summed loss to update the student model (a sketch of one such training step is shown below).
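
The sketch below condenses the four steps into a single update. It is illustrative only: student, base_teacher, expert_teacher, detection_loss, and distill_loss are assumed interfaces, not the authors' released code.

```python
import torch

def train_step(student, base_teacher, expert_teacher, batch, optimizer,
               detection_loss, distill_loss):
    """One MTD update: detection loss on ground truth plus the
    multi-teacher distillation loss, back-propagated together."""
    images, targets = batch                            # rehearsal set R merged with new-task data
    with torch.no_grad():                              # both teachers stay frozen
        base_feat = base_teacher.backbone(images)      # old-task teacher features
        expert_feat = expert_teacher.backbone(images)  # new-task teacher features
    student_feat = student.backbone(images)
    preds = student.head(student_feat)

    loss = detection_loss(preds, targets) + \
           distill_loss(student_feat, base_feat, expert_feat, targets)

    optimizer.zero_grad()
    loss.backward()                                    # step 4: only the student is updated
    optimizer.step()
    return loss.item()
```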

Figure: MTD training pipeline

Active Data Rehearsal Details

The selector evaluates the structural similarity between the feature maps of candidate samples and the current feature distribution. The most diverse samples are kept as the rehearsal set, denoted R. This set is merged with the new‑task data D_{new} to form the final training set D = R \cup D_{new}.
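
As a rough illustration of this selection step, the sketch below scores candidates by cosine similarity to the pool's mean pooled feature, used here as a stand‑in for the paper's structural‑similarity measure; backbone, candidates, and budget are hypothetical names rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def select_rehearsal_set(backbone, candidates, budget):
    """Keep the `budget` old samples whose pooled feature maps are least
    similar to the pool mean, i.e. the most diverse ones (rehearsal set R)."""
    backbone.eval()
    with torch.no_grad():
        # Pool each feature map down to one descriptor per image.
        feats = torch.stack([
            F.adaptive_avg_pool2d(backbone(img.unsqueeze(0)), 1).flatten()
            for img in candidates
        ])
    centroid = feats.mean(dim=0, keepdim=True)
    similarity = F.cosine_similarity(feats, centroid, dim=1)
    keep = similarity.argsort()[:budget]          # minimal-similarity criterion
    return [candidates[i] for i in keep.tolist()]
```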

Multi‑Teacher Distillation Formulation

L_{total} = L_{det} + \alpha L_{base} + \beta L_{expert} + \gamma L_{bg}

where L_{det} is the standard detection loss, L_{base} and L_{expert} are L2 distances between student features and the base/expert teacher features on their respective target regions, and L_{bg} handles background regions. The coefficients \alpha, \beta, \gamma balance the three distillation terms.
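
A minimal sketch of computing this combined loss follows. It assumes binary region masks for the old‑class, new‑class, and background areas have already been derived from the annotations; since the summary does not specify which teacher supervises the background term, the base model is used there as an assumption.

```python
import torch

def mtd_loss(det_loss, student_feat, base_feat, expert_feat,
             base_mask, new_mask, bg_mask, alpha=1.0, beta=1.0, gamma=1.0):
    """L_total = L_det + alpha*L_base + beta*L_expert + gamma*L_bg,
    each distillation term being a masked L2 distance between student and
    teacher feature maps (masks have shape (N, 1, H, W))."""
    def masked_l2(s, t, mask):
        # Average the squared feature difference over the masked region only.
        return (((s - t) ** 2) * mask).sum() / mask.sum().clamp(min=1.0)

    l_base   = masked_l2(student_feat, base_feat,   base_mask)  # old-class regions
    l_expert = masked_l2(student_feat, expert_feat, new_mask)   # new-class regions
    l_bg     = masked_l2(student_feat, base_feat,   bg_mask)    # background (assumed base teacher)
    return det_loss + alpha * l_base + beta * l_expert + gamma * l_bg
```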

3. Experiments

MTD is evaluated on two incremental object‑detection benchmarks.

PASCAL VOC 2007

20 classes are split into three incremental settings: 10+10, 15+5, and 19+1. Results (mean average precision, mAP) are:

10+10: 69.0% (old) / 69.9% (new)

15+5: 71.2% (old) / 59.6% (new)

19+1: 74.3% (old) / 73.2% (new)

All settings outperform prior continual‑learning baselines.

Microsoft COCO 2017

Using a 40+40 split (first 40 classes, then 40 new classes), MTD achieves 32.2% overall mAP, surpassing existing methods.

Ablation Study

Three distillation configurations are compared:

MTD:B: only the base model acts as teacher.

MTD:E: only the expert model acts as teacher.

MTD:B+E: both teachers are used (full MTD).

Across all settings, the combined multi‑teacher setup (MTD:B+E) yields the highest mAP (e.g., 71.2% in the 15+5 scenario), confirming the benefit of leveraging both old‑knowledge and new‑knowledge teachers.

Figure: Ablation results

4. Conclusion

MTD effectively mitigates catastrophic forgetting while preserving inference efficiency. The framework is compatible with any convolutional backbone and is suitable for edge‑device deployment. Future work includes scaling the distillation scheme to large‑scale and multimodal models and exploring real‑world deployments in smart campuses, logistics, and power‑plant monitoring.

