Why Attention Transfer Fails for DINOv2 and Other Modern ViTs: Architecture Mismatch Revealed
A large-scale benchmark of 20 pretrained ViT teachers across 11 families shows that Attention Copy and Attention Distillation improve some models but hurt others, notably DINOv2, CLIP, and BEiTv2. The cause is an architecture mismatch between teacher and student: adding the teacher's native components to the student restores the lost performance.
Core Issue: Attention Transfer Is Not a Universal Remedy
Knowledge distillation often assumes that copying a teacher's attention maps to a randomly initialized student can fully transfer the pretrained advantages of Vision Transformers (ViTs). Earlier work on a few models suggested this might be true.
The study summarized here tests that assumption on 20 pretrained teacher models from 11 well-known ViT families. Two attention-transfer methods, Attention Copy and Attention Distillation, were compared against a no-transfer baseline. The results, visualized in Figure 1, show a clear split: seven families gain Top-1 accuracy, while four families (including DINOv2, CLIP, and BEiTv2) drop by about 5.1% relative to training from scratch.
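The two transfer methods can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head setup, the tensor shapes, the function names, and the use of cross-entropy as the distillation loss are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_maps(q, k):
    """Row-stochastic attention maps softmax(Q K^T / sqrt(d))."""
    d = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d))

def attention_copy(teacher_attn, student_v):
    """Attention Copy (sketch): the student routes its own values
    through the teacher's attention maps instead of computing its own."""
    return teacher_attn @ student_v

def attention_distill_loss(student_attn, teacher_attn, eps=1e-9):
    """Attention Distillation (sketch): cross-entropy between teacher and
    student maps, summed over keys and averaged over query rows."""
    return float(-(teacher_attn * np.log(student_attn + eps)).sum(axis=-1).mean())

# Toy single-head example; all sizes are arbitrary.
rng = np.random.default_rng(0)
n_tokens, dim = 4, 8
t_q, t_k = rng.normal(size=(n_tokens, dim)), rng.normal(size=(n_tokens, dim))
s_q, s_k, s_v = (rng.normal(size=(n_tokens, dim)) for _ in range(3))

teacher_attn = attention_maps(t_q, t_k)
student_attn = attention_maps(s_q, s_k)
copied_out = attention_copy(teacher_attn, s_v)
distill_loss = attention_distill_loss(student_attn, teacher_attn)
```

In this sketch, Attention Copy fixes where the student looks, while Attention Distillation only nudges the student's own maps toward the teacher's through an auxiliary loss.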
Why the Failure Persists Across Settings
Consistent Failure Across Epochs, Datasets, and OOD Scenarios
Extending training to 100 epochs confirmed that successful families (DeiT‑S, DINO‑S) retain a modest gain, while failing families (DINOv2‑S, DINOv2‑wr‑S) start negative and never recover. The same pattern appears on three yearly iNaturalist splits (+15.2 % / +19.4 % for DeiT/DINO vs. –1.2 % / –2.2 % for DINOv2) and on four out‑of‑distribution benchmarks (IN‑A, IN‑R, IN‑S, IN‑V2), where DINOv2‑based models consistently underperform the baseline.
Component‑Level Diagnosis: Attention Path Is the Culprit
To locate the failure, the authors performed a component‑wise ablation on three representative teachers (DeiT‑S, DINOv2‑S, CLIP‑B). They selectively re‑initialized only the attention module or only the MLP module while keeping the rest of the pretrained weights.
Re-initializing only the MLP yields a stable gain (+5.7% to +13.3%) for all families. Re-initializing only the attention module preserves the success/failure split: it improves DeiT-S but harms DINOv2-S and CLIP-B. This demonstrates that the failure is rooted in the attention pathway.
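The selective re-initialization described above can be sketched with a toy checkpoint. The parameter names (`blocks.0.attn.qkv.weight` and so on), the Gaussian initializer, and its standard deviation are hypothetical; the study's actual code may organize and initialize weights differently.

```python
import random

def reinit_component(pretrained, component, seed=0, std=0.02):
    """Return a copy of `pretrained` in which every parameter whose name
    contains `.component.` is replaced by Gaussian noise; others are kept."""
    rng = random.Random(seed)
    out = {}
    for name, values in pretrained.items():
        if f".{component}." in name:
            out[name] = [rng.gauss(0.0, std) for _ in values]
        else:
            out[name] = list(values)
    return out

# Toy "checkpoint": flat lists stand in for weight tensors (hypothetical names).
pretrained = {
    "blocks.0.attn.qkv.weight": [1.0, 1.0, 1.0],
    "blocks.0.mlp.fc1.weight": [2.0, 2.0, 2.0],
    "blocks.1.attn.qkv.weight": [3.0, 3.0, 3.0],
    "blocks.1.mlp.fc1.weight": [4.0, 4.0, 4.0],
}

attn_reset = reinit_component(pretrained, "attn")  # keeps pretrained MLP weights
mlp_reset = reinit_component(pretrained, "mlp")    # keeps pretrained attention weights
```

Training each variant and comparing it with the from-scratch baseline isolates which pathway carries the helpful or harmful pretrained signal.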
Layer‑wise Analysis Confirms Broad Failure
Further experiments transferred k attention layers either from the bottom or the top of DINOv2‑S. Every tested subset produced lower accuracy than the no‑transfer baseline, with top‑layer transfers causing larger drops. Thus the problem is not confined to a few layers but spans the entire attention stack.
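As a small illustration of the layer-wise setup, the helper below picks which layer indices receive transferred attention. The function name, the 12-layer depth, and the convention that "bottom" means the layers closest to the input are assumptions of this sketch.

```python
def layers_to_transfer(num_layers, k, side):
    """Indices of the k layers whose attention is transferred,
    counted from the input side ("bottom") or the output side ("top")."""
    if not 0 <= k <= num_layers:
        raise ValueError("k must be between 0 and num_layers")
    if side == "bottom":
        return list(range(k))
    if side == "top":
        return list(range(num_layers - k, num_layers))
    raise ValueError("side must be 'bottom' or 'top'")

# Assumed depth of 12 layers for a small ViT.
bottom_three = layers_to_transfer(12, 3, "bottom")
top_three = layers_to_transfer(12, 3, "top")
```

Sweeping `k` for both sides gives the grid of subsets that the layer-wise experiment compares against the no-transfer baseline.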
Root Cause: Architecture Mismatch Between Teacher and Student
Modern ViT families (DINOv2, CLIP, BEiTv2) incorporate extra components such as LayerScale, Pre‑LayerNorm, and relative positional bias, whereas the standard student ViT uses the simplest architecture without these features. When the teacher’s attention patterns are forced onto a mismatched student, the attention‑routing channels become misaligned.
To test this hypothesis, the authors replaced the standard student architecture with the teacher’s native architecture (keeping the extra components randomly initialized) and repeated attention transfer. All four previously failing families switched from negative to positive gains (Figure 4).
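To make one of these native components concrete, here is a minimal sketch of a LayerScale residual branch next to a plain one. The linear stand-in for the attention sub-layer and the initial scale of 1e-5 are assumptions for illustration, not values taken from the study.

```python
import numpy as np

def plain_residual(x, branch):
    """Residual update of a plain block: x + f(x)."""
    return x + branch(x)

def layerscale_residual(x, branch, gamma):
    """Residual update with LayerScale: x + gamma * f(x),
    where gamma is a learnable per-channel scale."""
    return x + gamma * branch(x)

rng = np.random.default_rng(0)
tokens, dim = 4, 8
x = rng.normal(size=(tokens, dim))
w = rng.normal(size=(dim, dim))

def branch(h):
    # Linear map standing in for the attention sub-layer (assumption).
    return h @ w

gamma_init = np.full(dim, 1e-5)  # assumed small initial scale
y_plain = plain_residual(x, branch)
y_scaled = layerscale_residual(x, branch, gamma_init)
```

With a small `gamma`, the branch output barely perturbs the residual stream at the start of training, so the same attention maps feed a differently scaled signal than in a plain block. That is one way a teacher's attention patterns could depend on components the standard student lacks.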
Importantly, the added components carry no pretrained knowledge on their own, because they are randomly initialized. In a control experiment, a native-architecture student trained from scratch underperformed the standard student by 2.3 to 3.0 percentage points, which indicates that the performance boost stems from unlocking the teacher's attention patterns rather than from extra capacity.
Alternative Explanations Ruled Out
The authors examined whether the loss function or the pretraining recipe could explain the failure. Experiments varying the distillation loss (mean squared error, MSE, vs. cross-entropy, CE) and sweeping the loss weight λ showed the same monotonic trends: successful teachers improve with larger λ, while failing teachers degrade further, regardless of loss type. Additional tests with Jensen-Shannon divergence (JSD) and L1 losses yielded identical patterns, confirming that loss choice is not the cause.
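The four loss choices and the weight λ can be sketched as follows. The additive form `task_loss + lam * attn_loss`, the reductions (mean over entries or over query rows), and the function names are assumptions; the source only states that the loss type and λ were varied.

```python
import numpy as np

def mse(p, q):
    return float(((p - q) ** 2).mean())

def l1(p, q):
    return float(np.abs(p - q).mean())

def ce(p, q, eps=1e-9):
    """Cross-entropy with teacher maps p as targets and student maps q."""
    return float(-(p * np.log(q + eps)).sum(axis=-1).mean())

def kl(a, b, eps=1e-9):
    return float((a * np.log((a + eps) / (b + eps))).sum(axis=-1).mean())

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, zero when p equals q."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def total_loss(task_loss, attn_loss, lam):
    """Assumed combination: task loss plus lambda-weighted attention loss."""
    return task_loss + lam * attn_loss

# Toy attention maps: each row is a probability distribution over keys.
rng = np.random.default_rng(0)
logits_t, logits_s = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
p = np.exp(logits_t) / np.exp(logits_t).sum(axis=-1, keepdims=True)  # teacher
q = np.exp(logits_s) / np.exp(logits_s).sum(axis=-1, keepdims=True)  # student

losses = {"mse": mse(p, q), "l1": l1(p, q), "ce": ce(p, q), "jsd": jsd(p, q)}
```

Every one of these losses pulls the student's maps toward the teacher's, so increasing `lam` strengthens the same constraint. That is consistent with the report that failing teachers degrade further under all four.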
Similarly, the authors compared pretraining signals, data sources, and special attributes (e.g., self‑distilled DINOv2 vs. successful DINO, multimodal CLIP vs. successful SigLIP2, masked image modeling BEiTv2 vs. iBOT/MAE). None of these factors correlated with the observed failure, as summarized in Table 6.
Objective Evaluation and Limitations
The study is limited to softmax‑to‑softmax attention‑transfer scenarios and classification tasks. It does not address full‑weight initialization, feature‑level distillation, or parameter‑efficient fine‑tuning. Transfer to dense prediction or vision‑language tasks may require separate validation. The principle of architecture compatibility—whether the student preserves the teacher’s attention‑routing context—remains to be examined for cross‑size transfers (large teacher → small student).
Takeaways for Practitioners
Do not assume attention transfer works universally. Verify that teacher and student architectures match before applying the technique.
Simple rescue: add the teacher’s native components (LayerScale, Pre‑LayerNorm, etc.) to the student, even with random initialization, to recover performance.
Rethink distillation design: prioritize architecture compatibility over merely strengthening the distillation signal.
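A pre-flight check along the lines of the first takeaway might look like the sketch below. The component labels and the idea of describing each architecture as a set of strings are hypothetical conventions, not an API from any library or from the study.

```python
def missing_components(teacher_components, student_components):
    """Components the teacher uses that the student lacks, sorted by name."""
    return sorted(set(teacher_components) - set(student_components))

# Hypothetical component labels for illustration only.
teacher = {"layerscale", "relative_position_bias", "softmax_attention"}
standard_student = {"softmax_attention"}
native_student = set(teacher)

gap_standard = missing_components(teacher, standard_student)
gap_native = missing_components(teacher, native_student)
```

An empty result suggests the student preserves the teacher's architectural context; a non-empty one flags components to add, even randomly initialized, before attempting attention transfer.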
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
