Why Attention Transfer Fails for DINOv2 and Other Modern ViTs: Architecture Mismatch Revealed

A large-scale benchmark of 20 pretrained ViT teachers across 11 families shows that Attention Copy and Attention Distillation improve some models but hurt others, most notably DINOv2, CLIP, and BEiTv2. The failures trace to architecture mismatches between teacher and student, and adding the teachers' native components to the student restores the lost performance.

AIWalker

Core Issue: Attention Transfer Is Not a Universal Remedy

Knowledge distillation often assumes that copying a teacher's attention maps to a randomly initialized student can fully transfer the pretrained advantages of Vision Transformers (ViTs). Earlier work on a few models suggested this might be true.

However, the authors evaluated 20 pretrained teacher models from 11 well-known ViT families, comparing two attention-transfer methods (Attention Copy and Attention Distillation) against a no-transfer baseline. The results, visualized in Figure 1, show a clear split: seven families improve Top-1 accuracy, while four families (including DINOv2, CLIP, and BEiTv2) lose about 5.1% relative to training from scratch.

Figure 1
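As a rough illustration of what "copying" attention could look like, the sketch below lets a student attention head use a teacher-supplied attention map in place of its own softmax(QK^T / sqrt(d)). The summary does not spell out the authors' exact procedure, so the function names (attention_map, apply_attention, student_head) and the single-head, list-based setup are assumptions made purely for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(q, k):
    """Row-wise softmax of scaled dot products: one row per query token."""
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]

def apply_attention(attn, v):
    """Mix the value vectors using the given attention weights."""
    dim = len(v[0])
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(dim)]
        for row in attn
    ]

def student_head(q_s, k_s, v_s, teacher_attn=None):
    """Hypothetical attention copy: a teacher map, if given, replaces the
    student's own attention map; the student still supplies the values."""
    attn = teacher_attn if teacher_attn is not None else attention_map(q_s, k_s)
    return apply_attention(attn, v_s)
```

In this simplified picture the student keeps its own value projections and downstream layers, so whatever the teacher's map "routes" has to make sense in the student's surrounding architecture.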

Why the Failure Persists Across Settings

Consistent Failure Across Epochs, Datasets, and OOD Scenarios

Extending training to 100 epochs confirmed that the successful families (DeiT-S, DINO-S) retain a modest gain, while the failing families (DINOv2-S, DINOv2-wr-S) start below the baseline and never recover. The same pattern appears on three yearly iNaturalist splits (+15.2% / +19.4% for DeiT / DINO versus -1.2% / -2.2% for DINOv2) and on four out-of-distribution benchmarks (IN-A, IN-R, IN-S, IN-V2), where DINOv2-based models consistently underperform the baseline.

Table 1
Table 2

Component‑Level Diagnosis: Attention Path Is the Culprit

To locate the failure, the authors performed a component‑wise ablation on three representative teachers (DeiT‑S, DINOv2‑S, CLIP‑B). They selectively re‑initialized only the attention module or only the MLP module while keeping the rest of the pretrained weights.

Table 3

Re-initializing only the MLP yields a stable gain of +5.7% to +13.3% for all three teachers. Re-initializing only the attention module preserves the success/failure split: it improves DeiT-S but harms DINOv2-S and CLIP-B. This indicates that the failure is rooted in the attention pathway.

Layer‑wise Analysis Confirms Broad Failure

Further experiments transferred k attention layers either from the bottom or the top of DINOv2‑S. Every tested subset produced lower accuracy than the no‑transfer baseline, with top‑layer transfers causing larger drops. Thus the problem is not confined to a few layers but spans the entire attention stack.

Figure 3
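The bottom-k versus top-k layer selection described above can be pictured with a small helper. This is only a sketch: the name layers_to_transfer and the zero-based indexing (0 = layer closest to the input) are assumptions, not details taken from the study.

```python
def layers_to_transfer(num_layers, k, from_top=False):
    """Return the indices of the k attention layers whose maps are transferred.

    Hypothetical helper: index 0 is the layer nearest the input;
    from_top=True selects the last k layers instead of the first k.
    """
    if not 0 <= k <= num_layers:
        raise ValueError("k must be between 0 and num_layers")
    if from_top:
        return list(range(num_layers - k, num_layers))
    return list(range(k))
```

For a 12-layer model, layers_to_transfer(12, 3) selects layers 0 to 2, and layers_to_transfer(12, 3, from_top=True) selects layers 9 to 11.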

Root Cause: Architecture Mismatch Between Teacher and Student

Modern ViT families (DINOv2, CLIP, BEiTv2) incorporate extra components such as LayerScale, Pre‑LayerNorm, and relative positional bias, whereas the standard student ViT uses the simplest architecture without these features. When the teacher’s attention patterns are forced onto a mismatched student, the attention‑routing channels become misaligned.
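To make one of these components concrete, the sketch below contrasts a plain residual connection with a LayerScale-style one, in which the branch output is multiplied by a learnable per-channel vector before being added back. The function names and the toy branch are illustrative assumptions; the gamma values are chosen only to keep the arithmetic readable.

```python
def residual_plain(x, branch):
    """Standard residual connection: x + f(x)."""
    return [xi + bi for xi, bi in zip(x, branch(x))]

def residual_layerscale(x, branch, gamma):
    """LayerScale-style residual: x + gamma * f(x), with one gamma per channel.

    gamma is a learnable vector in a real model; here it is a plain list.
    """
    return [xi + g * bi for xi, g, bi in zip(x, gamma, branch(x))]

# Toy branch standing in for an attention or MLP sub-layer (assumption).
double = lambda x: [2.0 * v for v in x]

plain = residual_plain([1.0, 2.0], double)                     # [3.0, 6.0]
scaled = residual_layerscale([1.0, 2.0], double, [0.5, 0.5])   # [2.0, 4.0]
```

The same branch output contributes differently to the residual stream in the two cases, which gives an intuition for why attention patterns learned in one setting may not carry over unchanged to the other.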

To test this hypothesis, the authors replaced the standard student architecture with the teacher’s native architecture (keeping the extra components randomly initialized) and repeated attention transfer. All four previously failing families switched from negative to positive gains (Figure 4).

Figure 4

Importantly, the added components do not themselves carry pretrained knowledge, since they are randomly initialized. A control experiment showed that a native-architecture student trained from scratch underperforms the standard student by 2.3 to 3.0 percentage points, indicating that the performance boost stems from unlocking the teacher's attention patterns rather than from extra capacity.

Table 5

Alternative Explanations Ruled Out

The authors examined whether the loss function or the pretraining recipe could explain the failure. Experiments varying the distillation loss (MSE vs. CE) and scanning the loss weight λ showed the same monotonic trends: successful teachers improve with larger λ, while failing teachers degrade further, regardless of loss type. Additional tests with JSD and L1 losses yielded identical patterns, confirming that loss choice is not the cause.

Figure 6
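The loss variants and the weight λ discussed above can be sketched as follows, assuming the common form total = task loss + λ × attention loss and attention maps given as lists of row-wise probability distributions. The summary does not give the authors' exact formulation, so the function names and the averaging conventions here are assumptions.

```python
import math

def mse_attention_loss(student, teacher):
    """Mean squared error between two attention maps (lists of rows)."""
    n = sum(len(row) for row in student)
    return sum(
        (s - t) ** 2
        for rs, rt in zip(student, teacher)
        for s, t in zip(rs, rt)
    ) / n

def ce_attention_loss(student, teacher, eps=1e-12):
    """Cross-entropy of student rows against teacher rows, averaged over rows."""
    total = 0.0
    for rs, rt in zip(student, teacher):
        total += -sum(t * math.log(s + eps) for s, t in zip(rs, rt))
    return total / len(student)

def total_loss(task_loss, attn_loss, lam):
    """Assumed combination: task loss plus lambda-weighted attention loss."""
    return task_loss + lam * attn_loss
```

Raising lam pushes the student harder toward the teacher's maps, which matches the reported trend that a larger λ helps compatible teachers and hurts incompatible ones.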

Similarly, the authors compared pretraining signals, data sources, and special attributes (e.g., self‑distilled DINOv2 vs. successful DINO, multimodal CLIP vs. successful SigLIP2, masked image modeling BEiTv2 vs. iBOT/MAE). None of these factors correlated with the observed failure, as summarized in Table 6.

Table 6

Objective Evaluation and Limitations

The study is limited to softmax-to-softmax attention transfer and to classification tasks. It does not address full-weight initialization, feature-level distillation, or parameter-efficient fine-tuning, and transfer to dense prediction or vision-language tasks may require separate validation. Whether the principle of architecture compatibility (that is, whether the student preserves the teacher's attention-routing context) also holds for cross-size transfers (large teacher → small student) remains to be examined.

Takeaways for Practitioners

Do not assume attention transfer works universally. Verify that teacher and student architectures match before applying the technique.

Simple rescue: add the teacher’s native components (LayerScale, Pre‑LayerNorm, etc.) to the student, even with random initialization, to recover performance.

Rethink distillation design: prioritize architecture compatibility over merely strengthening the distillation signal.
