Machine Heart
Apr 22, 2026 · Artificial Intelligence
Apple Turns Transformers into Mamba with Linear‑Cost Distillation
Apple proposes a two-step cross-architecture distillation that converts expensive, high-performing Transformers into cheaper Mamba models of nearly equal strength. The recipe first replaces softmax attention with learned linear attention (Hedgehog) and then embeds this intermediate form into Mamba, reaching comparable perplexity and downstream task performance at far lower inference cost (see the sketch below).
Artificial Intelligence · Cross-Architecture Distillation · Linear Attention
7 min read
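
To make the first step concrete, here is a minimal, self-contained PyTorch sketch of a Hedgehog-style learned linear attention. The class and parameter names (`HedgehogFeatureMap`, `feature_dim`) are our own illustration rather than Apple's code, and the real method additionally trains the feature map by distilling the teacher's softmax attention weights; this sketch only shows the mechanism and its linear cost.

```python
import torch
import torch.nn as nn

class HedgehogFeatureMap(nn.Module):
    """Learned feature map phi so that phi(q) @ phi(k).T can approximate
    softmax attention weights. Names and sizes here are illustrative."""
    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)
        # Softmax over the feature axis keeps phi(x) positive and "spiky",
        # mimicking exp() in softmax attention; the +/- split preserves
        # sign information from the projection.
        return torch.cat([torch.softmax(z, dim=-1),
                          torch.softmax(-z, dim=-1)], dim=-1)

def linear_attention(q, k, v, phi):
    """Non-causal linear attention, O(n) in sequence length:
    out = phi(q) (phi(k)^T v) / (phi(q) phi(k)^T 1).
    A causal version would accumulate phi(k)^T v as a running prefix sum."""
    q, k = phi(q), phi(k)                       # (batch, seq, feat)
    kv = torch.einsum("bsf,bsd->bfd", k, v)     # one pass over the sequence
    out = torch.einsum("bsf,bfd->bsd", q, kv)
    norm = torch.einsum("bsf,bf->bs", q, k.sum(dim=1)).unsqueeze(-1)
    return out / norm.clamp(min=1e-6)

# Toy usage with a single head; in the paper's setting, phi would be trained
# to match the teacher Transformer's attention maps before the Mamba step.
q = torch.randn(2, 128, 64)
k = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
phi = HedgehogFeatureMap(head_dim=64)
y = linear_attention(q, k, v, phi)   # shape (2, 128, 64)
```

Because phi(k)^T v is accumulated once rather than recomputed per query, cost grows linearly with sequence length instead of quadratically, which is where the claimed inference savings come from.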
