Machine Heart
Apr 22, 2026 · Artificial Intelligence
Apple Turns Transformers into Mamba with Linear‑Cost Distillation
Apple proposes a two-step cross-architecture distillation that converts expensive, high-performing Transformers into cheaper Mamba models of nearly equal strength. The recipe first replaces softmax attention with learned linear attention (Hedgehog) and then embeds this intermediate form into Mamba, reaching comparable perplexity and downstream task performance at far lower inference cost (see the sketch below).
Artificial Intelligence · Cross-Architecture Distillation · Linear Attention
7 min read
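
To make the first step concrete, here is a minimal, self-contained PyTorch sketch of a Hedgehog-style learned linear attention. The class and parameter names (`HedgehogFeatureMap`, `feature_dim`) are our own illustration rather than Apple's code, and the real method additionally trains the feature map by distilling the teacher's softmax attention weights; this sketch only shows the mechanism and its linear cost.

```python
import torch
import torch.nn as nn

class HedgehogFeatureMap(nn.Module):
    """Learned feature map phi so that phi(q) @ phi(k).T can approximate
    softmax attention weights. Names and sizes here are illustrative."""
    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)
        # Softmax over the feature axis keeps phi(x) positive and "spiky",
        # mimicking exp() in softmax attention; the +/- split preserves
        # sign information from the projection.
        return torch.cat([torch.softmax(z, dim=-1),
                          torch.softmax(-z, dim=-1)], dim=-1)

def linear_attention(q, k, v, phi):
    """Non-causal linear attention, O(n) in sequence length:
    out = phi(q) (phi(k)^T v) / (phi(q) phi(k)^T 1).
    A causal version would accumulate phi(k)^T v as a running prefix sum."""
    q, k = phi(q), phi(k)                       # (batch, seq, feat)
    kv = torch.einsum("bsf,bsd->bfd", k, v)     # one pass over the sequence
    out = torch.einsum("bsf,bfd->bsd", q, kv)
    norm = torch.einsum("bsf,bf->bs", q, k.sum(dim=1)).unsqueeze(-1)
    return out / norm.clamp(min=1e-6)

# Toy usage with a single head; in the paper's setting, phi would be trained
# to match the teacher Transformer's attention maps before the Mamba step.
q = torch.randn(2, 128, 64)
k = torch.randn(2, 128, 64)
v = torch.randn(2, 128, 64)
phi = HedgehogFeatureMap(head_dim=64)
y = linear_attention(q, k, v, phi)   # shape (2, 128, 64)
```

Because phi(k)^T v is accumulated once rather than recomputed per query, cost grows linearly with sequence length instead of quadratically, which is where the claimed inference savings come from.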
