Turning Transformers into Mamba: How Apple Linearized Inference Costs
Apple introduced a two‑step cross‑architecture distillation method that converts costly quadratic‑time Transformers into cheaper linear‑time Mamba models, preserving most of the original performance while dramatically reducing inference cost.
Problem
Transformer models use softmax attention, whose compute and memory scale quadratically with sequence length. This becomes prohibitive for long contexts such as code, agents, or multi‑turn reasoning. Subquadratic alternatives such as linear attention, RWKV, and Mamba reduce the cost but exhibit a noticeable performance gap compared with softmax attention.
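To make the cost difference concrete, here is a minimal sketch (illustrative, not the paper's code) contrasting softmax attention, which materializes an L x L score matrix, with a kernelized linear attention that reorders the matrix products so cost grows linearly in sequence length. The elu-based feature map is just a common illustrative choice.

```python
# Minimal sketch: softmax attention builds an (L, L) score matrix, so time and
# memory scale as O(L^2); kernelized linear attention uses associativity to
# avoid that matrix, scaling as O(L * d^2).
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (L, d); scores: (L, L)
    scores = q @ k.T / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Replace softmax with a positive feature map phi, then compute
    # phi(q) @ (phi(k)^T v) instead of (phi(q) phi(k)^T) @ v.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1        # simple positive features
    kv = phi_k.T @ v                                 # (d, d) summary, O(L * d^2)
    norm = (phi_q @ phi_k.sum(dim=0)).unsqueeze(-1)  # per-query normalizer
    return (phi_q @ kv) / norm.clamp(min=1e-6)

q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```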
Method: HedgeMamba Cross‑Architecture Distillation
Stage 1 – Linear‑Attention Intermediate (Hedgehog)
Replace the original softmax attention with a learned linear‑attention module called Hedgehog: an MLP maps input features to a linear‑attention kernel, a mapping derived from Mercer's theorem. The Hedgehog model is trained to align its outputs with the Transformer teacher through cosine‑similarity distillation. Because linear attention lacks softmax's inherent normalization, an additional normalization step is applied to the Hedgehog outputs.
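A hedged sketch of what Stage 1 could look like, under a straightforward reading of the description above: a small MLP feature map stands in for softmax, its output is explicitly normalized, and a cosine‑similarity loss aligns the student against the frozen teacher's attention outputs. The layer sizes, the exponential feature activation, and the exact loss form are illustrative assumptions rather than the authors' implementation.

```python
# Stage 1 sketch (illustrative assumptions, not the released implementation):
# a learned MLP feature map replaces softmax, its output is explicitly
# normalized, and training minimizes a cosine-similarity distillation loss
# against the frozen Transformer teacher's attention outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedLinearAttention(nn.Module):
    def __init__(self, dim, feat_dim=64):
        super().__init__()
        # MLP feature map applied to both queries and keys
        self.phi = nn.Sequential(nn.Linear(dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, q, k, v):
        # q, k, v: (batch, L, dim)
        phi_q = torch.exp(self.phi(q))   # positive, softmax-like features
        phi_k = torch.exp(self.phi(k))
        kv = phi_k.transpose(-2, -1) @ v                     # (batch, feat, dim)
        out = phi_q @ kv                                     # (batch, L, dim)
        # explicit normalization, standing in for softmax's built-in one
        norm = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
        return out / norm.clamp(min=1e-6)

def cosine_distill_loss(student_out, teacher_out):
    # 1 - cosine similarity between student and teacher attention outputs
    return 1 - F.cosine_similarity(student_out, teacher_out, dim=-1).mean()
```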
Stage 2 – Structural Alignment with Mamba
Embed the aligned Hedgehog module into the Mamba architecture. The core attention computation is mapped onto Mamba’s internal parameters so that the Mamba model initializes with a representation already close to the Hedgehog intermediate, avoiding training from scratch. This structural alignment preserves the efficiency of Mamba while allowing its convolutional and gating mechanisms to be fine‑tuned.
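One way to picture the structural alignment, under a simplified interpretation of the paper: causal linear attention can be rewritten as a gated recurrence over a fixed-size state, which is the form a Mamba-style layer computes. Initializing the gate near pure accumulation lets the converted layer reproduce the aligned Hedgehog computation at the start of fine-tuning, while leaving the gate free to learn data-dependent forgetting. The class below illustrates that idea; it is not Mamba's actual parameterization or the paper's transfer procedure.

```python
# Structural-alignment sketch (a simplified illustration, not Mamba itself):
# causal linear attention written as a gated recurrence over a fixed-size
# state. With the gate initialized near 1, the layer starts out equivalent
# to the distilled Hedgehog attention; fine-tuning can then adapt the gate.
import torch
import torch.nn as nn

class GatedLinearAttentionState(nn.Module):
    def __init__(self, feat_dim, init_gate_logit=6.0):
        super().__init__()
        # sigmoid(6.0) ~ 0.998: the state is almost purely accumulated at init
        self.gate_logit = nn.Parameter(torch.full((feat_dim,), init_gate_logit))

    def forward(self, phi_q, phi_k, v):
        # phi_q, phi_k: (L, feat) from the distilled feature map; v: (L, dim)
        gate = torch.sigmoid(self.gate_logit)                # (feat,)
        state = torch.zeros(phi_k.shape[-1], v.shape[-1])    # (feat, dim)
        norm = torch.zeros(phi_k.shape[-1])                  # (feat,)
        outs = []
        for t in range(phi_q.shape[0]):
            state = gate.unsqueeze(-1) * state + torch.outer(phi_k[t], v[t])
            norm = gate * norm + phi_k[t]
            outs.append((phi_q[t] @ state) / (phi_q[t] @ norm).clamp(min=1e-6))
        return torch.stack(outs)
```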
After the two‑step conversion, the model is fine‑tuned with standard cross‑entropy loss, re‑enabling Mamba’s gating and convolutional layers to recover capabilities beyond mere imitation.
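A minimal sketch of that final step, with the converted model, optimizer, and batches as placeholders:

```python
# Final fine-tuning sketch: ordinary next-token cross-entropy on the converted
# model, with gating and convolution parameters trainable. `mamba_model` and
# `optimizer` are placeholders for whatever the conversion produced.
import torch.nn.functional as F

def finetune_step(mamba_model, optimizer, input_ids):
    logits = mamba_model(input_ids)                      # (batch, L, vocab)
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           input_ids[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```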
Experiments
Using only ~10B tokens (≈2.7% of the teacher's training data), the HedgeMamba 1B model reaches a perplexity of 14.11, close to the Transformer teacher's 13.86 and substantially better than the Hedgehog baseline's 14.89. Downstream benchmarks (ARC, PIQA, BoolQ, RACE, LogiQA) show near‑teacher performance, indicating that the method recovers both the teacher's probability distribution and its reasoning ability.
Ablation Studies and Scaling
Mamba's gating mechanism, rather than the mere stacking of modules, is critical to the converted model's effectiveness.
Data allocation across stages is asymmetric: a light Stage 1 combined with a heavy Stage 2 yields the best results.
Scaling the token count from 1 B to 10 B tokens produces stable performance improvements without divergence, demonstrating scalability.
Direct one‑step distillation from Transformer to Mamba leads to catastrophic failure (perplexity > 100), confirming that the two‑step path is a necessary structural condition.
Conclusion
HedgeMamba demonstrates that large‑scale Transformer models can be retrofitted into more efficient Mamba architectures through a principled two‑step cross‑architecture distillation pipeline. The approach bridges the performance gap between softmax and linear attention while substantially reducing inference cost.
Reference: https://arxiv.org/abs/2604.14191
Code example
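As a compact, self-contained illustration (an assumption-laden sketch, not the authors' code), the snippet below checks numerically that causal kernelized attention computed with an explicit L x L score matrix matches the same computation performed as a recurrent state update, which is the kind of equivalence the conversion to a recurrent architecture relies on.

```python
# Check that "attention-style" and "recurrent-style" causal kernelized
# attention agree (illustrative sketch, not the paper's code).
import torch

torch.manual_seed(0)
L, f, d = 16, 8, 4
phi_q, phi_k, v = torch.rand(L, f), torch.rand(L, f), torch.randn(L, d)

# (a) attention-style: explicit (L, L) causal score matrix, O(L^2)
scores = (phi_q @ phi_k.T) * torch.tril(torch.ones(L, L))
attn_out = (scores @ v) / scores.sum(-1, keepdim=True).clamp(min=1e-6)

# (b) recurrent-style: fixed-size state updated token by token, O(L)
state, norm, rec_out = torch.zeros(f, d), torch.zeros(f), []
for t in range(L):
    state = state + torch.outer(phi_k[t], v[t])
    norm = norm + phi_k[t]
    rec_out.append((phi_q[t] @ state) / (phi_q[t] @ norm).clamp(min=1e-6))

print(torch.allclose(attn_out, torch.stack(rec_out), atol=1e-5))  # True
```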