Turning Transformers into Mamba: How Apple Linearized Inference Costs

Apple introduced a two‑step cross‑architecture distillation method that converts costly quadratic‑time Transformers into cheaper linear‑time Mamba models, preserving most of the original performance while dramatically reducing inference cost.


Problem

Transformer models use softmax attention, whose compute and memory scale quadratically with sequence length. This becomes prohibitive for long contexts such as code, agent trajectories, or multi-turn reasoning. Linear-time alternatives (e.g., linear attention, RWKV, Mamba) reduce cost but exhibit a noticeable performance gap compared with softmax attention.
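To make the cost difference concrete, here is a minimal numpy sketch of both attention forms. The ELU-style positive feature map is a common illustrative choice, not the one used in this paper: the point is that linear attention replaces the (n, n) score matrix with a (d, d) summary, so cost grows linearly in sequence length n.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Scores matrix is (n, n): compute and memory grow quadratically in n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # phi(q) (phi(k)^T v): a (d, d) running state replaces the (n, n)
    # score matrix, so cost is linear in sequence length n.
    phi = lambda x: np.maximum(x, 0.0) + 1e-2   # simple positive feature map
    kv = phi(k).T @ v                            # (d, d) summary of the keys/values
    z = phi(q) @ phi(k).sum(axis=0)              # per-query normalizer, shape (n,)
    return (phi(q) @ kv) / z[:, None]

n, d = 128, 16
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, n, d))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```

Both paths return an (n, d) output, but only the softmax path ever materializes an (n, n) matrix.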

Method: HedgeMamba Cross‑Architecture Distillation

Stage 1 – Linear‑Attention Intermediate (Hedgehog)

Replace the original softmax attention with a learned linear-attention module called Hedgehog. An MLP maps input features to a linear-attention kernel; the mapping is motivated by Mercer's theorem. The Hedgehog model is trained to align its outputs with the Transformer teacher via cosine-similarity distillation. Because linear attention lacks softmax's inherent normalization, an additional normalization step is applied to the Hedgehog outputs.
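The two ingredients of this stage, a learned positive feature map and a cosine-similarity alignment loss, can be sketched as follows. This is a minimal illustration; the exact Hedgehog parameterization and normalization in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

class HedgehogFeatureMap:
    """Learned MLP feature map for linear attention (illustrative sketch)."""
    def __init__(self, d_model, d_feat):
        self.W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_feat))

    def __call__(self, x):
        # Exponentiating the MLP output keeps features positive, mimicking
        # the exponential kernel implicit in softmax attention.
        h = x @ self.W
        return np.exp(h - h.max(axis=-1, keepdims=True))  # stabilized exp

def cosine_distill_loss(student_out, teacher_out, eps=1e-8):
    # Stage-1 objective: align the linear-attention student's outputs
    # with the frozen Transformer teacher's, token by token.
    num = (student_out * teacher_out).sum(axis=-1)
    den = (np.linalg.norm(student_out, axis=-1)
           * np.linalg.norm(teacher_out, axis=-1))
    return float((1.0 - num / (den + eps)).mean())

feats = HedgehogFeatureMap(16, 32)(rng.normal(size=(8, 16)))
print(feats.shape, (feats > 0).all())  # positive features for linear attention
```

The loss is zero when student and teacher outputs point in the same direction, which is exactly the alignment target of Stage 1.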

Figure: linear-attention (Hedgehog) distillation.

Stage 2 – Structural Alignment with Mamba

Embed the aligned Hedgehog module into the Mamba architecture. The core attention computation is mapped onto Mamba’s internal parameters so that the Mamba model initializes with a representation already close to the Hedgehog intermediate, avoiding training from scratch. This structural alignment preserves the efficiency of Mamba while allowing its convolutional and gating mechanisms to be fine‑tuned.

After the two‑step conversion, the model is fine‑tuned with standard cross‑entropy loss, re‑enabling Mamba’s gating and convolutional layers to recover capabilities beyond mere imitation.
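The fine-tuning objective at this point is ordinary next-token cross-entropy; the perplexities reported below are its exponential. A framework-agnostic numpy sketch of that loss (the Mamba model itself is out of scope here):

```python
import numpy as np

def next_token_cross_entropy(logits, labels):
    """Standard next-token cross-entropy, the loss used to fine-tune the
    converted Mamba student (sketch; any framework's built-in is equivalent)."""
    # logits: (batch, seq, vocab); labels: (batch, seq) integer token ids.
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    b, s = labels.shape
    picked = log_probs[np.arange(b)[:, None], np.arange(s)[None, :], labels]
    return -picked.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 10))
labels = rng.integers(0, 10, size=(2, 5))
loss = next_token_cross_entropy(logits, labels)
print(loss, np.exp(loss))  # perplexity = exp(cross-entropy)
```

A uniform model over a 10-token vocabulary gives loss ln(10) and perplexity 10, which is why the perplexity numbers in the next section read as "effective vocabulary uncertainty".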

Experiments

Using only ~10 B tokens (≈2.7 % of the teacher's training data), the HedgeMamba 1 B model achieves a perplexity of 14.11, close to the Transformer teacher's 13.86 and substantially better than the Hedgehog baseline's 14.89. Downstream benchmarks (ARC, PIQA, BoolQ, RACE, LogiQA) show near-teacher performance, indicating that the method recovers both output distributions and reasoning ability.

Figure: perplexity comparison, Teacher vs. Hedgehog vs. HedgeMamba.

Ablation Studies and Scaling

The gating mechanism, rather than merely stacking modules, is critical for Mamba’s effectiveness.

Data allocation across stages is asymmetric: a light token budget for Stage 1 and a heavy budget for Stage 2 yield the best results.

Scaling the token count from 1 B to 10 B tokens produces stable performance improvements without divergence, demonstrating scalability.

Direct one‑step distillation from Transformer to Mamba leads to catastrophic failure (perplexity > 100), confirming that the two‑step path is a necessary structural condition.

Conclusion

HedgeMamba demonstrates that large‑scale Transformer models can be retrofitted into more efficient Mamba architectures through a principled two‑step cross‑architecture distillation pipeline. The approach bridges the performance gap between softmax and linear attention while substantially reducing inference cost.

Reference: https://arxiv.org/abs/2604.14191

Code example
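The source carried this heading without an accompanying snippet. Below is high-level pseudocode of the full pipeline as described in this article; every function name is illustrative, not the paper's API.

```
# Pseudocode: HedgeMamba two-stage cross-architecture distillation
student = replace_softmax_attention_with_hedgehog(teacher)

# Stage 1: align Hedgehog outputs to the frozen Transformer teacher,
# with extra output normalization since linear attention lacks softmax's.
train(student, data=small_token_budget,
      loss=cosine_similarity_to(teacher), normalize_outputs=True)

# Stage 2: map the aligned module onto Mamba's internal parameters so
# training does not start from scratch, then fine-tune end to end with
# cross-entropy, re-enabling Mamba's gating and convolutional layers.
mamba = init_mamba_from(student)
train(mamba, data=large_token_budget, loss=cross_entropy)
```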

Source: 机器之心. About 2,500 words; suggested reading time 5 minutes. A new cross-architecture distillation method from Transformer to Mamba.

Tags: model compression, Transformer, AI research, perplexity, linear attention, Mamba, cross-architecture distillation
Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.