Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper Mamba models of nearly equal strength: it first replaces softmax attention with learned linear attention (Hedgehog), then embeds this intermediate form into Mamba, achieving comparable perplexity and downstream task performance at far lower inference cost.

Apple introduces a two‑stage cross‑architecture distillation pipeline that transforms a standard softmax‑based Transformer into a more compute‑efficient Mamba model while preserving most of its performance. The first stage replaces the costly softmax attention with a learned linear attention module (named Hedgehog), in which a small MLP serves as a Mercer‑theorem‑motivated feature map, and aligns the module's outputs with the original Transformer's through cosine‑similarity distillation.
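As a rough illustration of stage one, here is a minimal PyTorch sketch of a Hedgehog‑style learned feature map, the resulting linear attention, and a cosine alignment loss. The exp(±Wx) construction, the feature dimension, and all names (HedgehogFeatureMap, linear_attention, stage1_loss) are assumptions made for exposition, not Apple's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HedgehogFeatureMap(nn.Module):
    """Learned feature map phi(.): a small MLP whose exponentiated outputs
    stay positive, so phi(q) . phi(k) can approximate exp(q . k)."""
    def __init__(self, head_dim: int, feature_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim)

    def forward(self, x):
        h = self.proj(x)
        # exp(+h) and exp(-h) together give a strictly positive kernel,
        # echoing the Mercer-theorem view of softmax attention.
        return torch.cat([torch.exp(h), torch.exp(-h)], dim=-1)

def linear_attention(q, k, v, phi, eps=1e-6):
    """O(n) attention; non-causal for brevity (a causal version would
    replace the sums over positions with prefix sums)."""
    q_f, k_f = phi(q), phi(k)                        # (B, n, r)
    kv = torch.einsum("bnr,bnd->brd", k_f, v)        # sum_j phi(k_j) v_j^T
    z = torch.einsum("bnr,br->bn", q_f, k_f.sum(1))  # normalizer
    return torch.einsum("bnr,brd->bnd", q_f, kv) / (z.unsqueeze(-1) + eps)

def stage1_loss(student_out, teacher_out):
    """Stage 1: align linear-attention outputs to the frozen teacher's
    softmax-attention outputs with a cosine-similarity loss."""
    return 1.0 - F.cosine_similarity(student_out, teacher_out, dim=-1).mean()
```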

In the second stage, the aligned linear‑attention module is integrated into the Mamba architecture. Crucially, the authors map the attention computation onto Mamba's internal parameters so that the model's initialization already mirrors the intermediate representation, avoiding cold‑start training from scratch. An additional normalization step compensates for the normalization that softmax provided implicitly, keeping the output distribution close to the original while retaining linear complexity.
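Why this mapping yields a warm start can be seen by rewriting the stage‑one linear attention in recurrent form: the running state is structurally a (decay‑free) Mamba hidden state, and a second running sum supplies the missing normalization. The sketch below is a hypothetical illustration under that assumption; an actual Mamba block additionally applies learned, input‑dependent decay to the state, and the function and variable names here are invented.

```python
import torch

def normalized_recurrent_form(q_f, k_f, v):
    """Recurrent rewrite of the stage-1 linear attention: `state` plays
    the role of Mamba's hidden state, and `norm` is the running sum that
    restores the normalization softmax would otherwise provide."""
    B, n, r = q_f.shape
    d = v.shape[-1]
    state = torch.zeros(B, r, d)   # analogue of Mamba's SSM state
    norm = torch.zeros(B, r)       # running normalizer (the added step)
    outs = []
    for t in range(n):
        state = state + torch.einsum("br,bd->brd", k_f[:, t], v[:, t])
        norm = norm + k_f[:, t]
        num = torch.einsum("br,brd->bd", q_f[:, t], state)
        den = (q_f[:, t] * norm).sum(-1, keepdim=True).clamp_min(1e-6)
        outs.append(num / den)     # normalized output at step t
    return torch.stack(outs, dim=1)
```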

Experiments on a 1B‑parameter model trained on ~10B tokens show that the resulting HedgeMamba reaches a perplexity of 14.11, close to the Transformer teacher's 13.86 and markedly better than the Hedgehog baseline's 14.89. Across downstream benchmarks (ARC, PIQA, BoolQ, RACE, LogiQA), HedgeMamba matches or exceeds the baseline and approaches the teacher's scores, demonstrating that the method preserves substantial reasoning and semantic capability.

The authors also test direct one‑step distillation, which causes perplexity to explode above 100, confirming that the two‑step approach is a structural necessity rather than an optimization trick. Ablation studies reveal that Mamba's gating mechanisms, rather than mere stacking of modules, drive the performance gains (see the toy contrast below), and that allocating most of the training data to the second stage (light stage 1, heavy stage 2) yields the best results. Scaling experiments from 1B to 10B tokens show stable, monotonic improvement without divergence, indicating that the method scales.
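A toy contrast makes the gating ablation concrete: with the multiplicative SiLU gate disabled, the block degenerates into a plain stacked layer. ToyGatedBlock and its parameters are hypothetical names for exposition, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGatedBlock(nn.Module):
    """With use_gate=False this is plain stacking; with use_gate=True the
    output is modulated by a Mamba-style multiplicative SiLU gate computed
    from a parallel projection of the input."""
    def __init__(self, dim: int, use_gate: bool = True):
        super().__init__()
        self.core = nn.Linear(dim, dim)   # stand-in for the SSM path
        self.gate = nn.Linear(dim, dim)   # parallel gating branch
        self.use_gate = use_gate

    def forward(self, x):
        y = self.core(x)
        return y * F.silu(self.gate(x)) if self.use_gate else y
```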

Overall, this work demonstrates that existing large‑scale Transformer models can be retrofitted into cheaper Mamba variants through a principled, intermediate‑aligned distillation process, opening a pathway for cost‑effective deployment of open‑source and proprietary models alike.

Artificial Intelligence · Model Compression · Transformer · Linear Attention · Mamba · Cross‑Architecture Distillation
Written by Machine Heart, a professional AI media and industry service platform.