Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑phase cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.

Transformers achieve strong performance but their softmax attention has quadratic cost, making inference expensive for long contexts such as code, agents, or multi‑turn reasoning. Linear‑attention alternatives (e.g., RWKV, Mamba) are cheaper but noticeably weaker, especially at larger scales.
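To see where the linear cost comes from: if the softmax kernel exp(q·k) is replaced by an inner product of feature maps φ(q)·φ(k), matrix associativity lets attention be computed as φ(Q)(φ(K)ᵀV) without ever materializing the (T × T) score matrix. A minimal PyTorch sketch, using the common elu+1 feature map as a stand‑in for a learned one (not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: materializes a (T, T) score matrix -> O(T^2) time/memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Kernelized attention: replace exp(q.k) with phi(q).phi(k), then use
    # associativity, (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V), so only a
    # (d, d) summary is ever built -> O(T d^2) time, O(d^2) extra memory.
    q, k = F.elu(q) + 1, F.elu(k) + 1      # simple positive feature map (stand-in)
    kv = k.transpose(-2, -1) @ v           # (d, d) summary of keys and values
    z = k.sum(dim=-2, keepdim=True)        # (1, d) normalizer
    return (q @ kv) / (q @ z.transpose(-2, -1) + 1e-6)

T, d = 2048, 64
q, k, v = (torch.randn(T, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```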

Two‑phase distillation pipeline (HedgeMamba)

Phase 1 – Hedgehog linear attention: The original softmax attention is replaced by a learned linear attention. Motivated by Mercer’s theorem, a small MLP learns a feature map whose inner products approximate the softmax kernel. The linear‑attention outputs are aligned to the Transformer’s outputs via cosine‑similarity distillation, and a lightweight normalization restores the original attention output format.
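A sketch of what Phase 1 might look like; the class and function names here are illustrative, not the paper's code. A softmax over the feature dimension keeps the learned features positive and normalized, and training minimizes one minus the cosine similarity between the linear‑attention output and the frozen teacher's attention output at each position:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedFeatureMap(nn.Module):
    """MLP feature map (illustrative): softmax over the feature dimension
    keeps outputs positive, in the spirit of approximating the softmax kernel."""
    def __init__(self, head_dim: int, feat_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feat_dim)

    def forward(self, x):
        return torch.softmax(self.proj(x), dim=-1)

def hedgehog_linear_attention(q, k, v, phi_q, phi_k):
    q, k = phi_q(q), phi_k(k)              # map queries/keys into feature space
    kv = k.transpose(-2, -1) @ v           # key-value summary, linear in T
    z = k.sum(dim=-2, keepdim=True)
    return (q @ kv) / (q @ z.transpose(-2, -1) + 1e-6)  # normalized output

def alignment_loss(student_out, teacher_out):
    # Cosine-similarity distillation: match the direction of each position's output.
    return (1 - F.cosine_similarity(student_out, teacher_out, dim=-1)).mean()
```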

Phase 2 – Embedding into Mamba: The aligned linear attention is mapped onto Mamba’s internal parameters, so the Mamba model starts out behaving like the intermediate Hedgehog model rather than learning from scratch. An additional normalization keeps the output distribution similar; Mamba’s native convolution and gating mechanisms are then re‑enabled, and the model is fine‑tuned with the standard cross‑entropy loss.
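The paper's exact parameter mapping is not reproduced here; the toy block below only illustrates the re‑enabling trick as described: initialize the depthwise convolution as an identity and the gate as fully open, so the block initially passes the aligned linear‑attention signal through nearly unchanged, and fine‑tuning then gradually puts the conv and gate to work. All names and the block structure are assumptions for illustration:

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Toy stand-in for Mamba's conv + gating (real Mamba adds a selective SSM scan).
    At init: causal depthwise conv == identity, sigmoid gate ~= 1 (open),
    so the block is a near no-op around the distilled linear attention."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)
        self.gate = nn.Linear(dim, dim)
        nn.init.zeros_(self.conv.weight)
        with torch.no_grad():
            self.conv.weight[:, 0, -1] = 1.0   # only the current-timestep tap fires
        nn.init.zeros_(self.conv.bias)
        nn.init.zeros_(self.gate.weight)       # gate ignores its input at init...
        nn.init.constant_(self.gate.bias, 4.0) # ...and sigmoid(4) ~ 0.98: open

    def forward(self, x):                      # x: (batch, T, dim)
        h = self.conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal
        return torch.sigmoid(self.gate(x)) * h
```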

The staged pipeline targets the usual trade‑off: pick a Transformer when you need performance (expensive inference), or Mamba when you need it cheap (weaker quality).

Experimental results

A 1‑billion‑parameter model trained on ~10 B tokens achieves perplexity 14.11, close to the teacher Transformer (13.86) and better than the Hedgehog baseline (14.89).
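For context, perplexity is the exponential of the mean token‑level cross‑entropy, so the reported gaps correspond to small per‑token loss differences:

```python
import math

# ppl = exp(mean NLL): the 14.11 vs 13.86 gap is ~0.018 nats per token.
for name, ppl in [("teacher Transformer", 13.86),
                  ("HedgeMamba", 14.11),
                  ("Hedgehog baseline", 14.89)]:
    print(f"{name:20s} ppl={ppl:5.2f}  mean NLL={math.log(ppl):.3f} nats/token")
```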

On downstream benchmarks (ARC, PIQA, BoolQ, RACE, LogiQA), HedgeMamba matches or exceeds the Hedgehog baseline and approaches the teacher’s scores, indicating that reasoning ability is preserved.

A direct one‑step distillation from Transformer to Mamba collapses performance (perplexity > 100), confirming the necessity of the staged approach.

Ablation studies show that the gating mechanism, rather than mere stacking of modules, drives Mamba’s effectiveness.

Allocating most training data to Phase 2 (light Phase 1, heavy Phase 2) yields the best results.

Scaling the token count from 1 B to 10 B tokens yields steady performance gains without divergence, demonstrating scalability.

The pipeline provides a practical way to retrofit existing Transformer checkpoints into more compute‑efficient Mamba models, potentially lowering inference costs for open‑source and commercial deployments.

Reference: https://arxiv.org/abs/2604.14191

Tags: model compression, distillation, linear attention, Mamba, cross‑architecture