Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑phase cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.

Transformers achieve strong performance but their softmax attention has quadratic cost, making inference expensive for long contexts such as code, agents, or multi‑turn reasoning. Linear‑attention alternatives (e.g., RWKV, Mamba) are cheaper but noticeably weaker, especially at larger scales.
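To see where the linear cost comes from: if the softmax kernel exp(q·k) is replaced by an inner product of feature maps φ(q)·φ(k), matrix associativity lets attention be computed as φ(Q)(φ(K)ᵀV) without ever materializing the (T × T) score matrix. A minimal PyTorch sketch, using the common elu+1 feature map as a stand‑in for a learned one (not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: materializes a (T, T) score matrix -> O(T^2) time/memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Kernelized attention: replace exp(q.k) with phi(q).phi(k), then use
    # associativity, (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V), so only a
    # (d, d) summary is ever built -> O(T d^2) time, O(d^2) extra memory.
    q, k = F.elu(q) + 1, F.elu(k) + 1      # simple positive feature map (stand-in)
    kv = k.transpose(-2, -1) @ v           # (d, d) summary of keys and values
    z = k.sum(dim=-2, keepdim=True)        # (1, d) normalizer
    return (q @ kv) / (q @ z.transpose(-2, -1) + 1e-6)

T, d = 2048, 64
q, k, v = (torch.randn(T, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```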

Two‑phase distillation pipeline (HedgeMamba)

Phase 1 – Hedgehog linear attention: The original softmax attention is replaced by a learned linear attention. Motivated by Mercer’s theorem, a small MLP learns a feature map whose inner products approximate the softmax kernel. The linear‑attention outputs are aligned to the Transformer’s outputs via cosine‑similarity distillation, and a lightweight normalization restores the original attention output format.
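A sketch of what Phase 1 might look like; the class and function names here are illustrative, not the paper's code. A softmax over the feature dimension keeps the learned features positive and normalized, and training minimizes one minus the cosine similarity between the linear‑attention output and the frozen teacher's attention output at each position:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedFeatureMap(nn.Module):
    """MLP feature map (illustrative): softmax over the feature dimension
    keeps outputs positive, in the spirit of approximating the softmax kernel."""
    def __init__(self, head_dim: int, feat_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feat_dim)

    def forward(self, x):
        return torch.softmax(self.proj(x), dim=-1)

def hedgehog_linear_attention(q, k, v, phi_q, phi_k):
    q, k = phi_q(q), phi_k(k)              # map queries/keys into feature space
    kv = k.transpose(-2, -1) @ v           # key-value summary, linear in T
    z = k.sum(dim=-2, keepdim=True)
    return (q @ kv) / (q @ z.transpose(-2, -1) + 1e-6)  # normalized output

def alignment_loss(student_out, teacher_out):
    # Cosine-similarity distillation: match the direction of each position's output.
    return (1 - F.cosine_similarity(student_out, teacher_out, dim=-1)).mean()
```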

Phase 2 – Embedding into Mamba: The aligned linear attention is mapped onto Mamba’s internal parameters, so the Mamba model starts out behaving like the intermediate Hedgehog model rather than learning from scratch. An additional normalization keeps the output distribution similar; Mamba’s native convolution and gating mechanisms are then re‑enabled, and the model is fine‑tuned with the standard cross‑entropy loss.
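The paper's exact parameter mapping is not reproduced here; the toy block below only illustrates the re‑enabling trick as described: initialize the depthwise convolution as an identity and the gate as fully open, so the block initially passes the aligned linear‑attention signal through nearly unchanged, and fine‑tuning then gradually puts the conv and gate to work. All names and the block structure are assumptions for illustration:

```python
import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """Toy stand-in for Mamba's conv + gating (real Mamba adds a selective SSM scan).
    At init: causal depthwise conv == identity, sigmoid gate ~= 1 (open),
    so the block is a near no-op around the distilled linear attention."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size - 1, groups=dim)
        self.gate = nn.Linear(dim, dim)
        nn.init.zeros_(self.conv.weight)
        with torch.no_grad():
            self.conv.weight[:, 0, -1] = 1.0   # only the current-timestep tap fires
        nn.init.zeros_(self.conv.bias)
        nn.init.zeros_(self.gate.weight)       # gate ignores its input at init...
        nn.init.constant_(self.gate.bias, 4.0) # ...and sigmoid(4) ~ 0.98: open

    def forward(self, x):                      # x: (batch, T, dim)
        h = self.conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal
        return torch.sigmoid(self.gate(x)) * h
```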

The staged pipeline targets the usual trade‑off: pick a Transformer when you need performance (expensive inference), or Mamba when you need it cheap (weaker quality).

Experimental results

A 1‑billion‑parameter model trained on ~10 B tokens achieves perplexity 14.11, close to the teacher Transformer (13.86) and better than the Hedgehog baseline (14.89).
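For context, perplexity is the exponential of the mean token‑level cross‑entropy, so the reported gaps correspond to small per‑token loss differences:

```python
import math

# ppl = exp(mean NLL): the 14.11 vs 13.86 gap is ~0.018 nats per token.
for name, ppl in [("teacher Transformer", 13.86),
                  ("HedgeMamba", 14.11),
                  ("Hedgehog baseline", 14.89)]:
    print(f"{name:20s} ppl={ppl:5.2f}  mean NLL={math.log(ppl):.3f} nats/token")
```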

On downstream benchmarks (ARC, PIQA, BoolQ, RACE, LogiQA), HedgeMamba matches or exceeds the Hedgehog baseline and approaches the teacher’s scores, indicating that reasoning ability is preserved.

A direct one‑step distillation from Transformer to Mamba collapses performance (perplexity > 100), confirming the necessity of the staged approach.

Ablation studies show that the gating mechanism, rather than mere stacking of modules, drives Mamba’s effectiveness.

Allocating most training data to Phase 2 (light Phase 1, heavy Phase 2) yields the best results.

Scaling the token count from 1 B to 10 B tokens yields steady performance gains without divergence, demonstrating scalability.

The pipeline provides a practical way to retrofit existing Transformer checkpoints into more compute‑efficient Mamba models, potentially lowering inference costs for open‑source and commercial deployments.

Reference: https://arxiv.org/abs/2604.14191

Tags: model compression, distillation, linear attention, Mamba, cross‑architecture