Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost
The article presents a two‑step cross‑architecture distillation method that first replaces the quadratic‑cost softmax attention of Transformers with a learned linear attention and then maps the distilled attention onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear in sequence length.
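To make the core idea of the first step concrete, here is a minimal sketch (not the authors' code) contrasting standard softmax attention, whose score matrix grows quadratically with sequence length, against a kernelized linear attention of the kind such a distillation step would target. The elu+1 feature map, single-head layout, and function names are illustrative assumptions, not details taken from the article.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard attention: the (n, n) score matrix makes cost quadratic in length n."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale       # (batch, n, n)
    return F.softmax(scores, dim=-1) @ v             # (batch, n, d)

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: associativity lets us build a (d, d) summary first, so cost is linear in n."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1        # positive feature map (an assumption)
    kv = phi_k.transpose(-2, -1) @ v                 # (batch, d, d) key/value summary
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps  # (batch, n, 1) normalizer
    return (phi_q @ kv) / z                          # (batch, n, d)

if __name__ == "__main__":
    q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
    print(softmax_attention(q, k, v).shape)          # torch.Size([2, 128, 64])
    print(linear_attention(q, k, v).shape)           # torch.Size([2, 128, 64])
```

Because the (d, d) key/value summary can be updated token by token, the linear form also runs as a recurrent state update at decode time, which is what makes the subsequent transfer onto a state-space backbone such as Mamba natural.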
