Why State Space Models May Outperform Transformers: A Deep Dive

The article provides a comprehensive technical analysis of state space models (SSMs) versus Transformers, covering their core mechanisms, three key design elements, training efficiency, scaling behavior, the tokenization debate, and experimental evidence that highlights the trade-offs and potential advantages of SSMs in modern AI systems.


Background

Albert Gu, a researcher at CMU and chief scientist at Cartesia AI, recently published a blog post, adapted from a series of talks given over the past year, that examines the trade-offs between state space models (SSMs) and Transformers.

What Is a State Space Model?

An SSM can be viewed as a modern variant of a recurrent neural network (RNN) that maintains a hidden state h_t of dimension N, which is typically larger than the input and output dimensions. This large hidden state enables the model to store rich contextual information for downstream tasks.
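The recurrence is easiest to see in code. Below is a minimal sketch in NumPy, assuming a diagonal transition matrix and illustrative sizes (d = 4 input/output channels, N = 64 state dimensions); it is only meant to show that the state is larger than the input/output and that memory does not grow with sequence length.

```python
import numpy as np

# Minimal sketch of a linear SSM recurrence with a diagonal transition.
d, N, T = 4, 64, 128
rng = np.random.default_rng(0)

A = rng.uniform(0.9, 0.999, size=N)        # fixed per-channel decay (diagonal A)
B = rng.normal(size=(d, N)) / np.sqrt(d)   # projects the input into the state
C = rng.normal(size=(N, d)) / np.sqrt(N)   # reads the output back out of the state

x = rng.normal(size=(T, d))
h = np.zeros(N)                            # hidden state, larger than input/output
outputs = []
for t in range(T):
    h = A * h + x[t] @ B                   # h_t = A h_{t-1} + B x_t
    outputs.append(h @ C)                  # y_t = C h_t
y = np.stack(outputs)                      # shape (T, d); memory stays O(N) for any T
```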

Three Key Design Elements

State Size: The hidden state dimension N (often called the state size or expansion factor) can be many times larger than the input dimension, allowing the model to retain more information than classic RNNs such as LSTM or GRU.

State Expressivity: Early linear-time-invariant SSMs used a fixed recurrence h_t = A h_{t-1} + B x_t. Modern selective SSMs, exemplified by Mamba, employ data-dependent, time-varying transition matrices, giving the recurrence much higher expressive power and linking closely to gated RNN mechanisms.
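A rough sketch of the distinction, assuming a hypothetical parameterization (the projections W_a, W_b and the sigmoid gate below are illustrative, not Mamba's actual discretization): the only change from the fixed recurrence is that A and B become functions of the current input.

```python
import numpy as np

# Sketch of a selective (input-dependent) recurrence. W_a and W_b are hypothetical
# projections; real selective SSMs such as Mamba use a specific discretization instead.
d, N, T = 4, 64, 128
rng = np.random.default_rng(0)
W_a = rng.normal(scale=0.1, size=(d, N))   # controls a data-dependent decay
W_b = rng.normal(scale=0.1, size=(d, N))   # controls a data-dependent write

x = rng.normal(size=(T, d))
h = np.zeros(N)
for t in range(T):
    a_t = 1.0 / (1.0 + np.exp(-(x[t] @ W_a)))  # per-step decay in (0, 1), like an RNN gate
    b_t = x[t] @ W_b                            # per-step input contribution
    h = a_t * h + b_t                           # h_t = A(x_t) h_{t-1} + B(x_t) x_t
```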

Training Efficiency: Expanding the state introduces computational challenges. Mamba overcomes them by carefully parameterizing the recurrence and applying parallel scan algorithms, achieving practical GPU/TPU efficiency.
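One way to see why such a recurrence can be trained in parallel: because each step is linear in h, consecutive steps compose into a single step of the same form, so a parallel (associative) scan can evaluate the whole sequence in logarithmic depth. The sketch below only verifies the associative operator against the sequential loop; the values are illustrative.

```python
import numpy as np

def combine(left, right):
    # Compose two affine maps h -> a*h + b; this operator is associative,
    # which is what lets a parallel scan split the work across the sequence.
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

T, N = 8, 16
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=(T, N))   # per-step decays (diagonal A_t)
b = rng.normal(size=(T, N))              # per-step inputs (B_t x_t)

# Sequential recurrence.
h = np.zeros(N)
for t in range(T):
    h = a[t] * h + b[t]

# Same result by folding the associative operator (a parallel scan applies it in a tree).
acc = (a[0], b[0])
for t in range(1, T):
    acc = combine(acc, (a[t], b[t]))
print(np.allclose(h, acc[1]))  # True
```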

Common Technical Themes in Recent Models

Parallelism: Implementations aim to exploit GPU/TPU parallelism, typically using matrix multiplications as the dominant operation.

Memory Management: The expanded state is too large to materialize in full; Mamba leverages detailed knowledge of the GPU memory hierarchy to keep memory usage bounded.

Linearity: Maintaining a linear relationship with the input x_t is crucial for both computational efficiency and model optimization.

Systematic Integration in Mamba

The three technical ingredients—linear attention-style state expansion, gated-RNN-inspired selective updates, and parallel scan algorithms—are not novel individually, but their combination yields language-model performance comparable to Transformers.

Modern Recurrent Model Landscape

Models such as RWKV, xLSTM, Griffin, and linear‑attention variants (GLA, Gated DeltaNet) share the same SSM core: a matrix‑based state, a data‑dependent transition, and efficient parallel training. Many recent works treat these as a unified family of “modern recurrent models.”

Analogy: Brain vs. Database

Transformers cache every token, acting like a database that stores each observation for later lookup. In contrast, SSMs compress the entire history into a fixed‑size hidden state, resembling a brain that processes information online with limited memory.
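A back-of-the-envelope calculation makes the analogy concrete. The numbers below are assumptions chosen purely for illustration (a 32-layer model with d_model = 4096, fp16 values, and an assumed SSM state expansion of 16), not measurements from any specific model:

```python
# Hypothetical sizes, for illustration only.
d_model, n_layers, bytes_per_val = 4096, 32, 2   # fp16
state_expansion = 16                              # assumed SSM expansion factor

for seq_len in (10_000, 100_000, 1_000_000):
    kv_cache = 2 * n_layers * seq_len * d_model * bytes_per_val       # keys + values, grows with context
    ssm_state = n_layers * state_expansion * d_model * bytes_per_val  # fixed, independent of context
    print(f"context {seq_len:>9,}: KV cache {kv_cache / 1e9:6.1f} GB, SSM state {ssm_state / 1e6:.1f} MB")
```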

Tokenization Debate

Gu provocatively claimed "tokens are bullshit," arguing that tokenization is a patch for Transformer limitations. Removing tokenization (feeding raw bytes) forces Transformers to spend more FLOPs, yet their performance often degrades relative to SSMs, which handle raw sequences efficiently.
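A small illustration of why raw bytes cost Transformers more, assuming a rough average of four bytes per BPE token for English text (the real ratio depends on the tokenizer and the data):

```python
# Rough illustration: raw bytes give a sequence ~4x longer than BPE tokens
# (assumed average of 4 bytes/token, for illustration only).
bytes_per_token = 4
n_tokens = 8_192
n_bytes = n_tokens * bytes_per_token

# Sequence-mixing cost scales quadratically for attention, linearly for an SSM.
print((n_bytes ** 2) / (n_tokens ** 2))   # 16.0x more attention work on raw bytes
print(n_bytes / n_tokens)                 # 4.0x more SSM work on raw bytes
```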

Empirical Findings

When FLOPs are matched, SSMs consistently achieve lower perplexity than tokenized Transformers.

Even with substantially more compute, token‑free Transformers lag behind SSMs.

In DNA language‑modeling tasks, Mamba outperforms Transformers without special tuning.

Inductive Bias Perspective

Transformers exhibit a strong bias toward attending to every individual token, which can be viewed as a hard attention bias. SSMs, by compressing the sequence, embody a different inductive bias that can be more suitable for long‑range, low‑semantic‑density data.

Scaling Laws and Efficiency

Both architectures follow scaling laws, but the efficiency gap widens as sequence length grows. The article suggests that the optimal ratio of SSM to attention layers lies roughly between 3:1 and 10:1 for many tasks.
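A rough sketch of where that gap comes from, using illustrative values (d_model = 4096, SSM state expansion N = 16) and counting only per-layer sequence-mixing work:

```python
# Illustrative per-layer sequence-mixing cost as context length L grows.
d_model, state_expansion = 4096, 16   # assumed values, for illustration only

for L in (1_000, 10_000, 100_000, 1_000_000):
    attn = 2 * L * L * d_model                # ~O(L^2 d): attention scores + value mixing
    ssm = 2 * L * d_model * state_expansion   # ~O(L d N): fixed-size state update per step
    print(f"L={L:>9,}: attention {attn:.2e} FLOPs vs SSM {ssm:.2e} FLOPs (ratio {attn / ssm:,.0f}x)")
```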

Conclusion

The analysis concludes that while Transformers remain powerful, they are not the ultimate solution; SSM‑based models offer compelling advantages in memory efficiency, scaling behavior, and robustness to noisy or un‑tokenized data, warranting further research into hybrid designs.

Tags: Transformer, tokenization, scaling laws, Mamba, state space models

Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.