Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

This article reviews the emerging post‑Transformer research landscape, covering linear state‑space models, efficient attention approximations, MLP/convolution/RNN hybrids, and sparse and causal attention mechanisms, and it outlines future trends that may complement or replace the classic Transformer architecture for ultra‑long sequences.


Introduction

Technical progress never stands still: even the dominant Transformer architecture is now under intense scrutiny for its limitations in long‑sequence processing and computational efficiency. Since 2021, the community has explored a variety of "Post‑Transformer" techniques to overcome these bottlenecks.

1. Linear State‑Space Models (SSM) and Variants

Core idea: Inspired by control theory, SSMs map a sequence onto a hidden state that evolves according to linear differential equations, with nonlinear terms applied when producing the output. Unlike global attention, an SSM keeps a fixed‑dimensional hidden state, giving in‑principle unbounded‑range dependency modeling at linear time complexity.
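To make the recurrence concrete, here is a minimal NumPy sketch of a discretized SSM update; the variable names and toy values are illustrative assumptions, and production models such as S4 or Mamba use structured matrices and fused kernels rather than this naive loop.

```python
import numpy as np

# Discretized SSM recurrence: x_k = A_bar x_{k-1} + B_bar u_k ;  y_k = C x_k
def ssm_scan(A_bar, B_bar, C, u):
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:                    # one update per token: O(L) time,
        x = A_bar @ x + B_bar * u_k  # O(d_state) memory regardless of L
        ys.append(C @ x)
    return np.array(ys)

# Toy usage: a stable 4-dimensional state driven by a 1-D input signal.
A_bar = 0.9 * np.eye(4)
B_bar = np.ones(4)
C = np.ones(4) / 4
y = ssm_scan(A_bar, B_bar, C, np.sin(np.linspace(0, 6, 100)))
```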

Early work such as HiPPO optimized hidden‑state updates through orthogonal‑polynomial projections. The S4 model (2021) parameterized the state matrix as diagonal plus low‑rank and evaluated the sequence map as a frequency‑domain convolution, achieving strong results on long‑sequence benchmarks. Simpler variants such as DSS approximate S4's performance using purely diagonal state matrices.
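Because the recurrence is linear and time‑invariant, it can be unrolled into a single long convolution, which is the view S4 exploits; this toy sketch computes the kernel naively, whereas S4 evaluates it in the frequency domain with FFTs.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    # Unrolled kernel K = (C B, C A B, C A^2 B, ..., C A^{L-1} B).
    K, x = [], B_bar.copy()
    for _ in range(L):
        K.append(C @ x)
        x = A_bar @ x
    return np.array(K)

def ssm_as_conv(A_bar, B_bar, C, u):
    # Same outputs as running the recurrence, expressed as one causal
    # convolution; S4 performs this step via FFT in O(L log L) time.
    K = ssm_kernel(A_bar, B_bar, C, len(u))
    return np.convolve(u, K)[:len(u)]
```

Both views compute identical outputs: the recurrent form suits stepwise generation, while the convolutional form parallelizes well during training.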

The Mamba model (selective SSM, 2023) made the state‑update matrices input‑dependent, enabling selective remembering and forgetting akin to gating, and replaced the frequency‑domain convolution with a hardware‑friendly parallel scan. Mamba‑3B outperforms a Transformer of the same size, approaches the performance of one with twice the parameters, and delivers roughly 5× higher inference throughput while scaling to million‑token sequences.
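A deliberately simplified, sequential sketch of the selective update follows; the shapes and constants are hypothetical, and the real Mamba kernel uses a zero‑order‑hold discretization plus a hardware‑aware parallel scan rather than a Python loop.

```python
import numpy as np

def selective_scan(u, delta, A, B, C):
    # u: (L,) inputs; delta: (L,) input-dependent step sizes;
    # A: (N,) diagonal transition; B, C: (L, N) input-dependent projections.
    L, N = B.shape
    x = np.zeros(N)
    ys = np.empty(L)
    for k in range(L):
        A_bar = np.exp(delta[k] * A)            # per-input decay: "forgetting"
        x = A_bar * x + delta[k] * B[k] * u[k]  # Euler-style simplification
        ys[k] = C[k] @ x
    return ys

# Toy usage with randomly chosen parameters.
L, N = 64, 8
u = np.random.randn(L)
delta = np.abs(np.random.randn(L)) * 0.1
A = -np.ones(N)                                 # negative => stable, decaying states
B, C = np.random.randn(L, N), np.random.randn(L, N)
y = selective_scan(u, delta, A, B, C)
```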

Performance & Applications: Mamba demonstrates that attention‑free architectures can match or exceed Transformers on language modeling, especially for ultra‑long sequences where Transformers struggle.

Limitations: Implementing SSMs requires complex mathematics, custom CUDA kernels, and careful initialization (e.g., HiPPO matrices). Training can suffer from gradient explosion or convergence difficulties, and their suitability for discrete logical reasoning remains unproven.

[Figure: Mamba development history]

2. Efficient Attention Alternatives (Linear/Approximate Attention)

Core idea: Reduce the O(n²) cost of standard attention by reformulating it as a kernel, low‑rank, or sparse approximation, achieving linear or sub‑linear complexity.

Kernel‑based linear attention: Linear Transformer (2020) and Performer (2021) rewrite softmax attention as a kernel function. Performer's FAVOR+ method uses random feature maps to approximate the softmax inner product, yielding an unbiased, linear‑time estimate of attention whose error is small with high probability.
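A sketch of the kernel trick with a simple positive feature map (ReLU(x)+1 here for brevity; Linear Transformer uses elu(x)+1, and Performer uses random features instead):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1.0):
    # softmax(Q K^T) V is replaced by phi(Q) (phi(K)^T V) / normalizer,
    # reordering the matmuls from O(L^2 d) down to O(L d^2).
    Qp, Kp = phi(Q), phi(K)               # (L, d)
    KV = Kp.T @ V                         # (d, d_v), shared by all queries
    Z = Kp.sum(axis=0)                    # (d,) normalizer statistics
    return (Qp @ KV) / (Qp @ Z)[:, None]

L, d = 512, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = linear_attention(Q, K, V)           # shape (512, 64)
```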

Low‑rank / sparse approximations: Linformer projects the keys and values down to a fixed length k along the sequence dimension, reducing complexity to O(Nk). Reformer applies locality‑sensitive hashing to bucket similar queries and keys, achieving O(N log N) complexity.
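A sketch of Linformer's idea, with random placeholder matrices standing in for the learned projections:

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    # E, F: (k, L) projections that compress keys/values along the
    # sequence axis, so the score matrix is (L, k) instead of (L, L).
    d = Q.shape[-1]
    scores = Q @ (E @ K).T / np.sqrt(d)                  # (L, k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # row-wise softmax
    return w @ (F @ V)                                   # (L, d_v)

L, k, d = 1024, 64, 32
Q, K, V = (np.random.randn(L, d) for _ in range(3))
E = F = np.random.randn(k, L) / np.sqrt(L)               # placeholder weights
out = linformer_attention(Q, K, V, E, F)
```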

These methods dramatically lower memory and compute for long sequences (e.g., Performer handles >8k tokens, Linformer reduces GPU memory while preserving accuracy). Early linear attention sometimes lags behind full attention on complex language tasks, but later improvements (local bias, unique query embeddings) narrow the gap.

Performance & Applications: Performer improves protein‑sequence modeling accuracy by 5% and supports 8192‑token inputs. Linear Transformer trains three times faster than a standard Transformer on CIFAR‑10 image generation, with autoregressive generation up to 4000× faster at comparable quality. Recent advances (e.g., LoLA) bring linear attention closer to softmax performance, and libraries such as xFormers now ship these efficient modules.

3. MLP / Convolution / RNN Hybrid Architectures

Core idea: Replace explicit attention with MLPs, convolutions, or recurrent units, designing structures that enable cross‑position interaction.

Pure MLP: gMLP (2021) replaces attention with a spatial gating unit, achieving performance comparable to Transformers on image classification and language modeling.
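A sketch of that gating (gMLP's spatial gating unit, minus normalization): split the channels, mix one half across positions with a learned length‑by‑length projection, and use the result to gate the other half. The near‑identity initialization below follows the paper's recipe, while the toy sizes are assumptions.

```python
import numpy as np

def spatial_gating_unit(X, W_spatial, b_spatial):
    u, v = np.split(X, 2, axis=-1)      # (L, d/2) each
    v = W_spatial @ v + b_spatial       # token mixing across positions
    return u * v                        # elementwise gate, no attention

L, d = 128, 64
X = np.random.randn(L, d)
W = 0.01 * np.random.randn(L, L)        # spatial weights start near zero...
b = np.ones((L, 1))                     # ...so the gate starts near identity
out = spatial_gating_unit(X, W, b)      # shape (128, 32)
```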

Convolutional hybrids: ConvMixer (2022) combines patching with depth‑wise separable convolutions, rivaling ViT on ImageNet.

RNN revival: RWKV (2023) merges the advantages of Transformers and RNNs, allowing parallel training and constant‑memory inference, and has been scaled to 14 B parameters.

These approaches perform well on specific tasks (e.g., image classification) but may struggle with complex logical reasoning or multimodal alignment, where attention’s global context is beneficial. Convolutional models face limitations on extremely long texts due to limited receptive fields, while RNN‑style models require sophisticated training tricks to avoid gradient issues.

Overall insight: Attention is not the only viable design; hybrid architectures provide diversity and can excel in niche scenarios, guiding future model design.

4. Sparse Attention & Causal Modeling (RetNet, Long‑Range Recursion)

Core idea: Reduce attention density or adopt recursive processing to improve efficiency for autoregressive generation.

Sparse local attention: Longformer uses sliding windows with a few global tokens, achieving linear complexity. BigBird combines windows, random connections, and global tokens, theoretically approximating full attention while maintaining high accuracy on long‑text tasks.
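A sketch of such a sparsity pattern as an explicit boolean mask (real implementations never materialize the full L×L matrix; banded kernels keep the cost linear):

```python
import numpy as np

def sliding_window_mask(L, window, global_idx=()):
    # Each token attends to +/- `window` neighbors; tokens in `global_idx`
    # attend everywhere and are attended to by everyone (Longformer-style).
    mask = np.zeros((L, L), dtype=bool)
    for i in range(L):
        mask[i, max(0, i - window):min(L, i + window + 1)] = True
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = sliding_window_mask(L=16, window=2, global_idx=(0,))
# True entries per row stay O(window), so cost grows linearly with L.
```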

Hierarchical attention: Transformer‑XL introduces a memory mechanism that carries hidden states across segments, extending context beyond a fixed window. LongNet (2023) proposes dilated attention, using exponentially increasing spacing to build a pyramid‑like receptive field that theoretically scales to sequences of a billion tokens.
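A sketch of the dilated pattern for a single query position, with segment sizes and dilation rates chosen purely for illustration:

```python
def dilated_key_positions(i, segments, rates):
    # Attend densely nearby and exponentially more sparsely farther away:
    # within each lookback segment, keep every r-th position ending at i.
    positions = set()
    for seg, r in zip(segments, rates):
        lo = max(0, i - seg)
        positions.update(range(lo + (i - lo) % r, i + 1, r))
    return sorted(positions)

# e.g. last 16 tokens densely, last 64 every 4th, last 256 every 16th
print(dilated_key_positions(100, segments=[16, 64, 256], rates=[1, 4, 16]))
```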

RetNet: The Retentive Network from Microsoft Research (2023) replaces attention with a retention mechanism whose recurrent form keeps an exponentially decaying state vector of recent information, offering parallel training alongside high‑throughput recurrent inference at accuracy comparable to Transformers.
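A sketch of a single retention head in its recurrent form, omitting RetNet's rotation and normalization details; `gamma` is the decay, chosen here as an arbitrary constant:

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma=0.95):
    # Running state S accumulates outer(K_t, V_t) with exponential decay,
    # so each generation step costs O(d^2) no matter how long the context.
    L, d = Q.shape
    S = np.zeros((d, V.shape[-1]))
    ys = np.empty((L, V.shape[-1]))
    for t in range(L):
        S = gamma * S + np.outer(K[t], V[t])
        ys[t] = Q[t] @ S
    return ys

L, d = 256, 32
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = retention_recurrent(Q, K, V)        # shape (256, 32)
```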

State‑space for causal modeling: SSMs can be adapted for autoregressive generation, updating a compact hidden state each step. Models like Mega (2023) simplify S4 to a real‑valued exponential moving average and combine it with approximate attention, showing strong language‑modeling results.
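A scalar sketch of that moving average (Mega actually uses a learned, multi‑dimensional damped EMA; `alpha` and `delta` are placeholder constants):

```python
import numpy as np

def damped_ema(u, alpha=0.3, delta=0.9):
    # h_t = alpha * u_t + (1 - alpha * delta) * h_{t-1}: a smooth,
    # exponentially decaying summary of the past, updated in O(1) per step.
    h, out = 0.0, np.empty(len(u))
    for t, u_t in enumerate(u):
        h = alpha * u_t + (1.0 - alpha * delta) * h
        out[t] = h
    return out

smoothed = damped_ema(np.random.randn(1000))
```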

Performance & Limitations: Sparse and recurrent models often outperform Transformers on ultra‑long tasks, but may require task‑specific pattern design, can incur slight accuracy loss, and lack mature tooling. Large‑scale benchmarks (e.g., trillion‑token models) are still scarce, and training costs remain high.

5. Outlook: Post‑Transformer Trends

Transformers will coexist with emerging architectures. Anticipated directions include:

Mixed architectures that fuse Transformers with SSMs, RNNs, or other modules to leverage complementary strengths.

Ultra‑long context handling (hundreds of thousands to millions of tokens) via sparse attention, chunked processing, or retrieval‑augmented generation.

Hardware‑friendly designs that optimize memory locality and parallelism (e.g., FlashAttention, specialized kernels for SSM/RNN).

Theoretically grounded, more interpretable models (state‑space dynamics, explicit decay mechanisms, and memory modules) that make behavior easier to diagnose and control.

The future is likely a diverse ecosystem where attention‑based models are no longer the sole paradigm, but rather one of many specialized solutions.
