
Beyond Transformers: SubQ Achieves 12‑Million‑Token Context at Just 5% of Opus Cost

The SubQ model introduces Subquadratic Sparse Attention (SSA), a content‑dependent routing mechanism that reduces attention complexity from quadratic to linear. This enables a 12‑million‑token context window at a reported 5% of Opus's cost, with a 52.2× prefill speedup measured at 1M tokens. Results are reported on the MRCR v2, RULER, and SWE‑Bench benchmarks.

Machine Heart

May 6, 2026

Problem: Scaling Attention for Long Contexts

Modern large language models rely on dense attention, where each token compares with every other token, leading to quadratic computational cost that becomes prohibitive at hundreds of thousands or millions of tokens.
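To make the quadratic growth concrete, the short sketch below counts query-key score computations for full dense attention at a few context lengths. The function name is illustrative, and the count ignores causal masking, heads, and layers, which change constants but not the scaling.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Query-key score computations for full (non-causal) dense attention."""
    return n_tokens * n_tokens


# Context lengths chosen for illustration; "K" and "M" are read as 1,000 and 1,000,000.
for n in (128_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens -> {dense_attention_pairs(n):.2e} pairwise scores")
```

Doubling the context quadruples the work, which is why the cost that is tolerable at 128K tokens becomes prohibitive at millions.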

Enterprise AI workloads—code bases, contracts, knowledge bases, spreadsheets, and long‑running agent conversations—require reliable reasoning over such long contexts, but dense attention makes this infeasible.

Existing Mitigations and Their Limits

Approaches that chunk, retrieve, summarize, or orchestrate documents reduce the effective context but introduce new failure modes: loss of positional information, hierarchical structure, and cross‑reference cues, as well as error accumulation in multi‑step agent pipelines.

These scaffolds improve usability without changing the underlying quadratic scaling of attention.

SSA: Subquadratic Sparse Attention

SubQ proposes SSA (Subquadratic Sparse Attention), a content‑dependent selection mechanism that routes attention only to positions deemed informative for each query, abandoning the assumption that any token pair could be important.

Linear compute and memory scaling: Cost depends on the number of selected positions, not the full sequence length.

Content‑based routing: The model decides where to look based on semantic relevance, allowing retrieval from any location.

Sparse retrieval from arbitrary positions: Unlike chunking or compression, SSA preserves the ability to recover specific information from distant tokens.

In practice, SSA reduces the amount of attention computation dramatically, yielding substantial speed gains.
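The general idea of content-based sparse attention can be sketched in a few lines of plain Python. This is not SubQ's implementation: the article does not say how SSA selects positions, so the block-mean router below (`block_summaries`, `route`) is a hypothetical stand-in, and all names are illustrative. Note that this toy router still scores every block per query, so it is not linear end to end; it only shows that the attention step itself touches just the selected positions, wherever they are in the sequence.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Scaled dot-product attention for one query, restricted to `positions`."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[p]) * scale for p in positions])
    dim = len(values[0])
    return [sum(w * values[p][j] for w, p in zip(weights, positions)) for j in range(dim)]


def block_summaries(keys, block_size):
    """Mean key per block, computed once and shared by all queries (assumed router input)."""
    out = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        out.append([sum(col) / len(block) for col in zip(*block)])
    return out


def route(query, summaries, block_size, n_keys, top_blocks):
    """Hypothetical router: keep the blocks whose mean key best matches the query."""
    ranked = sorted(range(len(summaries)), key=lambda b: dot(query, summaries[b]), reverse=True)
    chosen = sorted(ranked[:top_blocks])
    return [p for b in chosen for p in range(b * block_size, min((b + 1) * block_size, n_keys))]


# Toy data: six keys pointing one way, then two distant keys pointing another way.
keys = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2
values = [[float(i)] for i in range(len(keys))]
query = [0.0, 5.0]

summaries = block_summaries(keys, block_size=2)
selected = route(query, summaries, block_size=2, n_keys=len(keys), top_blocks=1)
sparse_out = attend(query, keys, values, selected)
dense_out = attend(query, keys, values, list(range(len(keys))))
print(selected, sparse_out, dense_out)
```

Here the query is routed to the last block even though it sits at the far end of the sequence, and the attention step computes two scores instead of eight. The sparse and dense outputs differ slightly because the dense version also gives small weights to the unselected positions.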

Performance Gains

On a B200 GPU with a 128K token sequence, SSA achieves a 7.2× input‑processing speedup over FlashAttention‑2. Speedups increase with context length: 13.2× at 256K, 23.0× at 512K, and 52.2× at 1M tokens.
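One way to read these figures is to check how the speedup grows as the context doubles. If the dense baseline scales quadratically and SSA roughly linearly, the speedup itself should grow about linearly with context length, giving a log-log slope near 1. The sketch below runs that check on the four reported numbers only; it is a back-of-the-envelope reading, not a measurement.

```python
import math

# Figures quoted in the text: context length (tokens) -> speedup over FlashAttention-2.
# "K" and "M" are read here as 1,000 and 1,000,000 (an assumption).
reported = {128_000: 7.2, 256_000: 13.2, 512_000: 23.0, 1_000_000: 52.2}

lengths = sorted(reported)
for shorter, longer in zip(lengths, lengths[1:]):
    growth = reported[longer] / reported[shorter]
    print(f"{shorter:>9,} -> {longer:>9,} tokens: speedup grows {growth:.2f}x")

# Log-log slope between the shortest and longest reported contexts.
slope = math.log(reported[lengths[-1]] / reported[lengths[0]]) / math.log(lengths[-1] / lengths[0])
print(f"log-log slope of speedup vs. context length: {slope:.2f}")
```

The slope comes out a little under 1, which is consistent with the claim that SSA's cost grows roughly linearly while the dense baseline grows quadratically.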

Compared to dense attention, SSA lowers FLOPs by 62.5× at the 1M token scale and provides a 52.2× pre‑fill acceleration, making long‑context serving practical.
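The two 1M-token figures can be related with simple arithmetic, assuming both describe the same configuration and the same dense baseline (the article does not state this explicitly). The variable names below are illustrative.

```python
# Figures quoted in the text for the 1M-token scale.
flop_reduction = 62.5     # dense FLOPs divided by SSA FLOPs
prefill_speedup = 52.2    # dense prefill time divided by SSA prefill time

# Share of the dense attention FLOPs that SSA still performs.
remaining_flop_fraction = 1.0 / flop_reduction

# How much of the theoretical FLOP saving shows up as wall-clock speedup.
realized_fraction = prefill_speedup / flop_reduction

print(f"SSA performs {remaining_flop_fraction:.1%} of the dense FLOPs")
print(f"wall-clock speedup is {realized_fraction:.1%} of the FLOP reduction")
```

Under that assumption, most of the FLOP saving carries through to wall-clock time, with the remainder plausibly going to overheads such as selection and memory access that the article does not break down.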

Training Pipeline

SubQ’s training consists of three stages:

Pre‑training: Builds base language modeling ability and long‑context representations for the selection mechanism.

Supervised fine‑tuning: Aligns model behavior with enterprise tasks such as instruction following, structured reasoning, and code generation.

Reinforcement learning: Targets failure modes observed in long‑context retrieval, encouraging the model to attend to high‑information, cross‑referenced spans.

The RL stage uses data emphasizing dense information, cross‑reference structure, and long‑range routing, teaching the model to focus on critical evidence regardless of its position.

Evaluation

SubQ is evaluated on two dimensions:

Deployment viability: Computational reduction and wall‑clock speed.

Retrieval capability: Benchmarks RULER and MRCR v2.

On MRCR v2, SubQ scores 65.9%, behind Claude Opus 4.6 (78) but well ahead of GPT‑5.4 (39) and Gemini 3.1 Pro (23). The spread highlights the gap between nominal context window size and functional context utilization.

RULER tests multi‑hop retrieval, information aggregation, variable tracking, and selective filtering, crucial for enterprise workflows where early missed references cascade into downstream errors.

SWE‑Bench Verified measures end‑to‑end software engineering ability on real GitHub issues, confirming SubQ’s competence in code understanding, bug localization, and patch generation.

Key Takeaways

SSA changes attention scaling from quadratic to linear, enabling practical million‑token contexts at far lower cost: only 5% of Opus’s expense in the reported experiments.

The combination of a novel sparse attention mechanism, a three‑stage training regimen, and strong benchmark performance positions SubQ as a significant step forward for long‑context LLMs.

