SubQ Beats Transformers: 12‑Million‑Token Context Model at Only 5% of Opus Cost

The article analyzes SubQ, a new LLM architecture that uses Subquadratic Sparse Attention (SSA) to reach a 12‑million‑token context window with linear compute scaling. It reports a wall‑clock prefill speedup of up to 52× and a cost of about 5% of the Opus baseline at 1 M tokens, while matching dense‑attention performance on long‑context benchmarks.

Data Party THU

Problem: quadratic cost of dense attention

Standard dense attention computes a pairwise similarity between every query token and every key token. The resulting all‑pairs operation has O(N²) time and memory complexity, where N is the sequence length. For contexts of hundreds of thousands to millions of tokens, the FLOP count and memory usage become prohibitive. Most of the computed attention weights are also near zero, so much of that computation is wasted.
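To make the quadratic growth concrete, here is a rough back‑of‑the‑envelope sketch. The head dimension of 128 and 2 bytes per score are illustrative assumptions, not values from the article, and the memory figure applies only if the full score matrix is materialized.

```python
def dense_attention_cost(n_tokens: int, head_dim: int = 128, bytes_per_score: int = 2):
    """Rough per-head, per-layer cost of dense (non-causal) attention.

    head_dim and bytes_per_score are illustrative assumptions, not SubQ values.
    """
    # One similarity score per (query, key) pair.
    score_entries = n_tokens * n_tokens
    # Q @ K^T and the weighted sum over V each cost about 2 * N^2 * d FLOPs.
    flops = 4 * score_entries * head_dim
    # Memory needed if the full N x N score matrix were stored.
    score_bytes = score_entries * bytes_per_score
    return {"score_entries": score_entries, "flops": flops, "score_bytes": score_bytes}


for n in (128_000, 1_000_000):
    cost = dense_attention_cost(n)
    print(f"N={n:>9,}: {cost['flops']:.2e} FLOPs, "
          f"{cost['score_bytes'] / 1e9:,.0f} GB of scores")
```

Doubling N quadruples every quantity returned here, which is why dense attention becomes impractical at million‑token scale.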

Subquadratic Sparse Attention (SSA)

SSA replaces the dense all‑pairs assumption with a content‑dependent selection step. For each query, the model first predicts which positions are informative, then computes attention only over that subset. This yields three core properties:

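Linear compute and memory scaling: the per‑query cost grows with the number of selected positions rather than the full sequence length, so the total grows linearly with N as long as each query selects a bounded number of positions.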

Content‑based routing: the selection is driven by semantic relevance, so important information can be retrieved regardless of its absolute position.

Sparse retrieval from arbitrary locations: unlike chunking or compression, SSA can attend to any distant token that the selector deems relevant.
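The select‑then‑attend idea can be sketched as follows. The article does not describe how SubQ's selector is built, so `select_positions` below is a hypothetical stand‑in: it ranks keys by dot product, which scans every key and is therefore not subquadratic itself. Only the attention step over the k selected positions illustrates the saving; all function names and the toy data are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_positions(query, keys, k):
    """Hypothetical selector: indices of the k keys most similar to the query.

    NOTE: this stand-in scans all keys; a real subquadratic selector
    would need a cheaper, learned routing step.
    """
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return ranked[:k]


def sparse_attention(query, keys, values, k):
    """Softmax attention for one query, restricted to k selected positions."""
    idx = select_positions(query, keys, k)
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[i]) * scale for i in idx]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, idx)) / total
        for d in range(dim)
    ]
    return output, idx


# Toy data: 1,000 positions, of which only 3 and 900 match the query.
keys = [[0.0, 1.0] for _ in range(1000)]
keys[3] = [1.0, 0.0]
keys[900] = [1.0, 0.0]
values = [[float(i)] for i in range(1000)]
query = [5.0, 0.0]

output, picked = sparse_attention(query, keys, values, k=2)
print(sorted(picked), output)  # prints [3, 900] [451.5]
```

The two relevant positions are far apart, yet both are retrieved, and the softmax runs over 2 scores instead of 1,000. This is the behavior the third property describes, in contrast to chunking, which would only see positions inside a local window.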

Training pipeline

To make the selector reliable, the authors train SubQ in three stages:

Pre‑training: a standard language‑modeling objective, under which the model also learns the long‑context representations that the selection mechanism needs.

Supervised fine‑tuning: instruction‑following, structured reasoning and code‑generation data align the model with enterprise workloads.

Reinforcement learning: a reward that emphasizes high‑information‑density, cross‑reference passages forces the selector to learn routing across large spans and mitigates failure modes where the model relies only on nearby context.

Evaluation methodology

The authors evaluate two dimensions:

Deployment viability: the reduction in attention FLOPs and the wall‑clock prefill speedup on a B200 GPU.

Retrieval capability: performance on the RULER benchmark (multi‑hop retrieval, information aggregation) and the MRCR v2 long‑context retrieval benchmark.

Performance results

On MRCR v2 SubQ achieves a score of 65.9 %, comparable to Claude Opus 4.6 and substantially higher than GPT‑5.4 (39 %) and Gemini 3.1 Pro (23 %).

Attention FLOPs at 1 M tokens are reduced by 62.5× relative to dense attention. Wall‑clock pre‑fill speedups on a B200 GPU are:

128 K tokens: 7.2×

256 K tokens: 13.2×

512 K tokens: 23.0×

1 M tokens: 52.2×
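As a sanity check on the linear‑scaling claim, the reported speedups can be fitted to a power law. If dense prefill were purely quadratic and SSA purely linear, the speedup would grow roughly in proportion to N, giving an exponent near 1. The sketch below assumes "128 K" means 128,000 tokens and "1 M" means 1,000,000; the article does not say whether binary units were intended. Wall‑clock time also includes non‑attention work, so the exponents are only indicative.

```python
import math

# Wall-clock prefill speedups reported in the article (tokens -> speedup).
# Token counts are assumed to be decimal (128 K = 128,000).
reported = {128_000: 7.2, 256_000: 13.2, 512_000: 23.0, 1_000_000: 52.2}


def growth_exponents(speedups):
    """Estimate p in speedup ~ N**p between consecutive context lengths."""
    points = sorted(speedups.items())
    result = []
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        exponent = math.log(s1 / s0) / math.log(n1 / n0)
        result.append((n0, n1, exponent))
    return result


for n0, n1, p in growth_exponents(reported):
    print(f"{n0:>9,} -> {n1:>9,}: speedup grows as N^{p:.2f}")
```

All three intervals give an exponent between about 0.8 and 1.2, which is in line with a roughly linear growth in speedup over this range.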

In the 1 M‑token regime the cost is ≈5 % of the Opus baseline. Compared with FlashAttention‑2, SSA provides the same or greater speedup; FlashAttention‑3 adds no further benefit on the same hardware.

SWE‑Bench Verified, which measures end‑to‑end software‑engineering ability on real GitHub issues, confirms that SubQ can locate bugs, reason about constraints across a codebase, and generate correct patches.

Key insight

SSA eliminates the wasteful quadratic computation of dense attention by learning to attend only to semantically relevant positions, thereby delivering linear scaling while preserving the ability to retrieve information from any location in a long sequence.

Source: 机器之心
The original article runs about 5,000 characters, with a suggested reading time of 10 minutes.
A brand‑new attention pattern.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
