Cursor’s Composer 2 Beats Claude Opus 4.6 at Rock-Bottom Prices via a New Reinforcement-Learning Method
Cursor’s newly released Composer 2 model surpasses Claude Opus 4.6 on benchmarks such as Terminal‑Bench 2.0, offers dramatically lower token pricing, and achieves these gains by introducing a novel self‑summary reinforcement‑learning technique that compresses long‑context tasks while preserving critical information.
Composer 2 performance and pricing
Composer 2, Cursor’s programming-focused LLM, outperforms Claude Opus 4.6 on every benchmark Cursor evaluated, including Terminal-Bench 2.0 and SWE-bench Multilingual. On Terminal-Bench 2.0 it scores above Claude Opus 4.6 but below GPT-5.4, indicating a substantial capability increase.
Pricing (per million tokens):
Standard Composer 2 – input $0.50, output $2.50
Composer 2 Fast – input $1.50, output $7.50 (higher throughput)
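To put the rates in perspective, the sketch below computes the bill for a long agentic session at the published per-million-token prices; the 2M-input / 500k-output session size is invented purely for illustration.

```python
# Cost comparison at the published per-million-token rates.
# The session sizes in the example are made up for illustration.
PRICES = {  # USD per million tokens: (input, output)
    "composer-2":      (0.50, 2.50),
    "composer-2-fast": (1.50, 7.50),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one session at the listed rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: a session consuming 2M input tokens and producing 500k output tokens.
print(f"Composer 2:      ${session_cost('composer-2', 2_000_000, 500_000):.2f}")       # $2.25
print(f"Composer 2 Fast: ${session_cost('composer-2-fast', 2_000_000, 500_000):.2f}")  # $6.75
```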
Self‑summary reinforcement‑learning method
Cursor introduced a reinforcement-learning (RL) technique it calls “self-summary”. The model learns to generate concise summaries of its own intermediate context, and summary quality is folded into the reward signal: summaries that retain useful information increase the reward, while information loss incurs a penalty.
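Cursor has not published the exact objective, but a reward of the following general shape would match that description; the function, coefficients, and recall measure below are purely illustrative assumptions.

```python
# Illustrative shape of a self-summary reward; Cursor has not published
# Composer 2's actual objective, coefficients, or recall metric.
def self_summary_reward(task_reward: float,
                        key_fact_recall: float,   # fraction of needed facts retained, in [0, 1]
                        summary_tokens: int,
                        forget_weight: float = 1.0,
                        length_weight: float = 0.1) -> float:
    """Reward task success; penalize forgotten information and oversized summaries."""
    forgetting_penalty = forget_weight * (1.0 - key_fact_recall)
    length_penalty = length_weight * (summary_tokens / 1_000)
    return task_reward - forgetting_penalty - length_penalty

# E.g., a solved task with 95% recall and a 900-token summary:
# 1.0 - 0.05 - 0.09 = 0.86
print(self_summary_reward(1.0, 0.95, 900))
```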
During inference the model follows a loop:
1. Generate tokens until a fixed-length trigger is reached.
2. Insert a synthetic query asking the model to summarize the current context.
3. Give the model a drafting space in which to construct an optimal summary, then compress it.
4. Feed the compressed summary, together with state information (remaining tasks, previous summaries, planning state), back into step 1.
This mechanism is trained in, not a runtime hack: the RL objective explicitly rewards retaining useful information and penalizes forgetting. A minimal sketch of the full loop follows.
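The sketch below restates the four-step loop in code. Everything in it is an assumption for illustration: the `model` interface (`generate`, `is_done`), the 8k-token trigger, and the 1k-token summary budget are stand-ins, not Cursor’s actual implementation.

```python
# Sketch of the four-step self-summary inference loop under stated
# assumptions; the model interface and constants are hypothetical.
from dataclasses import dataclass, field

TRIGGER_TOKENS = 8_000   # assumed fixed-length trigger
SUMMARY_BUDGET = 1_000   # assumed compressed-summary size (~1k tokens)

@dataclass
class AgentState:
    remaining_tasks: list[str]
    previous_summaries: list[str] = field(default_factory=list)
    context: str = ""

def token_count(text: str) -> int:
    # Stand-in tokenizer (whitespace count); a real system would use
    # the model's own tokenizer.
    return len(text.split())

def self_summary_loop(model, state: AgentState, task: str) -> str:
    state.context = task
    while True:
        # Step 1: generate until the task completes or the trigger fires.
        state.context += model.generate(state.context)
        if model.is_done(state.context):
            return state.context
        if token_count(state.context) < TRIGGER_TOKENS:
            continue
        # Step 2: insert a synthetic query asking for a summary.
        # Step 3: give the model drafting space, then compress the draft.
        draft = model.generate(state.context + "\nPlease summarize the conversation.")
        summary = model.generate(f"Compress to <= {SUMMARY_BUDGET} tokens:\n{draft}")
        state.previous_summaries.append(summary)
        # Step 4: rebuild the context from the compressed summary plus
        # persistent state (remaining tasks, earlier summaries); back to step 1.
        state.context = "\n".join(
            ["Remaining tasks: " + "; ".join(state.remaining_tasks)]
            + [f"Summary {i + 1}: {s}" for i, s in enumerate(state.previous_summaries)]
        )
```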
Quantitative comparison with traditional summarization
Traditional summarization approaches to long-context programming tasks spend more than 5,000 tokens on a single task and still produce compressed outputs of 5,000+ tokens. Composer 2 triggers compression with a single prompt (e.g., “Please summarize the conversation”), and its compressed output averages ~1,000 tokens, about one-fifth the token usage. The reduction in token count correlates with an approximately 50% drop in compression-induced errors.
Long‑chain task demonstration: Doom on MIPS
Composer 2 was tasked with getting Doom running on a MIPS architecture, a benchmark that forces the model to modify code, compile, and debug iteratively. Over 170 interaction rounds the model compressed more than 100k tokens of intermediate data into ~1k-token summaries and produced a correct solution.
Future direction
Cursor has announced a forthcoming Composer 3, indicating continued rapid iteration of the self‑summary capability.