GPT-5.5 Launches Overnight, Beats Claude Opus 4.7 in Key Programming Benchmarks
OpenAI unveiled GPT-5.5 at 2 a.m., emphasizing autonomous task execution. Benchmark tables show it outperforming Claude Opus 4.7 in most programming and agentic tests while lagging on a few specialized metrics; the release also brings token‑efficiency gains, new research‑assistant capabilities, and updated pricing.
OpenAI announced the official release of GPT‑5.5 in the early hours of the morning, positioning the model as a "new class of intelligence for real work" that can accept messy, multi‑step tasks, plan, invoke tools, verify results, and handle ambiguous scenarios without premature termination.
The accompanying benchmark table compares GPT‑5.5, GPT‑5.5 Pro, Claude Opus 4.7, and Gemini 3.1 Pro across several tests. GPT‑5.5 leads on Terminal‑Bench 2.0 (82.7% vs 69.4% for Claude), GDPval (84.9% vs 80.3%), OSWorld‑Verified (78.7% vs 78.0%), BrowseComp (84.4% vs 79.3%), CyberGym (81.8% vs 73.1%), and several other metrics, while Claude retains an edge on FrontierMath Tier 4 (39.6% vs 35.4% for GPT‑5.5) and a few specialized tasks.
Claude Opus 4.7 outperforms GPT‑5.5 on SWE‑Bench Pro (64.3% vs 58.6%), MCP Atlas tool‑calling (79.1% vs 75.3%), Humanity's Last Exam (46.9% vs 41.4%), FinanceAgent v1.1 (64.4% vs 60.0%), and several long‑context retrieval benchmarks such as Graphwalks 256K.
Agentic coding is highlighted as a major selling point. Every's founder Dan Shipper reported that GPT‑5.5 reproduced a senior engineer's refactoring solution for a buggy application, a feat GPT‑4.4 could not achieve. MagicPath CEO Pietro Schirano demonstrated a 20‑minute, one‑shot merge of two heavily modified branches. An early NVIDIA test engineer described losing access to GPT‑5.5 as "like having an arm cut off," and Cursor CEO Michael Truell said:
"GPT‑5.5 is clearly smarter and more robust than 5.4, with stronger programming performance and more reliable tool use. Most importantly, it can persist on tasks longer without stopping early, which is crucial for Cursor users assigning complex long‑running tasks."
Despite its larger size, GPT‑5.5 maintains per‑token latency comparable to 5.4 while significantly reducing total token consumption on identical Codex tasks. OpenAI attributes this to a co‑designed hardware‑software stack with NVIDIA GB200/GB300 NVL72 systems. The model's own infrastructure optimizations, such as a custom GPU‑slice scheduling heuristic, have increased token‑generation speed by over 20%.
Research assistance capabilities are expanded with new benchmarks: GeneBench (25.0% for GPT‑5.5 vs 19.0% for 5.4) and BixBench (leading scores), and, running in a custom harness, GPT‑5.5 discovered a new proof about Ramsey numbers that was formally verified in the Lean proof assistant. Jackson Lab immunology professor Derya Unutmaz used GPT‑5.5 Pro to analyze a 62‑sample, ~28,000‑gene dataset, producing a detailed report that would have taken months manually. At Adam Mickiewicz University, professor Bartosz Naskręcki generated an algebraic‑geometry application in 11 minutes from a single prompt.
Pricing has increased: gpt‑5.5 costs 30 USD per million input tokens and comes with a 1M‑token context window; gpt‑5.5‑pro costs 180 USD per million output tokens. Batch and Flex modes are priced at half the standard rate, while Priority mode is 2.5× the standard rate. In Codex, GPT‑5.5's context window is 400K; Fast mode boosts generation speed by 1.5× at 2.5× the cost. OpenAI's stance is that, although the per‑token price is higher, the reduced token usage makes many tasks cheaper overall.
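OpenAI's "higher price per token, lower total cost" argument is straightforward arithmetic. The sketch below illustrates it with placeholder numbers: only the 30 USD input price from the article is a published figure, and all token counts and the other prices are hypothetical assumptions chosen for illustration.

```python
def task_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost in USD for one task; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# Hypothetical older model: cheaper per token, but a chattier agentic loop.
# Prices and token counts here are illustrative, not published figures.
old_cost = task_cost(200_000, 80_000, input_price=20, output_price=80)

# Hypothetical GPT-5.5 run: 30 USD/M input (from the article), an assumed
# output price, and markedly fewer tokens consumed on the same task.
new_cost = task_cost(120_000, 40_000, input_price=30, output_price=120)

print(f"old model: ${old_cost:.2f}, new model: ${new_cost:.2f}")
```

With these assumptions the dearer model still wins on total cost (8.40 USD vs 10.40 USD), which is the shape of the trade-off OpenAI is claiming; whether it holds for a given workload depends entirely on how much the token count actually drops.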
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
