GPT-5.5 Beats GPT-5.4, Yet Opus 4.7 Still Tops Coding – Price Doubles

OpenAI’s GPT-5.5 surpasses its predecessor on most benchmarks, with lower token usage and stronger agentic, research, and coding capabilities. It still trails Anthropic’s Claude Opus 4.7 on the SWE‑Bench Pro coding test, however, and its API price has doubled to $5/$30 per million input/output tokens.

ShiZhen AI

Background and Release Context

Amid a heated AI model market in which Anthropic’s Claude Opus 4.7 and Google’s Gemini 3.1 Pro were gaining attention, OpenAI launched GPT-5.5 as its response, positioning it as the next‑generation model for real‑world work and agent‑driven tasks. The model is available in ChatGPT (Plus, Pro, Business, and Enterprise tiers), in Codex, and via the API at $5 per million input tokens and $30 per million output tokens.

Benchmark Results

OpenAI published a suite of benchmarks. The most notable figures are:

Terminal‑Bench 2.0: GPT‑5.5 82.7 % vs GPT‑5.4 75.1 % (Claude Opus 4.7 69.4 %, Gemini 3.1 Pro 68.5 %).

Expert‑SWE (internal): GPT‑5.5 73.1 % vs GPT‑5.4 68.5 %.

GDPval: GPT‑5.5 84.9 % vs GPT‑5.4 83.0 % (Claude Opus 4.7 80.3 %).

OSWorld‑Verified: GPT‑5.5 78.7 % vs GPT‑5.4 75.0 % (Claude Opus 4.7 78.0 %).

ARC‑AGI‑2: GPT‑5.5 85.0 % vs GPT‑5.4 73.3 % (Claude Opus 4.7 75.8 %, Gemini 3.1 Pro 77.1 %).

SWE‑Bench Pro (public): GPT‑5.5 58.6 % vs GPT‑5.4 57.7 % (Claude Opus 4.7 64.3 %, Gemini 3.1 Pro 54.2 %).

Tau2‑bench Telecom: GPT‑5.5 98.0 % vs GPT‑5.4 92.8 % (no external competitor reported).

The first five benchmarks show GPT‑5.5 leading, while the programming benchmark SWE‑Bench Pro still lags behind Opus 4.7 by 5.7 points, creating the primary controversy.
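For readers tallying the gaps themselves, a short script recomputes the two margins most debated in this release: the Terminal‑Bench lead and the SWE‑Bench Pro deficit. The scores are transcribed from the list above; this is purely illustrative arithmetic, not an official evaluation harness.

```python
# Reported benchmark scores (%), transcribed from the article's list.
scores = {
    "Terminal-Bench 2.0": {"GPT-5.5": 82.7, "Opus 4.7": 69.4},
    "SWE-Bench Pro":      {"GPT-5.5": 58.6, "Opus 4.7": 64.3},
}

for bench, s in scores.items():
    # Positive margin: GPT-5.5 leads; negative: it trails Opus 4.7.
    margin = round(s["GPT-5.5"] - s["Opus 4.7"], 1)
    print(f"{bench}: GPT-5.5 vs Opus 4.7 = {margin:+.1f} pts")
# → Terminal-Bench 2.0: +13.3 pts
# → SWE-Bench Pro: -5.7 pts
```

The asymmetry is the whole story: a 13.3‑point lead on terminal workflows against a 5.7‑point deficit on the public coding benchmark.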

Programming Capability

OpenAI cites internal evaluations to claim “the strongest programming model,” yet the public SWE‑Bench Pro results contradict that claim. Community observers note that OpenAI promoted SWE‑Bench Pro as hard to game in February, but has now relegated it to an appendix.

Terminal‑Bench 2.0, which measures complex command‑line workflows requiring planning, iteration, and tool coordination, shows GPT‑5.5 at 82.7 %—a substantial margin over Opus 4.7’s 69.4 % and closer to real‑world Codex usage scenarios.

"GPT‑5.5 is noticeably smarter and more persistent than GPT‑5.4, with stronger programming performance and more reliable tool use. It can stay on long‑running tasks without dropping out," said Cursor CEO Michael Truell.

An NVIDIA engineer remarked that losing access to GPT‑5.5 feels like “losing an arm.”

Agentic Capability

The most striking improvement is GPT‑5.5’s ability to complete tasks end‑to‑end. In OSWorld‑Verified, which tests autonomous control of a real computer environment, GPT‑5.5 scores 78.7 %, essentially matching Claude Opus 4.7’s 78.0 %.

In the Tau2‑bench Telecom scenario—simulating a complex customer‑service workflow without prompt tuning—GPT‑5.5 achieves 98.0 % versus GPT‑5.4’s 92.8 %.

Internal usage data shows over 85 % of OpenAI staff using Codex weekly; the Finance team processed 24,771 K‑1 tax forms (71,637 pages) two weeks ahead of schedule; a Go‑to‑Market employee saved 5‑10 hours per week by automating weekly business reports.

Research Assistance

GPT‑5.5 shows unexpected gains in scientific tasks. On GeneBench (multi‑stage genetics data analysis), GPT‑5.5 improves from 19 % (GPT‑5.4) to 25 %, while GPT‑5.5 Pro reaches 33.2 %.

An internal GPT‑5.5 variant, combined with a custom workflow, discovered a new proof related to Ramsey numbers, which was later formalized in Lean.

Immunology professor Derya Unutmaz used GPT‑5.5 Pro to analyze a 62‑sample, ~28,000‑gene expression dataset, producing a detailed report that would have taken her team months.

Mathematician Bartosz Naskręcki built an algebraic‑geometry visualization app in 11 minutes using Codex.

Pricing and Competitive Landscape

API pricing for GPT‑5.5 is $5 per million input tokens and $30 per million output tokens, double the $2.50/$15 rates of GPT‑5.4. Anthropic’s Claude Opus 4.7 is priced at $5/$25, making GPT‑5.5’s output cost about 20 % higher than its flagship competitor.

OpenAI justifies the higher price by noting lower token consumption for the same tasks, which may offset costs in Codex‑heavy workflows but not in simple Q&A scenarios.
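To make that trade-off concrete, here is a back-of-the-envelope cost comparison using the published per-million-token rates. The task token counts are hypothetical, chosen only to illustrate the break-even logic; they are not measured figures from either vendor.

```python
def task_cost(input_tokens: int, output_tokens: int,
              in_rate: float, out_rate: float) -> float:
    """Dollar cost of one task at per-million-token rates."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Published API rates ($/M input, $/M output).
GPT_5_5 = (5.00, 30.00)
GPT_5_4 = (2.50, 15.00)

# Hypothetical task where GPT-5.5 needs 40% fewer tokens for the same result.
cost_old = task_cost(100_000, 20_000, *GPT_5_4)  # $0.55
cost_new = task_cost(60_000, 12_000, *GPT_5_5)   # $0.66

print(f"GPT-5.4: ${cost_old:.2f}, GPT-5.5: ${cost_new:.2f}")
```

Because the rates exactly doubled, GPT‑5.5 must roughly halve token usage on a task to break even against GPT‑5.4; a 40 % reduction, as sketched here, still comes out more expensive, which is why simple Q&A workloads are unlikely to see savings.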

Community comparisons between GPT‑5.5 Pro and the unreleased Anthropic Mythos show similar benchmark performance, but GPT‑5.5 is publicly accessible while Mythos is not.

Safety Enhancements

GPT‑5.5 includes a substantial safety framework. In the Preparedness Framework, its network‑security and bio/chemical capabilities are rated “High” (not “Critical”). A stricter classifier is deployed for high‑risk network activities, leading some early users to perceive the model as more conservative.

OpenAI also launched a “Trusted Access for Cyber” program, allowing vetted security researchers and critical‑infrastructure defenders to request access to a less‑restricted version via chatgpt.com/cyber.

The tiered‑access approach balances capability with risk, acknowledging that fully locking down powerful models could benefit attackers.

Conclusion

GPT‑5.5 represents a genuine step forward in agentic programming, long‑context handling, and research assistance, but it is not without drawbacks: it trails Opus 4.7 on the SWE‑Bench Pro coding benchmark, carries OpenAI’s highest API price in two years, and still hallucinates more than Anthropic’s models in third‑party tests (≈86 % vs 36 %). Whether the price premium is justified depends on the use case: agentic coding workflows may find it worthwhile, while ordinary chat usage might wait for broader API availability.

Tags: benchmark, AI model, pricing, Agentic AI, coding, research assistance, GPT-5.5
Written by

ShiZhen AI

Tech blogger with over 10 years of experience at leading tech firms; an AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and the AI leisure community. 🛰 szzdzhp001
