OpenAI Skips GPT‑5.3, Launches GPT‑5.4: Wins 5 of 8 Benchmarks, Sparks Heated Debate

OpenAI announced GPT‑5.4 at 2 a.m., skipping GPT‑5.3 and claiming integrated coding and reasoning abilities; the model tops five of eight benchmark categories, introduces native computer operation, tool‑search and interruptible thinking, while users debate its trustworthiness and pricing changes.

AI Insight Log

Model release and versioning

OpenAI announced GPT‑5.4 at 02:00 UTC, stating that it merges the coding abilities of a GPT‑5.3‑Codex variant with the general reasoning of GPT‑5.2; that merged lineage is the stated rationale for jumping directly from 5.2 to 5.4.

GPT‑5.4 is available now in the API and Codex, and is rolling out in ChatGPT over the course of the day. OpenAI says it is much better at knowledge work and web search, adds native computer‑use capabilities, supports a 1 M‑token context window, and can be steered mid‑response.

Benchmark performance

OpenAI evaluated GPT‑5.4, GPT‑5.4 Pro, GPT‑5.2, Claude Opus 4.6 and Gemini 3.1 Pro on eight public benchmarks, running each model with the “maximum reasoning effort” setting. The key results are:

OSWorld (computer control): GPT‑5.4 75.0 % vs GPT‑5.2 47.3 % vs Claude Opus 72.7 %.

WebArena (web browsing): GPT‑5.4 67.3 % vs GPT‑5.2 65.4 % vs Claude Opus 66.4 %.

GDPval (knowledge work): GPT‑5.4 83.0 % vs GPT‑5.2 70.9 % vs Claude Opus 78.0 %.

BrowseComp (search): GPT‑5.4 82.7 % vs GPT‑5.2 65.8 % vs Claude Opus 84.0 % vs Gemini 3.1 Pro 85.9 %.

SWE‑Bench Pro (coding): GPT‑5.4 57.7 % vs GPT‑5.2 55.6 % vs Gemini 3.1 Pro 54.2 %.

GPQA Diamond (science): GPT‑5.4 92.8 % vs GPT‑5.2 68.4 % vs Claude Opus 78.0 % vs Gemini 3.1 Pro 84.0 %.

ARC‑AGI‑2 (abstract reasoning): GPT‑5.4 73.3 % vs GPT‑5.2 52.9 %.

FrontierMath (math): GPT‑5.4 47.6 % vs GPT‑5.2 40.7 % vs Gemini 3.1 Pro 36.9 %.

OpenAI notes that the numbers are measured under its own maximum‑effort configuration, so direct cost or efficiency comparisons across vendors should be treated as indicative only.

Three substantive new capabilities

Native computer operation: The model can ingest screenshots, move the mouse, and type on a keyboard, enabling multi‑app workflows such as form filling or batch data entry without writing separate RPA scripts. In the API, developers can send system messages to control behavior and optionally attach custom safety‑confirmation policies. OpenAI demonstrated an end‑to‑end email‑and‑calendar task where the model identified UI elements from a screenshot and clicked coordinates to complete the action.
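The agent loop this implies is: capture a screenshot, send it to the model, receive a UI action, execute it, repeat. A minimal sketch of the client-side dispatcher is below; the action schema ({"type": "click", ...} and so on) and the simulated screen state are illustrative assumptions, not GPT‑5.4's actual API payloads.

```python
# Sketch of a client-side dispatcher for computer-use actions.
# The action schema here is hypothetical; real payloads may differ.

def dispatch_action(action: dict, screen: dict) -> dict:
    """Apply one model-issued UI action to a simulated screen state."""
    kind = action.get("type")
    if kind == "screenshot":
        # In a real agent loop this would capture pixels to send back.
        screen["screenshots"] += 1
    elif kind == "click":
        screen["cursor"] = (action["x"], action["y"])
        screen["clicks"].append(screen["cursor"])
    elif kind == "type":
        screen["text"] += action["text"]
    else:
        raise ValueError(f"unknown action type: {kind!r}")
    return screen

# Example steps: the model locates a UI element and fills a field.
screen = {"cursor": None, "clicks": [], "text": "", "screenshots": 0}
dispatch_action({"type": "screenshot"}, screen)
dispatch_action({"type": "click", "x": 412, "y": 88}, screen)
dispatch_action({"type": "type", "text": "quarterly report"}, screen)
```

A production loop would also route each proposed action through the custom safety-confirmation policy before executing it.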

Tool search: Instead of loading full tool definitions into the prompt, GPT‑5.4 receives a lightweight list of tool identifiers and fetches the full definition only when needed. OpenAI’s Scale MCP‑Atlas benchmark shows a 47 % reduction in total token consumption while maintaining the same accuracy, which can materially lower monthly API bills for agents that reference many tools.
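The mechanism can be sketched as a registry split in two: identifiers that go into the prompt up front, and full JSON-schema definitions resolved lazily only when the model selects a tool. The registry contents and function names below are illustrative, not OpenAI's implementation.

```python
# Lazy tool resolution: only names travel in the prompt; schemas are
# fetched on demand. Tool definitions here are made up for illustration.

FULL_DEFINITIONS = {
    "get_weather": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    "send_email": {"name": "send_email", "description": "Send an email."},
}

def lightweight_tool_list(defs: dict) -> list[str]:
    """What the prompt carries up front: identifiers only, no schemas."""
    return sorted(defs)

def resolve_tool(name: str, defs: dict) -> dict:
    """Fetched only once the model decides to call this tool."""
    if name not in defs:
        raise KeyError(f"unknown tool: {name}")
    return defs[name]
```

With dozens of tools, the prompt cost scales with the number of names rather than the combined size of every schema, which is where the reported token savings would come from.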

Interruptible “thinking”: The model first returns a reasoning outline (a chain‑of‑thought) and allows the user to intervene, adding instructions or correcting direction, before the full response is generated. This feature is live on chatgpt.com and Android; iOS support has been announced.

Pricing and token‑context changes

API pricing increased relative to GPT‑5.2:

Input price: $1.75 /M → $2.50 /M (≈ 43 % increase).

Output price: $14 /M → $15 /M (≈ 7 % increase).

Cache‑input price: $0.175 /M → $0.25 /M.

GPT‑5.4 Pro is priced at $30 /M input and $180 /M output.

OpenAI argues that the higher token‑efficiency (e.g., 47 % fewer tokens for tool search) may offset the higher per‑token rates, but actual savings depend on workload patterns.
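A back-of-envelope check of that offset argument, using the article's published rates. The workload sizes are illustrative, and the 47 % reduction is applied only to input tokens here (the benchmark figure covers total consumption, so this is a conservative reading).

```python
# Does the 47 % token reduction offset the higher per-token rates?
# Prices from the article; token volumes are assumed for illustration.

def monthly_cost(in_tokens: float, out_tokens: float,
                 in_price: float, out_price: float) -> float:
    """Dollar cost for prices quoted per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# A tool-heavy agent: 100 M input and 10 M output tokens per month.
old = monthly_cost(100e6, 10e6, 1.75, 14.0)          # GPT-5.2 rates
new = monthly_cost(100e6 * 0.53, 10e6, 2.50, 15.0)   # GPT-5.4, 47 % fewer input tokens
```

For this input-heavy profile the savings win; an output-heavy workload that cannot benefit from tool search would instead pay strictly more.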

The model also supports an experimental 1 M‑token context window via the model_context_window setting; requests exceeding the standard 272 K tokens are billed at double rate.
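The surcharge rule can be written out directly: tokens up to the standard 272 K window bill at the normal rate, and every token beyond it at double. The exact billing mechanics are an assumption based on the article's one-sentence description.

```python
# Long-context surcharge as described: tokens past the 272 K standard
# window are billed at double the per-token rate (assumed mechanics).

STANDARD_WINDOW = 272_000

def request_input_cost(tokens: int, price_per_m: float) -> float:
    """Input cost in dollars for one request, given a per-million rate."""
    base = min(tokens, STANDARD_WINDOW)
    overflow = max(0, tokens - STANDARD_WINDOW)
    return (base * price_per_m + overflow * 2 * price_per_m) / 1_000_000

# A 500 K-token request at GPT-5.4's $2.50/M input rate:
cost = request_input_cost(500_000, 2.50)
```

Note that the 228 K overflow tokens account for most of the cost, so filling the full 1 M window is far more than ~4x the price of a standard-window request.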

Additional technical details

Hallucination rate dropped 33 % per statement and 18 % per full response compared with GPT‑5.2, based on user‑flagged factual errors.

Investment‑modeling benchmark (simulated junior analyst spreadsheet task) rose from 68.4 % (GPT‑5.2) to 87.3 % (GPT‑5.4).

ARC‑AGI‑2 abstract‑reasoning score: 73.3 % (GPT‑5.4) vs 52.9 % (GPT‑5.2); GPT‑5.4 Pro reaches 83.3 %.

Safety rating is labeled “High” with monitoring, trusted‑access controls, and asynchronous blocking; zero‑data‑retention (ZDR) scenarios still incur request‑level interception.

GPT‑5.2 will remain available in the “Legacy Models” selector until its retirement on 2026‑06‑05.

Competitive landscape

Claude Opus 4.6 remains competitive, edging GPT‑5.4 in BrowseComp (84.0 % vs 82.7 %) and trailing only slightly in OSWorld (72.7 % vs 75.0 %). Gemini 3.1 Pro exceeds GPT‑5.4 in BrowseComp (85.9 % vs 82.7 %) but trails it in GPQA Diamond (84.0 % vs 92.8 %). The most notable progress comes from OpenAI’s own version jumps: OSWorld improved from 47.3 % to 75.0 %, BrowseComp from 65.8 % to 82.7 %, and FrontierMath from 40.7 % to 47.6 %.

A sober take

Benchmark scores are obtained under ideal, maximum‑effort settings; real‑world usage may encounter edge cases, token‑budget constraints, and trust concerns. While safety upgrades (controlled chain‑of‑thought, custom confirmation policies) represent steps forward, their effectiveness in addressing deeper user worries about surveillance or autonomous weaponization remains to be validated through prolonged deployment.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language model, benchmark, OpenAI, AI capabilities, GPT-5.4
Written by AI Insight Log

Focused on sharing: AI programming | Agents | Tools