GPT-5.4 Beats Human Baseline and Cuts Agent Token Use by Half
OpenAI's newly released GPT-5.4 integrates reasoning, coding, computer use, and agent tool calls, achieving a 75% success rate on OSWorld-Verified tasks—surpassing the human baseline—while its Tool Search feature cuts agent token consumption by 47%, and its context window now extends to 1 million tokens for long‑running workflows.
What Is GPT-5.4?
On March 5, 2026, OpenAI launched GPT-5.4 and a higher‑performance variant called GPT-5.4 Pro. The model combines the programming strength of GPT‑5.3‑Codex, the general reasoning of GPT‑5.2, and a new native Computer Use capability that can write Playwright scripts, interpret screenshots, and issue keyboard‑mouse actions.
Five Core Capabilities
Native Computer Use – Writes Playwright code and directly manipulates the desktop; achieves 75.0% success on the OSWorld‑Verified benchmark, exceeding the human baseline of 72.4%.
Million‑Token Context – Both Codex and the API now accept up to 1 M tokens, allowing agents to run long tasks without chopping context.
Top‑Tier Programming – Scores 57.7% on SWE‑Bench Pro, beating GPT‑5.3‑Codex (56.8%) with lower latency.
Tool Search – Retrieves tool definitions on demand instead of stuffing all definitions into the prompt, cutting token usage by 47% while keeping accuracy.
More Efficient Inference – Solves more complex problems with fewer tokens and runs noticeably faster than GPT‑5.2.
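To get a feel for what a 1 M token window buys, the common ~4-characters-per-token rule of thumb (a rough approximation, not a real tokenizer) lets you estimate whether a workload fits without chunking:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate using the ~4 chars/token rule of thumb."""
    return len(text) // 4

def fits_context(documents: list[str], limit: int = 1_000_000) -> bool:
    """Check whether the combined documents fit within the context window."""
    return sum(approx_tokens(d) for d in documents) <= limit

# Three ~1 MB files come to roughly 750k tokens: inside a 1M window,
# but well past older 200k limits.
codebase = ["x" * 1_000_000] * 3
print(fits_context(codebase))            # True  (1M window)
print(fits_context(codebase, 200_000))   # False (older 200k window)
```

The heuristic over- or under-counts for code and non-English text, but it is good enough for deciding whether manual context chopping is needed at all.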
Benchmark Overview
Key results from the official evaluation table:
GDPval (professional work): GPT‑5.4 83.0% vs. GPT‑5.2 70.9%.
SWE‑Bench Pro (coding): GPT‑5.4 57.7% vs. GPT‑5.3‑Codex 56.8%.
OSWorld‑Verified (Computer Use): GPT‑5.4 75.0% vs. GPT‑5.3‑Codex 74.0% and GPT‑5.2 47.3%.
Toolathlon (tool calling): GPT‑5.4 54.6% vs. GPT‑5.3‑Codex 51.9%.
BrowseComp (web search): GPT‑5.4 82.7% vs. GPT‑5.3‑Codex 77.3%.
GDPval measures performance across 44 real‑world professional scenarios (sales PPTs, accounting sheets, scheduling tables, manufacturing diagrams, etc.). GPT‑5.4 matches or exceeds human experts in 83% of cases, a clear jump from GPT‑5.2’s 70.9%.
Computer Use in Practice
The model offers two approaches: generating Playwright scripts for developers and directly processing screenshots to issue mouse‑keyboard commands. Developers can set system messages to define risk tolerance and confirmation policies, allowing high‑risk actions (e.g., file deletion) to require explicit approval.
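The article does not publish OpenAI's policy format, but the confirmation-policy idea can be sketched as a simple gate: actions tagged high-risk are held until a human approves them. Everything below (the `Action` type, the risk labels) is illustrative, not OpenAI's actual API:

```python
from dataclasses import dataclass

# Illustrative set of actions a policy might treat as high-risk.
HIGH_RISK = {"delete_file", "send_payment", "run_shell"}

@dataclass
class Action:
    name: str
    args: dict

def execute(action: Action, approved: bool = False) -> str:
    """Run an agent action, requiring explicit approval for high-risk ones."""
    if action.name in HIGH_RISK and not approved:
        return f"BLOCKED: '{action.name}' needs explicit user approval"
    return f"executed {action.name}"

print(execute(Action("click", {"x": 10, "y": 20})))
print(execute(Action("delete_file", {"path": "/tmp/report.txt"})))
print(execute(Action("delete_file", {"path": "/tmp/report.txt"}), approved=True))
```

Low-risk actions like clicks flow through unattended; the destructive `delete_file` is blocked until the call is retried with approval.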
OpenAI also released the Playwright Interactive skill, enabling visual debugging while building web or Electron apps.
Tool Search: Solving Agent Scaling Bottlenecks
Previously, agents had to embed all tool definitions in the prompt, wasting tens of thousands of tokens and causing cache churn. Tool Search introduces a lightweight tool list plus a search capability that fetches definitions only when needed.
In a test of 250 MCP Atlas tasks, enabling Tool Search reduced token consumption by 47% with unchanged accuracy.
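The arithmetic behind the savings is easy to see in a toy simulation. The token counts below are made up for the sketch, but the mechanism matches the description above: a short stub per tool plus full definitions only for tools actually used, versus embedding every definition up front:

```python
# Illustrative comparison of prompt cost with and without on-demand retrieval.
tools = {f"tool_{i}": 400 for i in range(50)}   # 50 tools, ~400 tokens per full definition
STUB_COST = 20                                   # short name+summary entry in the tool list
used = ["tool_3", "tool_17", "tool_42"]          # tools the task actually needs

embed_all = sum(tools.values())                                    # classic approach
on_demand = STUB_COST * len(tools) + sum(tools[t] for t in used)   # Tool Search style

print(embed_all)   # 20000
print(on_demand)   # 2200
print(f"saved {100 * (1 - on_demand / embed_all):.0f}%")  # saved 89%
```

The savings grow with the tool count, which is exactly why the classic embed-everything approach stops scaling past a few dozen tools.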
"Tool Search is the overlooked feature that finally keeps agents from falling over once the tool count grows beyond twenty."
Developer Demos
Within hours of release, developers posted videos showing GPT‑5.4 building a game, launching a browser to test it, and iterating without human intervention.
One demo used Playwright MCP to let the model open a game, read screenshots, and issue actions autonomously.
Another collection presented seven end‑to‑end tasks—including a playable 3D board game—demonstrating that the 1 M token context eliminates the need for manual context chopping.
Programming Ability vs. Claude Opus 4.6
Community tests compared GPT‑5.4 to Claude Code Opus 4.6 on an eight‑stage macOS app project. GPT‑5.4 completed all stages within an hour, while Claude stalled after the second stage.
Although this is a single anecdotal test, it suggests GPT‑5.4 excels at long‑flow, multi‑tool, iterative tasks rather than isolated queries.
The API also offers a /fast mode that can boost token‑per‑second throughput by up to 1.5×.
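A quick back-of-the-envelope shows what a 1.5× throughput gain means for a long generation. The baseline tokens-per-second figure here is illustrative, not a published number:

```python
def generation_time(tokens: int, tps: float) -> float:
    """Seconds to stream `tokens` output tokens at `tps` tokens/second."""
    return tokens / tps

baseline_tps = 80.0            # illustrative baseline throughput
fast_tps = baseline_tps * 1.5  # up to 1.5x with /fast mode

print(generation_time(12_000, baseline_tps))  # 150.0 s
print(generation_time(12_000, fast_tps))      # 100.0 s
```

For a 12k-token response, the cap works out to about 50 seconds saved per call, which compounds quickly in iterative agent loops.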
Deep‑Tester Feedback
Matt Shumer, an early‑access tester, described GPT‑5.4 as the best model currently available, noting that the choice of model becomes less critical.
Standard version outperforms Pro – Heavy Pro users switched to the standard tier, reducing daily costs.
Coding is essentially solved – Stability in Codex makes further improvements marginal.
Inference tokens are cheaper – Same‑quality results use fewer tokens and run faster.
Drawbacks
Front‑end visual aesthetics lag behind Claude Opus 4.6 and Gemini 3.1 Pro.
The model can miss implicit real‑world knowledge (e.g., overlooking spring‑break crowding in travel planning).
Some community members caution that the 57% SWE‑Bench Pro score does not imply all programming problems are solved and note potential bias from early‑access collaborations.
New ChatGPT Experience
GPT‑5.4 Thinking introduces “mid‑stream interruption,” allowing users to inject new instructions while the model is reasoning, eliminating the need to wait for a full response before correcting direction.
The model also provides an initial outline or work plan before execution, mirroring Codex’s transparent workflow style.
BrowseComp performance improves by 17% absolute on multi‑source synthesis queries.
Professional Work Performance
In internal spreadsheet benchmarks, GPT‑5.4 scores 87.3% versus GPT‑5.2’s 68.4%.
For PPT creation, human evaluators prefer GPT‑5.4’s output 68% of the time, citing richer visuals and varied imagery.
A new ChatGPT for Excel plugin integrates financial data sources such as Factiva, Daloopa, and S&P Global, targeting finance analysts.
Hallucination Reduction
On a user‑flagged factual‑error set, GPT‑5.4’s single‑statement error rate is 33% lower than GPT‑5.2, and the probability of a full answer containing any error drops by 18%.
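Per-statement and whole-answer error rates are related: an answer with many statements has many chances to contain an error. Under a simplifying independence assumption (real statements are correlated, so this is only a sketch, and the per-statement rate below is illustrative), the relationship is P(any error) = 1 − (1 − p)ⁿ:

```python
def answer_error_prob(p_statement: float, n_statements: int) -> float:
    """P(at least one wrong statement), assuming independent errors."""
    return 1 - (1 - p_statement) ** n_statements

p = 0.02                # illustrative per-statement error rate for the old model
p_new = p * (1 - 0.33)  # 33% lower per-statement rate
for n in (5, 20):
    old, new = answer_error_prob(p, n), answer_error_prob(p_new, n)
    print(f"n={n}: {old:.3f} -> {new:.3f} ({100 * (1 - new / old):.0f}% lower)")
```

Note the whole-answer improvement comes out smaller than the per-statement one for long answers, consistent in direction with the 33% vs. 18% figures reported above.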
How Developers Can Access the New Features
Model names for the API:
gpt-5.4 – standard version
gpt-5.4-pro – Pro version
Key new API parameters:
computer – enables the Computer Use tool
tool_search – activates the on‑demand tool retrieval mechanism
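Putting the pieces together, a request body might look like the sketch below. The model name and tool-type strings are taken from the article, not verified against official docs, and no network call is made here; the snippet only constructs the JSON payload:

```python
import json

# Hypothetical request body using the model name and tool flags described above.
payload = {
    "model": "gpt-5.4",
    "input": "Open the sales sheet and total column C.",
    "tools": [
        {"type": "computer"},      # enable the Computer Use tool
        {"type": "tool_search"},   # fetch tool definitions on demand
    ],
}

print(json.dumps(payload, indent=2))
```

Consult the linked API docs for the authoritative parameter shapes before relying on these names.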
Image input now supports two detail levels:
original – up to 10.24 M pixels or a 6000‑pixel edge length (whichever is smaller)
high – up to 2.56 M pixels, improving localization and click accuracy
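A client-side check against the limits quoted above can be written as a small validator. The limits are as stated in the article; the function itself is an illustrative sketch, not part of the SDK:

```python
def fits_detail(width: int, height: int, detail: str = "original") -> bool:
    """Check an image against the detail-level limits quoted above."""
    pixels = width * height
    if detail == "high":
        return pixels <= 2_560_000
    # "original": both the pixel-count and longest-edge limits must hold
    return pixels <= 10_240_000 and max(width, height) <= 6000

print(fits_detail(3200, 3200))          # True: 10.24M pixels, 3200px edge
print(fits_detail(8000, 1000))          # False: only 8M pixels, but 8000px edge
print(fits_detail(1600, 1600, "high"))  # True: exactly 2.56M pixels
```

The second case shows why both constraints matter: a long, thin screenshot can violate the edge limit while staying under the pixel budget.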
References
OpenAI blog: https://openai.com/index/introducing-gpt-5-4/
GPT‑5.4 API docs: https://developers.openai.com/api/docs/guides/latest-model
Tool Search guide: https://developers.openai.com/api/docs/guides/tools-tool-search
Playwright Interactive skill: https://github.com/openai/skills/tree/main/skills/.curated/playwright-interactive
ShiZhen AI
Tech blogger with over 10 years of experience at leading tech firms; AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and leisure. 🛰 szzdzhp001
