How Much Has GPT‑5.4 Improved? Hands‑On Test of Its Three Core Capabilities and Computer Control

After GPT‑5.4’s March release, the author benchmarks it against Claude Opus 4.6 and Gemini 3.1 Pro, evaluating its knowledge‑work, native computer‑control, and programming abilities through three hands‑on tasks (a data‑analysis report, a code‑base inspection, and a complex math‑modeling contest problem), revealing strong gains but also notable limitations.

Machine Learning Algorithms & Natural Language Processing

On March 5, OpenAI released GPT‑5.4, positioning it directly against Claude Opus 4.6 (released February 5) and Gemini 3.1 Pro (released February 19). The author notes that GPT‑5.4 appears dramatically stronger and proceeds to compare the three flagship models across release dates, capabilities, and pricing.

A benchmark table shows that Anthropic leads in programming, Google leads in reasoning, while GPT‑5.4 sits in the middle on price. The most striking results are in “knowledge work” and “native computer control,” the two highlighted strengths of GPT‑5.4.

For knowledge work, the author cites the GDPval benchmark, which covers 44 real‑world professional scenarios. GPT‑5.4 matches or exceeds human experts in 83% of comparisons, indicating strong enterprise‑ready performance.

In native computer control, OSWorld reports a 75% success rate, surpassing the human average of 72.4%. GPT‑5.4 can generate Playwright scripts, take screenshots, and issue mouse‑and‑keyboard commands without third‑party libraries.
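The article doesn't reproduce any of the generated scripts. A minimal sketch of the kind of browser control it describes, using Playwright's Python sync API, might look like this (the function name and arguments are hypothetical; running it requires `pip install playwright` plus `playwright install`):

```python
# Sketch only: Playwright drives a real browser, so it must be installed
# separately; the import is guarded so the module loads without it.
try:
    from playwright.sync_api import sync_playwright
except ImportError:  # Playwright not installed; keep this as a sketch
    sync_playwright = None

def capture_page(url: str, out_path: str = "page.png") -> None:
    """Open a page in headless Chromium, screenshot it, and close."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=out_path)  # full-viewport screenshot
        browser.close()
```

The screenshot-then-act loop (observe the page, decide, click or type) is the core pattern behind the OSWorld-style tasks described above.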

Task 1: 2.62 M‑row data report (≈10 min)

Please complete the following computer operations for me: (1) open a browser, go to data.gov, and download the latest CSV of the "Consumer Complaint Database"; (2) open the file with local Python; (3) clean the data: deduplicate, handle missing values, and standardize the date format; (4) generate an analysis report with five charts (complaint trends, company rankings, product categories, a state‑distribution heat map, and response‑time distribution); (5) save the report as a PDF. Do everything via computer control; don't just give me code.

The model initially downloaded an empty file, detected the issue, switched to the CFPB API, and successfully retrieved the full dataset. After cleaning, it produced a PDF report with five charts in about ten minutes.
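The cleaning code itself isn't shown in the article. A minimal pandas sketch of step (3) (deduplication, missing values, date normalization) might look like this; the column names follow the CFPB export layout but are assumptions here:

```python
import pandas as pd

def clean_complaints(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, handle missing values, and normalize dates.

    Column names ("Date received", "Company", "State") are assumed
    from the CFPB export layout, not taken from the article.
    """
    df = df.drop_duplicates()
    # Parse dates; unparseable values become NaT instead of raising.
    df["Date received"] = pd.to_datetime(df["Date received"], errors="coerce")
    # Fill categorical gaps with an explicit placeholder.
    df[["Company", "State"]] = df[["Company", "State"]].fillna("Unknown")
    # Drop rows whose date could not be parsed.
    return df.dropna(subset=["Date received"])
```

On 2.62 M rows this runs in seconds; the slow parts of the task were the download and the chart rendering, not the cleaning.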

Task 2: 20 k‑line PySide6 project analysis

The author asked GPT‑5.4 (via Codex) to (1) draw a complete function‑call graph, (2) identify the three most performance‑critical functions, and (3) infer the original developer’s coding style. The model generated a clear Mermaid call graph, correctly pinpointed three redundant functions, and produced a plausible style assessment, demonstrating solid mid‑level code‑understanding.
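The article doesn't show how Codex built the graph. As an illustration of the underlying idea, a stdlib‑only sketch can extract a (heavily simplified) call graph from Python source and render it as Mermaid; this walker only sees direct `name(...)` calls, ignoring methods, attributes, and imports:

```python
import ast

def call_graph(source: str) -> dict[str, set[str]]:
    """Map each function to the plain names it calls (simplified:
    only direct `name(...)` calls, no attribute or method calls)."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = {
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
    return graph

def to_mermaid(graph: dict[str, set[str]]) -> str:
    """Render the call graph as a Mermaid flowchart."""
    lines = ["graph TD"]
    for caller, callees in sorted(graph.items()):
        for callee in sorted(callees):
            lines.append(f"    {caller} --> {callee}")
    return "\n".join(lines)
```

A real tool for a 20 k‑line PySide6 project would also need to resolve `self.method()` calls and signal/slot connections, which is where most of the difficulty lives.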

Task 3: 2024 Math Modeling Contest C problem

The model built a mixed‑integer linear program with PuLP, drafted a full paper outline (abstract, assumptions, notation, modeling, solution, analysis, evaluation), but struggled with Windows PowerShell’s Chinese encoding, file‑name handling, and LaTeX syntax. After renaming files to ASCII, it completed the pipeline, delivering a 70‑point‑level solution but falling short of a top‑score paper.
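The contest model itself isn't reproduced in the article. To illustrate the mixed‑integer linear programming setup (the article used PuLP; SciPy's `milp` is shown here instead since it ships with the standard scientific Python stack), a tiny knapsack‑style MILP with hypothetical item values and weights:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Tiny knapsack-style MILP: pick items maximizing value under a weight cap.
# All numbers are hypothetical, chosen only for illustration.
values = np.array([10.0, 7.0, 4.0])    # item values
weights = np.array([5.0, 4.0, 3.0])    # item weights
capacity = 8.0

res = milp(
    c=-values,                          # milp minimizes, so negate values
    constraints=LinearConstraint(weights.reshape(1, -1), 0, capacity),
    integrality=np.ones(3),             # 1 marks each variable as integer
    bounds=Bounds(0, 1),                # binary: integer in [0, 1]
)
chosen = res.x.round().astype(int)      # selection vector, e.g. [1, 0, 1]
```

A contest‑grade model would have hundreds of variables and constraints, but the structure (objective vector, linear constraints, integrality flags) is the same.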

Overall, the author concludes that GPT‑5.4’s coding ability is strong, its “human‑like” perception of code authors is impressive, yet its computer‑control capability still lags the advertised “out‑of‑the‑box” performance.

The Pro version, tested on an interview scenario with an AI‑industry recruiter of 20 years' experience and on a CUDA‑installation query, produces higher‑quality output, but at $180 per million tokens (over seven times Claude's price) it is expensive for heavy‑use cases.
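At that rate, per‑task cost adds up quickly. A back‑of‑the‑envelope helper (the $180‑per‑million figure and the seven‑times ratio are from the article; the 50 k‑token task size is a hypothetical example):

```python
def task_cost(tokens: int, price_per_million: float = 180.0) -> float:
    """Dollar cost of a task at a flat per-million-token rate."""
    return tokens / 1_000_000 * price_per_million

# A hypothetical 50k-token agent task at the quoted Pro rate:
print(task_cost(50_000))              # 9.0 (dollars)
# The same task at one-seventh the price (roughly Claude's, per the article):
print(round(task_cost(50_000, 180.0 / 7), 2))
```

Real bills also depend on the input/output token split and caching, which the article's flat figure glosses over.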

Finally, the author poses a strategic question: if you could focus GPT‑5.4 on one dimension for the next two quarters to secure irreversible user mindshare, which would you choose—long‑term agent task completion, high‑value knowledge‑work sign‑off rate, cross‑application context continuity, or per‑task cost?

Tags: benchmark, coding assistance, math modeling, AI model evaluation, GPT-5.4, computer control
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
