GPT‑5.5 Beats Claude on the Zero‑Score Programming Benchmark

GPT‑5.5’s high and ultra‑high inference modes achieve the first perfect pass on the notoriously hard ProgramBench programming benchmark, surpassing Claude Opus 4.7 across all core metrics, while detailed cost and failure analyses reveal why lower‑cost settings still stumble.

SuanNi

ProgramBench benchmark

ProgramBench is a reconstruction benchmark that gives an agent only a compiled executable and a documentation file. The test system forbids source‑code access, decompilation, and any network connection. The agent must infer the program’s behavior, design its own probing methods, choose a programming language, write all source code and build scripts, and submit a program that passes every generated test.

The benchmark contains 200 real‑world tasks, ranging from compact command‑line tools such as jq and ripgrep to large systems like a PHP compiler, the FFmpeg multimedia framework, and the SQLite relational database. An agent‑driven fuzzing harness on the backend generates the test suites, more than 248,000 targeted test cases in total. A task counts as solved only if the submitted program passes its entire test suite; a single edge‑case failure marks the task as unsolved.
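The all‑or‑nothing rule can be made concrete with a small sketch. The function names and the list‑of‑booleans result format are assumptions chosen for illustration, not ProgramBench's actual harness interface.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of test cases that passed (0.0 for an empty suite)."""
    return sum(results) / len(results) if results else 0.0


def is_solved(results: Sequence[bool]) -> bool:
    """A task counts as solved only when every test case passed."""
    return len(results) > 0 and all(results)


# Hypothetical suite: 999 passing tests and a single edge-case failure.
results = [True] * 999 + [False]
print(f"pass rate: {pass_rate(results):.1%}, solved: {is_solved(results)}")
```

Under this rule a submission can have a very high pass rate and still count as unsolved, which is why the article reports pass‑rate thresholds alongside full solves.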

GPT‑5.5 evaluation

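Four configurations were measured: three GPT‑5.5 inference modes, described here, and Claude Opus 4.7 in ultra‑high mode, described in the next section.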

Default medium mode: cost $1.04, minimal API usage, but the generated argument parser was hand‑written and missed edge‑case handling, leading to failures in color‑argument parsing and non‑blocking file‑descriptor reads.

High inference mode (high): cost $3.17, 34 API calls. The model performed ten exploration rounds, probing more than forty flag combinations to learn the program’s command‑line behavior. It then emitted a complete C implementation, applying only five precise patches before the final submission. All tests passed, achieving a perfect score on the cmatrix instance.

Ultra‑high inference mode (xhigh): also cost $3.17, 34 API calls (same budget as high mode). Instead of C, the model chose Python. It executed a meticulous 27‑step testing sequence that captured every command‑line nuance, then produced a single, fully independent Python source file that passed all tests without failure.

Overall task success: In high and ultra‑high modes, GPT‑5.5 reached a unit‑test pass rate of at least 95% on 26 tasks, a marked increase over earlier models. In the cumulative score histogram, GPT‑5.5 leads at every score threshold and also on the average, median, and 90%/50% pass‑rate metrics.
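As a rough illustration of how such threshold metrics can be derived from per‑task pass rates, consider the sketch below. The `summarize` helper and the sample rates are made up for the example; they are not the benchmark's reported data or its scoring code.

```python
from statistics import mean, median
from typing import Dict, Sequence


def summarize(
    pass_rates: Sequence[float],
    thresholds: Sequence[float] = (0.5, 0.9, 0.95, 1.0),
) -> Dict[str, float]:
    """Mean, median, and the number of tasks at or above each threshold."""
    summary: Dict[str, float] = {
        "mean": mean(pass_rates),
        "median": median(pass_rates),
    }
    for t in thresholds:
        # Count tasks whose pass rate reaches the threshold.
        summary[f">={t:.0%}"] = sum(r >= t for r in pass_rates)
    return summary


# Hypothetical per-task pass rates for five tasks.
rates = [1.0, 0.97, 0.92, 0.6, 0.3]
print(summarize(rates))
```

The `>=100%` bucket corresponds to fully solved tasks, while the lower buckets show how close the remaining submissions came.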

Claude Opus 4.7 ultra‑high mode

The run cost $10.74 over 178 API calls and ended with 19 failures. The failures stemmed from two low‑level logic bugs:

Case‑sensitive string comparison for color arguments caused all uppercase or mixed‑case inputs to be rejected.

Exit‑code handling conflated distinct error conditions: the original program returned status 0 for an invalid color but status 1 for a graphics‑library initialization failure, yet the model’s code always exited with status 1, masking the difference and causing eight test cases to fail.
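Both bugs are easy to picture in a few lines. The following is a minimal sketch, assuming a fixed set of color names and an order of checks that the article does not specify; only the two exit statuses come from the text above, and the helper names are hypothetical.

```python
from typing import Optional

# Assumed color names for illustration; the real program's accepted set may differ.
VALID_COLORS = {"green", "red", "blue", "white", "yellow", "cyan", "magenta", "black"}

EXIT_INVALID_COLOR = 0  # per the article: the original exits 0 on an invalid color
EXIT_INIT_FAILURE = 1   # per the article: graphics-library init failure exits 1


def parse_color(value: str) -> Optional[str]:
    """Return the normalized color name, or None if it is unknown.

    Lower-casing first makes "GREEN" and "Green" match "green".
    """
    name = value.lower()
    return name if name in VALID_COLORS else None


def exit_status(color_arg: str, init_ok: bool) -> int:
    """Hypothetical helper that keeps the two error conditions distinct."""
    if parse_color(color_arg) is None:
        return EXIT_INVALID_COLOR
    if not init_ok:
        return EXIT_INIT_FAILURE
    return 0


print(parse_color("GREEN"), exit_status("purple", True), exit_status("green", False))
```

A reimplementation that returns 1 on every error path would differ from the original exactly on the invalid‑color case, which is the mismatch described above.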

Reconstruction details for the cmatrix task

High mode workflow – The agent first read the documentation, then iteratively sent probes to the executable. Over ten rounds it explored more than forty flag combinations, recorded exit codes and error messages, and detected the absence of the ncurses header in the environment. Based on this information it generated a single‑file C program, applying five targeted patches to achieve a flawless test run.
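The probing loop described here can be sketched as a small black‑box harness that runs a binary with different flag combinations and records what it observes. This is an illustrative assumption about the general technique, not the agent's actual tooling; the `probe` function, the `ProbeResult` type, and the stand‑in target are all hypothetical.

```python
import itertools
import subprocess
import sys
from typing import Dict, NamedTuple, Optional, Sequence, Tuple


class ProbeResult(NamedTuple):
    exit_code: Optional[int]  # None when the run timed out
    stdout: str
    stderr: str


def probe(
    binary: Sequence[str],
    flags: Sequence[str],
    max_combo: int = 2,
    timeout: float = 5.0,
) -> Dict[Tuple[str, ...], ProbeResult]:
    """Run the binary with every flag combination up to max_combo flags."""
    observations: Dict[Tuple[str, ...], ProbeResult] = {}
    for size in range(max_combo + 1):
        for combo in itertools.combinations(flags, size):
            try:
                proc = subprocess.run(
                    [*binary, *combo],
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                    stdin=subprocess.DEVNULL,
                )
                result = ProbeResult(proc.returncode, proc.stdout, proc.stderr)
            except subprocess.TimeoutExpired:
                # Interactive programs may never exit on their own.
                result = ProbeResult(None, "", "")
            observations[combo] = result
    return observations


# Stand-in target: a tiny script whose exit code equals the number of flags.
target = [sys.executable, "-c", "import sys; sys.exit(len(sys.argv) - 1)"]
for combo, result in probe(target, ["-a", "-b"]).items():
    print(combo, result.exit_code)
```

Recording exit codes and error messages per combination gives the agent a behavioral table to reproduce, which matters when the test suite checks exit status as well as output.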

Ultra‑high mode workflow – After the same initial probing phase, the agent performed a 27‑step verification that examined every subtle command‑line path. It then emitted a self‑contained Python script that reproduced the exact behavior of the original binary and passed all generated tests.

Failure analysis of GPT‑5.5 default mode

The hand‑written argument parser mishandled the double‑dash (“--”) terminator, treating it as an unknown option and prematurely printing help text, which stopped further processing. Additionally, the model used a character‑reading function on a non‑blocking file descriptor; when no input was available the function returned EOF, causing the program to permanently close standard input and break the screen‑saver key‑press detection.
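Both failure modes have well‑known fixes, sketched here in Python for brevity. The helpers are hypothetical and simplified: `split_args` follows the common convention that a bare `--` ends option parsing, and `read_key` separates "no input yet" from a real end of input on a non‑blocking descriptor. The demo assumes a Unix‑like system where `os.set_blocking` works on pipes.

```python
import os
from typing import List, Optional, Tuple


def split_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv into (options, operands) at the first bare "--".

    Simplified: without "--", everything is treated as options.
    """
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return list(argv), []


def read_key(fd: int) -> Optional[bytes]:
    """Read one byte from a non-blocking descriptor.

    Returns the byte, b"" on a real end of input, or None when no
    input is available yet (the descriptor must stay open in that case).
    """
    try:
        return os.read(fd, 1)
    except BlockingIOError:
        return None


# Demo with a pipe standing in for standard input.
r, w = os.pipe()
os.set_blocking(r, False)
print(read_key(r))   # None: nothing written yet, not end of input
os.write(w, b"q")
print(read_key(r))   # b'q'
os.close(w)
print(read_key(r))   # b'': the writer closed, so this is a real EOF
os.close(r)

print(split_args(["-b", "--", "-not-an-option"]))
```

In C the same distinction usually shows up as a read that fails with `EAGAIN` or `EWOULDBLOCK` rather than returning end of file, so treating "no data" as EOF shuts down key‑press detection too early.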
