GPT‑5.5 Beats Claude on the Zero‑Score Programming Benchmark
In its high and ultra‑high inference modes, GPT‑5.5 achieves the first perfect pass on the notoriously hard ProgramBench programming benchmark and surpasses Claude Opus 4.7 across all core metrics. Detailed cost and failure analyses show why lower‑cost settings still stumble.
ProgramBench benchmark
ProgramBench is a reconstruction benchmark that gives an agent only a compiled executable and a documentation file. The test system forbids source‑code access, decompilation, and any network connection. The agent must infer the program’s behavior, design its own probing methods, choose a programming language, write all source code and build scripts, and submit a program that passes every generated test.
The benchmark contains 200 real‑world tasks, ranging from tiny text‑processing tools such as jq and ripgrep to large systems like a PHP compiler, the FFmpeg multimedia framework, and the SQLite relational database. For each task the backend runs an agent‑driven fuzzing harness that creates more than 248 000 targeted test cases. A task is counted as solved only if the submitted program passes the entire test suite; a single edge‑case failure marks the task as unsolved.
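The all‑or‑nothing scoring rule can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual harness: `TaskResult` and `summarize` are hypothetical names, and the 0.95 threshold is borrowed from the "≥95 %" pass‑rate figure reported later in this article.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Per-task test outcome (hypothetical structure, for illustration only)."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Fraction of generated tests the submission passed.
        return self.passed / self.total if self.total else 0.0

    @property
    def solved(self) -> bool:
        # All-or-nothing: a single failing test leaves the task unsolved.
        return self.total > 0 and self.passed == self.total


def summarize(results, threshold=0.95):
    """Count fully solved tasks and tasks at or above a pass-rate threshold."""
    return {
        "solved": sum(r.solved for r in results),
        "at_or_above_threshold": sum(r.pass_rate >= threshold for r in results),
    }
```

Under this rule a submission that passes 99 of 100 tests still counts as unsolved, which is why a softer pass‑rate threshold is reported alongside the strict count.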
GPT‑5.5 evaluation
Four inference configurations were measured.
Default medium mode: cost $1.04, minimal API usage, but the generated argument parser was hand‑written and missed edge‑case handling, leading to failures in color‑argument parsing and non‑blocking file‑descriptor reads.
High inference mode (high): cost $3.17, 34 API calls. The model performed ten exploration rounds, probing more than forty flag combinations to learn the program’s command‑line behavior. It then emitted a complete C implementation and applied only five precise patches before the final submission. All tests passed, giving a perfect score on the cmatrix instance.
Ultra‑high inference mode (xhigh): also cost $3.17, 34 API calls (same budget as high mode). Instead of C, the model chose Python. It executed a meticulous 27‑step testing sequence that captured every command‑line nuance, then produced a single, fully independent Python source file that passed all tests.
Overall task success: in high and ultra‑high modes GPT‑5.5 reached a unit‑test pass rate of at least 95% on 26 tasks, a historic increase compared with earlier models. The cumulative score histogram shows GPT‑5.5 ahead at every score threshold, as well as on the average, median, and 90%/50% pass‑rate metrics.
Claude Opus 4.7 ultra‑high mode
The run cost $10.74, used 178 API calls, and ended with 19 failures. The failures stemmed from two low‑level logic bugs:
Case‑sensitive string comparison for color arguments caused all uppercase or mixed‑case inputs to be rejected.
Exit‑code handling conflated distinct error conditions: the original program returned status 0 for an invalid color but status 1 for a graphics‑library initialization failure, yet the model’s code always exited with status 1, masking the difference and causing eight test cases to fail.
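A minimal Python sketch of how the two bugs could be avoided follows. It is not the code either model produced: the color list, the function names, and the `init_display` callback are assumptions made for illustration. Only the exit statuses (0 for an invalid color, 1 for a graphics‑library initialization failure) come from the description above.

```python
# Assumed color names; the real program's accepted list may differ.
VALID_COLORS = {"green", "red", "blue", "white", "yellow", "cyan", "magenta", "black"}

# Exit statuses as described in the article for the original program.
EXIT_OK = 0
EXIT_INVALID_COLOR = 0   # the original reportedly exits 0 on an invalid color
EXIT_INIT_FAILURE = 1    # graphics-library initialization failure


def parse_color(arg):
    """Return the normalized color name, or None if it is not recognized."""
    name = arg.casefold()  # case-insensitive: "GREEN" and "Green" both match
    return name if name in VALID_COLORS else None


def run(color_arg, init_display):
    """Return the process exit status, keeping the two error paths distinct."""
    if parse_color(color_arg) is None:
        return EXIT_INVALID_COLOR
    if not init_display():  # init_display is a hypothetical stand-in
        return EXIT_INIT_FAILURE
    return EXIT_OK
```

Keeping each error path on its own return statement makes it harder to collapse distinct statuses into one value, which is the failure described in the second bug.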
Reconstruction details for the cmatrix task
High mode workflow – The agent first read the documentation, then iteratively sent probes to the executable. Over ten rounds it explored more than forty flag combinations, recorded exit codes and error messages, and detected the absence of the ncurses header in the environment. Based on this information it generated a single‑file C program, applying five targeted patches to achieve a flawless test run.
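The probing phase described here, running the executable with many flag combinations and recording exit codes and messages, might look roughly like the sketch below. The helper names `flag_combinations` and `probe` are hypothetical, and the agent's real tooling is not documented in the article.

```python
import itertools
import subprocess


def flag_combinations(flags, max_size=2):
    """Yield every combination of the given flags up to max_size flags."""
    for size in range(max_size + 1):
        for combo in itertools.combinations(flags, size):
            yield list(combo)


def probe(base_cmd, flag_sets, timeout=5.0):
    """Run base_cmd with each flag set and record what is observable."""
    records = []
    for flags in flag_sets:
        try:
            proc = subprocess.run(
                base_cmd + flags,
                capture_output=True,       # keep stdout and stderr for comparison
                text=True,
                timeout=timeout,
                stdin=subprocess.DEVNULL,  # never block waiting for input
            )
            records.append({
                "flags": flags,
                "exit_code": proc.returncode,
                "stdout": proc.stdout,
                "stderr": proc.stderr,
            })
        except subprocess.TimeoutExpired:
            # Interactive programs may never exit; record that as a finding too.
            records.append({"flags": flags, "exit_code": None,
                            "stdout": "", "stderr": "timeout"})
    return records
```

Running the same probes against both the reference binary and the candidate reimplementation, then comparing the two record lists, gives a simple behavioral check before submission.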
Ultra‑high mode workflow – After the same initial probing phase, the agent performed a 27‑step verification that examined every subtle command‑line path. It then emitted a self‑contained Python script that reproduced the exact behavior of the original binary and passed all generated tests.
Failure analysis of GPT‑5.5 default mode
The hand‑written argument parser mishandled the double‑dash (“--”) terminator, treating it as an unknown option and prematurely printing help text, which stopped further processing. Additionally, the model used a character‑reading function on a non‑blocking file descriptor; when no input was available the function returned EOF, causing the program to permanently close standard input and break the screen‑saver key‑press detection.
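Both failure modes can be illustrated with a short Python sketch. This is not the model's code, and the function names are hypothetical. `split_args` is deliberately simplified: it does not attach values to options that take an argument. `read_key` relies on `os.read`, which raises `BlockingIOError` when a non‑blocking descriptor has no data and returns an empty byte string only at a real end of file.

```python
import os


def split_args(argv):
    """Split argv into (options, operands), honoring the '--' terminator."""
    options, operands = [], []
    for i, arg in enumerate(argv):
        if arg == "--":
            # Everything after '--' is an operand, even if it starts with '-'.
            operands.extend(argv[i + 1:])
            break
        if arg.startswith("-") and arg != "-":
            options.append(arg)
        else:
            operands.append(arg)
    return options, operands


def read_key(fd):
    """Read one byte from a non-blocking descriptor.

    Returns the byte if a key is available, None if no input is ready yet,
    and b"" only when the descriptor has really reached end of file.
    """
    try:
        return os.read(fd, 1)
    except BlockingIOError:
        # No data right now: this is not EOF, so the caller should keep polling.
        return None
```

The key point is that "no input yet" and "end of input" are reported through different channels, so a polling loop can keep standard input open until a real EOF arrives.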
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
