Why Top LLMs Score 0% on the New ProgramBench: Engineering Intelligence’s Next Battleground
The newly released ProgramBench benchmark forces leading LLMs to rebuild full software projects from only usage docs, revealing a 0% full‑completion rate for Claude Opus, GPT‑5, Gemini and others, and exposing the gap between local code generation and true engineering intelligence.
The creators of SWE‑Bench have launched a "hellish" new benchmark called ProgramBench, which asks language models to reconstruct an entire, executable software system from a functional description and usage documentation, without access to the original source code, tests, or the internet.
Results are striking: the strongest first‑tier models (Claude Opus 4.7, GPT‑5.4, GPT‑5 mini, Gemini 3.1 Pro and Gemini 3 Flash) all post a 0% full‑completion rate. Even the best performer, Claude Opus 4.7, completes only 3% of tasks under the supplementary "Almost" metric, which credits a task whose rebuilt program passes above a 95% threshold of its behavioral checks.
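As a rough illustration of how the two scores relate, here is a minimal sketch, assuming "Almost" simply means a task whose rebuilt program passes at least 95% of its behavioral checks; the function and the sample numbers are hypothetical, not ProgramBench's published scoring code.

```python
def score(pass_fractions, almost_threshold=0.95):
    """Aggregate per-task pass fractions into the two headline metrics.

    pass_fractions: one float in [0, 1] per task, the share of that
    task's behavioral checks the rebuilt program passed.
    (Hypothetical aggregation, for illustration only.)
    """
    n = len(pass_fractions)
    full = sum(p == 1.0 for p in pass_fractions) / n
    almost = sum(p >= almost_threshold for p in pass_fractions) / n
    return full, almost

# Hypothetical results for five tasks: none is perfect, one clears 95%.
full, almost = score([0.97, 0.80, 0.42, 0.15, 0.63])
print(full, almost)  # 0.0 0.2
```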
Unlike earlier code‑generation benchmarks that evaluate local abilities such as function completion, bug fixing, or feature implementation, ProgramBench measures behavioral equivalence. Models may use entirely different languages, algorithms, or architectures, provided the final input‑output behavior matches the reference program.
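In practice, behavioral equivalence amounts to running the reference program and the rebuilt one on the same inputs and comparing what an outside observer sees. The harness below is a minimal sketch of that idea; the command names and the stdout-plus-exit-code comparison are assumptions for illustration, not ProgramBench's actual evaluation code.

```python
import subprocess

def behaviorally_equivalent(reference_cmd, candidate_cmd, test_inputs,
                            timeout=10):
    """Compare two programs purely by observable input/output behavior.

    Neither implementation language nor internal structure matters:
    for every test input, stdout and the exit code must match.
    """
    for stdin_data in test_inputs:
        ref = subprocess.run(reference_cmd, input=stdin_data,
                             capture_output=True, text=True, timeout=timeout)
        cand = subprocess.run(candidate_cmd, input=stdin_data,
                              capture_output=True, text=True, timeout=timeout)
        if (ref.stdout, ref.returncode) != (cand.stdout, cand.returncode):
            return False
    return True

# e.g. a reference C binary versus a model's Python rewrite:
# behaviorally_equivalent(["./ref_tool"], ["python", "rebuilt.py"],
#                         ["1 2 3\n", "hello\n"])
```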
The benchmark also reveals a systematic bias: models favor monolithic, single‑file implementations with shallow directory structures, diverging sharply from human‑engineered code that separates concerns across files. For example, human projects typically place configuration in config.json, utilities in utils.py, database logic in db.py, and connect them via import statements, whereas models tend to collapse everything into one giant script.
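To make the contrast concrete, here is a sketch of the human-style layout just described. The file names come from the example above; the module contents (a Database class, a setup_logging helper) are hypothetical stand-ins.

```python
# project/
#   config.json   # runtime settings, e.g. {"db_url": "...", "log_level": "INFO"}
#   utils.py      # shared helpers (assumed here to define setup_logging)
#   db.py         # database logic (assumed here to define a Database class)
#   app.py        # thin entry point that wires the modules together
#
# app.py connects the pieces via imports:
import json

from db import Database          # hypothetical class defined in db.py
from utils import setup_logging  # hypothetical helper defined in utils.py

def main() -> None:
    with open("config.json") as f:
        config = json.load(f)
    setup_logging(config["log_level"])
    Database(config["db_url"]).connect()

if __name__ == "__main__":
    main()
```

A model-generated solution, by contrast, tends to inline the configuration constants, the helpers, and the database class into one long script.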
Language‑specific analysis shows that traditional C/C++ projects attain the highest completion rates, while Rust projects perform the worst, highlighting models’ difficulty with the modular, ownership‑centric design patterns prevalent in Rust.
Critics argue that ProgramBench may simply reward memorization of open‑source projects. The benchmark's authors acknowledge the overfitting risk but maintain that the benchmark's purpose is to push models toward higher‑level engineering intelligence, not to mimic average human ability. Known limitations include the absence of agent‑driven harnesses (e.g., Claude Code, Codex), the lack of fine‑grained progress metrics, and the enforced offline setting, imposed to prevent cheating.
The overarching conclusion is that the current bottleneck for AI‑assisted coding is not local code generation but long‑horizon software‑system construction. The industry is therefore shifting focus toward memory, agents, repo‑level reasoning, and autonomous software engineering as the next competitive frontier.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
