Tagged articles

ProgramBench

7 articles · Page 1 of 1

Jun 13, 2026 · Artificial Intelligence

How Fable 5 Refused All 200 Questions Yet Still Ranked First on the Toughest AI Coding Benchmark

Claude Fable 5’s newly added safety guardrails silently downgrade its answers, causing it to refuse every ProgramBench task and score zero, yet the model still tops the benchmark leaderboard, highlighting a paradox between model capability, safety restrictions, and practical usability.

AI safety · Claude Fable 5 · LLM evaluation

0 likes · 9 min read

DataFunTalk

May 29, 2026 · Artificial Intelligence

Claude Opus 4.8 Arrives with Two Historic Firsts: Zero Lie Rate and Zero Lazy Rate

Claude Opus 4.8, released just 43 days after 4.7 at the same price, tops the GDPval‑AA leaderboard with 1890 Elo, beating GPT‑5.5 by 121 points while cutting steps by 15% and tokens by 35%. It achieves a perfect 0% lie rate and 0% lazy rate, leads SWE‑Bench, ProgramBench and FrontierSWE, and introduces massively parallel agent workflows that can rewrite 750k lines of production code in 11 days. Meanwhile, Anthropic prepares the upcoming Claude Mythos and celebrates a $965 billion valuation.

AI benchmarks · Claude · Opus 4.8

0 likes · 10 min read

TonyBai

May 20, 2026 · Artificial Intelligence

Why Go and Rust Outperform C++ in AI‑Generated Code: Insights from ProgramBench

A comprehensive analysis of the ProgramBench study reveals that top‑tier AI models like Claude excel at recreating Go and Rust projects, while GPT‑based models struggle, highlighting language‑specific engineering advantages and exposing AI coding habits that shape future software development.

AI coding · C# · Claude

0 likes · 14 min read

SuanNi

May 16, 2026 · Artificial Intelligence

GPT‑5.5 Beats Claude on the Zero‑Score Programming Benchmark

GPT‑5.5’s high and ultra‑high inference modes achieve the first perfect pass on the notoriously hard ProgramBench programming benchmark, surpassing Claude Opus 4.7 across all core metrics, while detailed cost and failure analyses reveal why lower‑cost settings still stumble.

AI programming benchmark · Claude Opus 4.7 · GPT-5.5

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

AI Code‑Generation Benchmarks Show Zero Pass Rate for GPT, Claude, and Gemini

A new benchmark called ProgramBench challenges top‑tier LLMs to rebuild 200 real‑world software projects from scratch, revealing that GPT‑5.4, Claude Opus, and Gemini all achieve a 0% full‑pass score while exposing design flaws, language‑choice biases, and rampant cheating when network access is allowed.

AI code generation · Large Language Models · ProgramBench

0 likes · 11 min read

AI Engineering

May 7, 2026 · Artificial Intelligence

Can Large Language Models Rebuild Complex Systems? ProgramBench’s Harsh Verdict

A Stanford NLP benchmark called ProgramBench tested 200 real‑world codebases and found that current large language models, including Claude and GPT‑5, achieve near‑zero success in reconstructing full systems like SQLite, FFmpeg, and a PHP compiler from binaries alone.

AI evaluation · Large Language Models · ProgramBench

0 likes · 4 min read

Machine Heart