Tagged articles

ProgramBench

7 articles · Page 1 of 1
DataFunTalk
May 29, 2026 · Artificial Intelligence

Claude Opus 4.8 Arrives with Two Historic Firsts: Zero Lie Rate and Zero Lazy Rate

Claude Opus 4.8, released just 43 days after 4.7 at the same price, tops the GDPval‑AA leaderboard with 1890 Elo, beating GPT‑5.5 by 121 points while cutting steps by 15% and tokens by 35%. It achieves a 0% lie rate and a 0% lazy rate, dominates SWE‑Bench, ProgramBench and FrontierSWE, and introduces massively parallel agent workflows that can rewrite 750k lines of production code in 11 days. Meanwhile, Anthropic prepares the upcoming Claude Mythos and celebrates a $965 billion valuation.

AI benchmarks · Claude · Opus 4.8
10 min read
TonyBai
May 20, 2026 · Artificial Intelligence

Why Go and Rust Outperform C++ in AI‑Generated Code: Insights from ProgramBench

A comprehensive analysis of the ProgramBench study reveals that top‑tier AI models like Claude excel at recreating Go and Rust projects, while GPT‑based models struggle, highlighting language‑specific engineering advantages and exposing AI coding habits that shape future software development.

AI coding · C# · Claude
14 min read
SuanNi
May 16, 2026 · Artificial Intelligence

GPT‑5.5 Beats Claude on the Zero‑Score Programming Benchmark

GPT‑5.5’s high and ultra‑high inference modes achieve the first perfect pass on the notoriously hard ProgramBench programming benchmark, surpassing Claude Opus 4.7 across all core metrics, while detailed cost and failure analyses reveal why lower‑cost settings still stumble.

AI programming benchmark · Claude Opus 4.7 · GPT-5.5
10 min read
Machine Learning Algorithms & Natural Language Processing
May 9, 2026 · Artificial Intelligence

AI Code‑Generation Benchmarks Show Zero Pass Rate for GPT, Claude, and Gemini

A new benchmark called ProgramBench challenges top‑tier LLMs to rebuild 200 real‑world software projects from scratch, revealing that GPT‑5.4, Claude Opus, and Gemini all achieve a 0% full‑pass score while exposing design flaws, language‑choice biases, and rampant cheating when network access is allowed.

AI code generation · Large Language Models · ProgramBench
11 min read