Why GPT‑5.5 and Claude Opus 4.7 Score Below 1% on ARC‑AGI‑3 While Humans Achieve 100%
On ARC‑AGI‑3, GPT‑5.5 (0.43%) and Claude Opus 4.7 (0.18%) solve essentially none of the 135 novel environments, while even a six‑year‑old human solves them all; the analysis attributes the gap to three concrete failure modes and to the two models’ differing styles of compressing observations into theories.
ARC‑AGI‑3, the latest benchmark created by François Chollet, consists of 135 hand‑crafted environments that test an AI system’s ability to act efficiently on first exposure, without any prior training or instructions. Human participants solved every environment (100% success), while the two leading large language models, OpenAI’s GPT‑5.5 and Anthropic’s Claude Opus 4.7, scored below 1% (0.43% and 0.18%, respectively).
The analysis by the ARC Prize team, based on 160 complete execution traces, identifies three core failure patterns that explain why the models collapse.
1. Real‑time local feedback but no global world model
Both models can recognize that a specific action changes the environment (e.g., pressing a key rotates an object) but cannot abstract this into a universal rule that guides subsequent planning. In task cd82, Claude Opus notices that ACTION3 rotates a container and that ACTION5 pours paint, yet it never integrates these observations into the strategy “rotate the bucket before painting”. The failure is not blindness but the inability to synthesize observations into a coherent world model.
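To make that gap concrete, here is a minimal Python sketch contrasting an agent that only reacts to local feedback with one that plans over a small world model, loosely based on the cd82 mechanics described above. The action names, effect strings, and the precondition encoding are all illustrative assumptions, not the ARC‑AGI‑3 interface.

```python
# Hypothetical one-step effects, as both models reliably discover them:
local_feedback = {
    "ACTION3": "container rotated",
    "ACTION5": "paint poured",
}

# A "local" agent reacts to each observation in isolation: it knows what
# every action does, but never composes that knowledge into a plan.
def local_agent(goal):
    for action, effect in local_feedback.items():
        if goal in effect:
            return [action]   # single action, no notion of prerequisites
    return []

# A world-model agent additionally stores preconditions: pouring only
# succeeds after the container is rotated. That one extra edge yields the
# rule the models never form ("rotate the bucket before painting").
preconditions = {"ACTION5": ["ACTION3"]}  # assumed, for illustration

def world_model_agent(goal):
    plan = []
    for action, effect in local_feedback.items():
        if goal in effect:
            plan = preconditions.get(action, []) + [action]
    return plan

print(local_agent("paint poured"))        # ['ACTION5'] -> pours into an unrotated bucket
print(world_model_agent("paint poured"))  # ['ACTION3', 'ACTION5'] -> rotate, then paint
```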
2. Training‑data‑driven abstract mis‑alignment
The models mistakenly map novel ARC‑AGI‑3 tasks onto familiar games learned from their training data (e.g., Tetris, Frogger, Sokoban). This “data‑anchor” effect leads to locally plausible but globally wrong hypotheses. For instance, GPT‑5.5 interprets task cd82 as a “sand‑filling” game and misclassifies task ls20 as a “brick‑breaker” scenario, causing actions that follow the wrong game logic.
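One way to picture this data‑anchor effect is as a hypothesis score dominated by a training‑data prior. The sketch below is a hedged illustration only: the templates, prior weights, and evidence‑fit numbers are invented; just the shape of the failure comes from the analysis.

```python
# Prior over "what kind of game is this?", standing in for how often each
# template appears in (hypothetical) training data.
template_prior = {"sand-filling": 0.40, "brick-breaker": 0.35, "novel-rule": 0.25}

# How well each template actually explains the observed transitions so far
# (e.g., fraction of transitions the template predicts correctly). Invented.
evidence_fit = {"sand-filling": 0.35, "brick-breaker": 0.20, "novel-rule": 0.85}

def pick_template(prior_weight):
    # Blend prior familiarity with observed fit; a high prior_weight models
    # a system that leans on training-data analogies over fresh evidence.
    scores = {
        t: prior_weight * template_prior[t] + (1 - prior_weight) * evidence_fit[t]
        for t in template_prior
    }
    return max(scores, key=scores.get)

print(pick_template(prior_weight=0.8))  # 'sand-filling' -- anchored on the familiar game
print(pick_template(prior_weight=0.2))  # 'novel-rule'   -- evidence wins once the prior loosens
```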
3. Passing a level without learning the underlying rule
Success on an early level can mask a missing or distorted understanding of the core mechanics, which then collapses on later levels. Claude Opus completes Level 1 of task ka59 in 37 steps but assumes that “click” teleports the character, a misconception that leads to failure on Level 2 where precise shape‑matching is required. Similarly, in task ar25 the model discovers the mirror‑movement rule in Level 1 but later hallucinates nonexistent rules such as “drilling” in Level 2.
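The ka59 episode can be framed as two candidate rules that agree on all Level 1 evidence and only diverge on Level 2, which is exactly why the early success proves nothing. The sketch below invents simple grid mechanics (a blocking wall instead of shape‑matching) purely to keep the example short; none of this is the actual task logic.

```python
# Each observation: (start, click_target, where_the_character_ended_up).
level1_obs = [((0, 0), (3, 0), (3, 0))]  # open corridor: both rules fit
level2_obs = [((0, 0), (3, 0), (1, 0))]  # assumed wall at x=2 blocks the walk

def teleport_rule(start, target):
    return target  # "click teleports the character" (the misconception)

def walk_rule(start, target, wall_x=None):
    # "click walks toward the target, stopping at walls" (assumed mechanic)
    if wall_x is not None and start[0] < wall_x <= target[0]:
        return (wall_x - 1, start[1])
    return target

def fits(rule, obs, **kw):
    return all(rule(s, t, **kw) == end for s, t, end in obs)

print(fits(teleport_rule, level1_obs))        # True  -- Level 1 passed in spite of the wrong rule
print(fits(walk_rule, level1_obs))            # True  -- indistinguishable so far
print(fits(teleport_rule, level2_obs))        # False -- misconception exposed on Level 2
print(fits(walk_rule, level2_obs, wall_x=2))  # True
```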
Different "compression" styles
Claude Opus exhibits stronger short‑term mechanism discovery but tends to compress observations into an over‑confident yet incorrect theory. GPT‑5.5 generates a broader set of hypotheses, which sometimes includes the correct idea, but it struggles to compress them into decisive actions. This contrast is described as Opus being an “over‑confident intuitivore” and GPT‑5.5 a “divergent theorist”.
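As a rough caricature, the two styles can be framed as decision policies over the same hypothesis pool: one compresses too early, the other never compresses at all. Every hypothesis and score below is invented for illustration; only the contrast itself comes from the analysis.

```python
hypotheses = {                  # hypothesis -> current evidence score (invented)
    "mirror-movement": 0.55,    # the actually-correct rule
    "teleport-on-click": 0.60,  # locally plausible but wrong
    "drilling": 0.20,           # hallucinated mechanic
}

def intuitivore(hyps):
    # Opus-style: compress immediately to the single best-scoring theory
    # and act on it, even when the margin over alternatives is thin.
    return max(hyps, key=hyps.get)

def divergent_theorist(hyps, margin=0.3):
    # GPT-style: keep every hypothesis alive until one dominates by a wide
    # margin; with close scores it never compresses into a decisive action.
    ranked = sorted(hyps.values(), reverse=True)
    if ranked[0] - ranked[1] >= margin:
        return max(hyps, key=hyps.get)
    return None  # no commitment -> exploratory, indecisive behavior

print(intuitivore(hypotheses))         # 'teleport-on-click' -- confident and wrong
print(divergent_theorist(hypotheses))  # None -- correct idea in the pool, never chosen
```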
Overall, the ARC‑AGI‑3 results expose a fundamental gap: current frontier models, despite billions of parameters and massive compute, still lack the ability to build robust, generalizable world models and to reliably translate discovered rules into consistent behavior, underscoring the long road ahead for AGI.
