Humans Achieve 100% While Top AI Models Score Below 0.4% on ARC‑AGI‑3 Benchmark

In the ARC‑AGI‑3 test, 486 random humans solved all 150+ game‑based puzzles with a perfect 100% success rate in a median of 7.4 minutes, whereas leading models such as GPT‑5, Claude Opus 4.6, Gemini 3.1 Pro and Grok 4.20 managed at most 0.37%, exposing a stark gap in meta‑cognitive reasoning.

Lao Guo's Learning Space

Last Wednesday a new AI benchmark called ARC‑AGI‑3 sparked a heated debate in the global AI community. The test asks participants to solve a series of interactive 64×64 grid games without any rules, goals, or dialogue, requiring pure observation and action.

Human Performance

486 people recruited randomly on the streets of San Francisco played the games, and all achieved a perfect 100% success rate, with a median completion time of 7.4 minutes. Their success relied on innate reasoning abilities that do not depend on language or memory, namely:

Object permanence: understanding that objects continue to exist even when out of sight.

Geometric intuition: instantly perceiving patterns in shape transformations.

Causal inference: linking actions to observed effects.

Exploratory drive: learning by trial and error.

These capabilities appear in infants as young as six months.

AI Performance

When the same games were given to today’s leading large‑model AIs, the results were dramatically lower:

GPT‑5: 0.26%

Claude Opus 4.6: 0.25%

Grok 4.20: 0%

Gemini 3.1 Pro (preview): 0.37%

The scoring formula S = min(1, (human_steps ÷ AI_steps)²) penalizes AIs that take many steps: an AI needing 100 steps on a level a human solves in 10 receives a score of only (10 ÷ 100)² = 1%.
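The quoted formula can be checked directly. A minimal helper (the function name is mine, not from the ARC harness) reproduces the 1% example:

```python
def arc_agi3_score(human_steps: int, ai_steps: int) -> float:
    """Efficiency score as quoted in the article: capped at 1,
    with a quadratic penalty for needing more steps than humans."""
    if ai_steps <= 0:
        raise ValueError("ai_steps must be positive")
    return min(1.0, (human_steps / ai_steps) ** 2)

# The article's example: human solves a level in 10 steps, AI needs 100.
print(arc_agi3_score(10, 100))  # ≈ 0.01, i.e. 1%
```

Note the quadratic term: an AI that takes twice as many steps as a human loses three quarters of the score, not half, which rewards step efficiency heavily.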

Why Humans Succeed

Humans can observe, act, re‑observe, and act again until they infer the hidden rules, goals, and optimal paths. This iterative, model‑building process lets them master each of the 150+ original environments—each designed by professional game designers to be a completely novel “unknown unknown.”

Why Current AIs Fail

According to the ARC team, today’s AIs lack meta‑cognition, the ability to “know what they don’t know.” Models such as GPT‑5 and Claude tend to apply pre‑trained patterns, forming rigid hypotheses that they do not revise even when observations contradict them. Moreover, they are fundamentally instruction‑driven; the ARC‑AGI‑3 environments provide no explicit commands, forcing the AI to explore and build a world model autonomously, a capability none of the evaluated large models demonstrated.
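The revision loop the models lack can be sketched in toy form. Everything below is invented for illustration and is not the ARC harness: a hidden rule maps actions to outcomes, and the agent keeps a set of candidate rules, discarding any candidate its observations contradict rather than clinging to one guess:

```python
# Hidden environment rule, unknown to the agent (invented for illustration).
def hidden_rule(action: int) -> int:
    return (action * 2) % 5

# The agent starts with several candidate rules (multipliers 1..4) and
# keeps only those consistent with everything observed so far.
candidates = {k: (lambda a, k=k: (a * k) % 5) for k in range(1, 5)}

for action in range(5):  # explore: try each action once and observe
    observed = hidden_rule(action)
    candidates = {k: f for k, f in candidates.items() if f(action) == observed}

print(sorted(candidates))  # only the true multiplier, k = 2, survives
```

The key contrast with the behavior described above: a contradicting observation here eliminates a hypothesis permanently, instead of being ignored in favor of a pre‑trained pattern.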

Reinforcement‑Learning Outlier

A system named StochasticGoose achieved a score of 12.58% using a CNN + reinforcement learning architecture that learns by trial‑and‑error directly in the environment, rather than relying on massive text pre‑training.
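The article does not detail StochasticGoose’s architecture beyond “CNN + reinforcement learning,” but the trial‑and‑error principle it relies on can be shown with a minimal tabular Q‑learning loop on a toy corridor environment. All names and parameters here are invented for illustration; a tabular value table stands in for the CNN policy:

```python
import random

random.seed(0)

# Toy corridor: states 0..4, reward only for reaching the goal state 4.
# The agent learns purely from interaction, not from text pre-training.
N_STATES, ACTIONS = 5, (-1, +1)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for _ in range(500):
    s = 0
    while s != N_STATES - 1:
        if random.random() < eps:                      # explore
            a = random.choice(ACTIONS)
        else:                                          # exploit current values
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)          # clamp to the corridor
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Standard Q-learning update toward the bootstrapped target.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The greedy policy learned from trial and error alone: move right (+1).
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
print(policy)
```

No instructions, rewards shapes, or demonstrations are given to the agent beyond the terminal reward, which is the same regime the ARC‑AGI‑3 games impose.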

This result revived the discussion that “old‑school” methods can outperform trillion‑parameter models on truly novel tasks.

Implications for AGI Claims

Recent hype—e.g., “o3 has achieved AGI” or “Claude rivals humans”—is challenged by ARC‑AGI‑3, which shows that solving familiar benchmarks does not equate to genuine intelligence. True intelligence should excel in completely unseen environments, learning efficiently without prior exposure.

The test’s design replaces the private evaluation set for each run, eliminating any possibility of memorization.

Future Directions

The ARC Prize 2026 competition now offers a $850,000 prize pool (with $700,000 for a perfect solution) and requires all submissions to be open‑source. Current research avenues include:

Reinforcement learning with on‑policy training.

World‑model construction for predicting unseen environments.

Neuro‑symbolic integration to boost logical reasoning.

Meta‑cognitive architectures that can assess the validity of their own hypotheses.

These directions contrast sharply with the past two years’ focus on scaling model parameters.

Conclusion

Although one company claims to have solved all ARC‑AGI‑3 levels, the prize remains unclaimed, underscoring the difficulty of the challenge. The test provides a clear, quantitative marker of how far current AI is from human‑level general intelligence.

Source: ARC‑AGI‑3 paper, ARC Prize Foundation, March 2026; data aggregated from XinZhiYuan, Tencent AI Pioneer, Industrial Intelligence Network.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language models, benchmark, AGI, reinforcement learning, meta-cognition, ARC-AGI-3
Written by

Lao Guo's Learning Space

AI learning, discussion, and hands‑on practice with self‑reflection
