Back to stories
Research

ARC-AGI-3 Humiliates Every Frontier AI Model — Humans Still Win

Michael Ouroumis3 min read
ARC-AGI-3 Humiliates Every Frontier AI Model — Humans Still Win

Every few months, an AI lab claims its latest model is "approaching human-level reasoning." The ARC-AGI-3 benchmark just called their bluff — and the results are devastating.

The third iteration of the Abstraction and Reasoning Corpus, published by ARC Prize, tested every major frontier AI model on a set of visual pattern-matching and reasoning tasks. The results: not a single model broke 1%. Gemini 3.1 Pro led the pack with a score of 0.37%. The same tasks were solved by 100% of human participants on their first attempt.

The Scores

The full leaderboard reads like a humiliation parade for the AI industry:

To put this in perspective: every human test subject — regardless of age, education, or technical background — solved these problems correctly the first time they tried. The gap between human performance (100%) and the best AI performance (0.37%) isn't a gap. It's a chasm.

What ARC-AGI-3 Actually Tests

The ARC benchmark series, created by AI researcher and Keras creator François Chollet, is specifically designed to measure abstract reasoning — the ability to identify patterns in novel situations and apply flexible rules that were never seen during training.

Each task presents a small grid of colored cells with an input-output pattern. The test taker must figure out the underlying rule and apply it to a new input. The tasks are trivially easy for humans because they require the kind of basic abstraction that comes naturally to human cognition: spatial reasoning, object persistence, symmetry detection, and rule inference.

ARC-AGI-3 raises the difficulty from previous iterations by introducing more compositional reasoning — tasks that require chaining multiple abstract rules together. But "more difficult" is relative. Humans still found them straightforward.

What This Means for AGI Claims

The benchmark's creator has been blunt about the implications. Current AI models, regardless of scale, are fundamentally limited in their ability to reason abstractly about novel problems. They can simulate reasoning on tasks similar to their training data, but when confronted with genuinely new patterns, they collapse.

This directly challenges the narrative from major AI labs that scaling — more parameters, more data, more compute — is a reliable path to artificial general intelligence. ARC-AGI-3 suggests that the gap between statistical pattern matching and genuine reasoning isn't closing with scale. It may require entirely different approaches.

The Industry Response

AI labs have been notably quiet about ARC-AGI-3. None of the major model providers issued statements about the benchmark results. This silence contrasts sharply with the eager press releases that typically accompany favorable benchmark scores.

Some researchers have pushed back on the benchmark's relevance, arguing that abstract visual reasoning is just one dimension of intelligence and that current models excel at many tasks humans find difficult. That's true — but it misses the point. ARC-AGI-3 isn't measuring whether AI can do hard things. It's measuring whether AI can do easy things that require actual understanding.

The Uncomfortable Truth

ARC-AGI-3 doesn't prove AGI is impossible. It proves that we don't have it yet, and that the current paradigm of large language models isn't obviously converging toward it. The tasks that stump billion-dollar AI systems are the same ones a child can solve by looking at a picture for ten seconds.

That's not a benchmark failure. That's a reality check.

Learn AI for Free — FreeAcademy.ai

Take "AI Essentials: Understanding AI in 2026" — a free course with certificate to master the skills behind this story.

More in Research

Anthropic's Mythos Is Finding Bugs Faster Than Open-Source Teams Can Patch Them
Research

Anthropic's Mythos Is Finding Bugs Faster Than Open-Source Teams Can Patch Them

Bloomberg reporting this week highlights a lopsided new reality: Anthropic's Mythos model has surfaced thousands of high- and critical-severity vulnerabilities across major operating systems and browsers, but fewer than 1% have been patched by maintainers.

13 hours ago3 min read
Physical Intelligence's π0.7 Robot Brain Teaches Itself Tasks It Was Never Trained On
Research

Physical Intelligence's π0.7 Robot Brain Teaches Itself Tasks It Was Never Trained On

Physical Intelligence's new π0.7 model shows early signs of compositional generalization, letting robots fold laundry and operate new kitchen appliances without task-specific training data.

14 hours ago3 min read
Anthropic Refuses to Fix MCP Flaw Putting 200,000 Servers at Risk
Research

Anthropic Refuses to Fix MCP Flaw Putting 200,000 Servers at Risk

OX Security researchers disclosed a systemic design flaw in Anthropic's Model Context Protocol affecting 150M+ downloads and roughly 200,000 servers. Anthropic declined to modify the architecture, calling the behavior expected.

22 hours ago3 min read