Research

ARC-AGI-3 Humiliates Every Frontier AI Model — Humans Still Win

Michael Ouroumis · 3 min read

Every few months, an AI lab claims its latest model is "approaching human-level reasoning." The ARC-AGI-3 benchmark just called their bluff — and the results are devastating.

The third iteration of the Abstraction and Reasoning Corpus, published by ARC Prize, tested every major frontier AI model on a set of visual pattern-matching and reasoning tasks. The results: not a single model broke 1%. Gemini 3.1 Pro led the pack with a score of 0.37%. The same tasks were solved by 100% of human participants on their first attempt.

The Scores

The full leaderboard reads like a humiliation parade for the AI industry: not a single frontier model cleared 1%, with Gemini 3.1 Pro's 0.37% the high-water mark.

To put this in perspective: every human test subject — regardless of age, education, or technical background — solved these problems correctly the first time they tried. The gap between human performance (100%) and the best AI performance (0.37%) isn't a gap. It's a chasm.

What ARC-AGI-3 Actually Tests

The ARC benchmark series, created by AI researcher and Keras creator François Chollet, is specifically designed to measure abstract reasoning — the ability to identify patterns in novel situations and apply flexible rules that were never seen during training.

Each task presents a small grid of colored cells with an input-output pattern. The test taker must figure out the underlying rule and apply it to a new input. The tasks are trivially easy for humans because they require the kind of basic abstraction that comes naturally to human cognition: spatial reasoning, object persistence, symmetry detection, and rule inference.
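The task format described above can be sketched in a few lines of Python. This is an illustrative toy task, not one from the actual benchmark: grids are lists of rows of integer color codes, and the hidden rule here is a simple left-right mirror.

```python
# Illustrative ARC-style task (hypothetical, not from ARC-AGI-3):
# grids are lists of rows, each cell an integer color code.

def mirror_lr(grid):
    """Reflect each row left to right."""
    return [row[::-1] for row in grid]

# Demonstration input/output pairs a test taker would see.
train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
]

# The solver must infer the rule from the pairs, then apply it
# to a held-out test input.
test_input = [[3, 0],
              [0, 4]]

# Verify the candidate rule explains every demonstration pair.
assert all(mirror_lr(inp) == out for inp, out in train_pairs)

print(mirror_lr(test_input))  # [[0, 3], [4, 0]]
```

The hard part, of course, is not applying a known rule but inferring it from one or two examples — that inference step is what humans do effortlessly and models do not.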

ARC-AGI-3 raises the difficulty from previous iterations by introducing more compositional reasoning — tasks that require chaining multiple abstract rules together. But "more difficult" is relative. Humans still found them straightforward.
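What "compositional" means here can be sketched with two hypothetical primitives (again, not rules from the actual benchmark): the hidden transformation is a chain of simpler operations, so a solver must discover each primitive and the order in which they compose.

```python
# Illustrative sketch of a compositional rule: the hidden
# transformation chains two primitives (hypothetical examples).

def mirror_lr(grid):
    """Reflect each row left to right."""
    return [row[::-1] for row in grid]

def recolor(grid, mapping):
    """Replace each cell's color via a lookup table."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def hidden_rule(grid):
    """Composite rule: mirror first, then swap colors 1 and 2."""
    return recolor(mirror_lr(grid), {1: 2, 2: 1})

example_input = [[1, 0],
                 [0, 2]]
print(hidden_rule(example_input))  # [[0, 2], [1, 0]]
```

Each primitive alone is trivial; the difficulty comes from the search over which primitives are in play and how they compose, a space that grows combinatorially with chain length.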

What This Means for AGI Claims

The benchmark's creator has been blunt about the implications. Current AI models, regardless of scale, are fundamentally limited in their ability to reason abstractly about novel problems. They can simulate reasoning on tasks similar to their training data, but when confronted with genuinely new patterns, they collapse.

This directly challenges the narrative from major AI labs that scaling — more parameters, more data, more compute — is a reliable path to artificial general intelligence. ARC-AGI-3 suggests that the gap between statistical pattern matching and genuine reasoning isn't closing with scale. It may require entirely different approaches.

The Industry Response

AI labs have been notably quiet about ARC-AGI-3. None of the major model providers issued statements about the benchmark results. This silence contrasts sharply with the eager press releases that typically accompany favorable benchmark scores.

Some researchers have pushed back on the benchmark's relevance, arguing that abstract visual reasoning is just one dimension of intelligence and that current models excel at many tasks humans find difficult. That's true — but it misses the point. ARC-AGI-3 isn't measuring whether AI can do hard things. It's measuring whether AI can do easy things that require actual understanding.

The Uncomfortable Truth

ARC-AGI-3 doesn't prove AGI is impossible. It proves that we don't have it yet, and that the current paradigm of large language models isn't obviously converging toward it. The tasks that stump billion-dollar AI systems are the same ones a child can solve by looking at a picture for ten seconds.

That's not a benchmark failure. That's a reality check.
