The ARC Prize Foundation has published a fresh analysis of why today's strongest frontier reasoning models still collapse on ARC-AGI-3, the foundation's interactive reasoning benchmark. The post, dated May 1, 2026, audits 160 replays and reasoning traces from OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7, and finds that both stay under 1% while exhibiting strikingly different failure profiles.
GPT-5.5 finished at 0.43%. Opus 4.7 came in at 0.18%. Those numbers are not new on their own; what is new is the pattern.
Three recurring failure modes
The foundation says three systematic errors explain almost all of the misses. The first is a world-model gap: models can perceive a local effect — for example, that pressing a particular action rotates an object — but cannot translate that observation into a global rule that governs the environment. One model correctly noticed that 'ACTION3 rotates the object,' yet missed that 'rotation controls which side gets a new value.'
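The gap between noticing a local effect and inferring the global rule it participates in can be made concrete with a toy environment. The sketch below is hypothetical and not the actual ARC-AGI-3 game: an action visibly rotates an object (the local effect a model can observe directly), while the object's current facing silently decides where the next value lands (the global rule it must infer).

```python
# Toy environment (invented for illustration, not from the report):
# ACTION3 rotates an object 90 degrees, and the object's current facing
# determines which side of the grid receives the next emitted value.
class ToyEnv:
    SIDES = ["top", "right", "bottom", "left"]

    def __init__(self):
        self.facing = 0                          # index into SIDES
        self.values = {s: [] for s in self.SIDES}

    def action3(self):
        # Local effect a model can observe directly: the object rotates.
        self.facing = (self.facing + 1) % 4

    def emit(self, value):
        # Global rule a model must infer: rotation controls placement.
        self.values[self.SIDES[self.facing]].append(value)

env = ToyEnv()
env.action3()   # object now faces "right"
env.emit(7)     # value lands on the right side of the grid
```

A model that only reports "ACTION3 rotates the object" has captured `action3` but not `emit` — exactly the world-model gap the analysis describes.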
The second is faulty abstraction from training. When confronted with an unfamiliar grid-world, models repeatedly map it onto something they have seen before, like Tetris, Sokoban, or Frogger, and then test affordances that do not apply.
The third is what the authors call success without understanding. A model can stumble through a level on a wrong theory and then carry that misconception forward into later levels, where it predictably breaks. As the analysis puts it, beating a level is not the same as understanding it.
Different failure styles
The sharpest line in the post is the contrast between the two models: 'Opus compressed its observations into a confident-but-wrong theory. GPT-5.5 had difficulty compressing at all.' Opus tends to lock in early on a clean — and often incorrect — explanation. GPT-5.5, by contrast, struggles to commit to any single interpretation and oscillates between hypotheses, leaving its reasoning traces scattered.
Both styles cap out at roughly the same place on the leaderboard, but the diagnosis is different, and the implied fixes are different too.
Why this matters beyond a benchmark
ARC-AGI-3 is intentionally novel and long-horizon — the kind of setting where you cannot win by pattern-matching to pretraining data. The foundation's argument is that score-only reporting hides this. Two models can post nearly identical near-zero numbers and yet be failing for completely different reasons, which means the engineering levers needed to push them up are not the same.
The report frames replay analysis as a missing layer in frontier evaluation: without it, labs end up tuning models against a number that does not tell them which capability is actually broken. Expect that argument to land in an already-heated debate over how much of recent frontier progress reflects genuine reasoning versus more elaborate retrieval.
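The point about score-only reporting can be illustrated with a toy tally. The sketch below uses invented replay annotations (the failure-mode labels mirror the report's three categories, but the data and function are hypothetical, not the foundation's tooling) to show how two models with near-identical scores can have very different failure-mode distributions:

```python
from collections import Counter

# Hypothetical replay annotations: (model, failure_mode) pairs.
# Labels echo the report's categories; the counts are illustrative.
replays = [
    ("gpt-5.5", "no_global_rule"), ("gpt-5.5", "no_global_rule"),
    ("gpt-5.5", "faulty_prior"),
    ("opus-4.7", "faulty_prior"), ("opus-4.7", "faulty_prior"),
    ("opus-4.7", "success_without_understanding"),
]

def failure_profile(replays, model):
    """Return each failure mode's share of one model's failed replays."""
    modes = Counter(mode for m, mode in replays if m == model)
    total = sum(modes.values())
    return {mode: count / total for mode, count in modes.items()}

for model in ("gpt-5.5", "opus-4.7"):
    print(model, failure_profile(replays, model))
```

Both models fail every replay here, so a score-only leaderboard would rank them identically, yet the profiles point at different engineering levers.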
For builders, the practical takeaway is narrower. If you are deploying GPT-5.5 or Opus 4.7 into agentic workflows that require building a model of an unfamiliar environment from first principles, the ARC Prize data is a reminder that confident outputs in those settings can be exactly the wrong signal to trust.