The ARC Prize Foundation has published a fresh analysis of why today's strongest frontier reasoning models still collapse on ARC-AGI-3, the foundation's interactive reasoning benchmark. The post, dated May 1, 2026, audits 160 replays and reasoning traces from OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7, and finds that both stay under 1% while exhibiting strikingly different failure profiles.
GPT-5.5 finished at 0.43%. Opus 4.7 came in at 0.18%. Those numbers are not new on their own; what is new is the pattern.
Three recurring failure modes
The foundation says three systematic errors explain almost all of the misses. The first is a world-model gap: models can perceive a local effect — for example, that pressing a particular action rotates an object — but cannot translate that observation into a global rule that governs the environment. One model correctly noticed that 'ACTION3 rotates the object,' yet missed that 'rotation controls which side gets a new value.'
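The gap between noticing a local effect and inferring the global rule it participates in can be made concrete with a toy environment. The sketch below is hypothetical and not the actual ARC-AGI-3 game: an action visibly rotates an object (the local effect a model can observe directly), while the object's current facing silently decides where the next value lands (the global rule it must infer).

```python
# Toy environment (invented for illustration, not from the report):
# ACTION3 rotates an object 90 degrees, and the object's current facing
# determines which side of the grid receives the next emitted value.
class ToyEnv:
    SIDES = ["top", "right", "bottom", "left"]

    def __init__(self):
        self.facing = 0                          # index into SIDES
        self.values = {s: [] for s in self.SIDES}

    def action3(self):
        # Local effect a model can observe directly: the object rotates.
        self.facing = (self.facing + 1) % 4

    def emit(self, value):
        # Global rule a model must infer: rotation controls placement.
        self.values[self.SIDES[self.facing]].append(value)

env = ToyEnv()
env.action3()   # object now faces "right"
env.emit(7)     # value lands on the right side of the grid
```

A model that only reports "ACTION3 rotates the object" has captured `action3` but not `emit` — exactly the world-model gap the analysis describes.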
The second is faulty abstraction from training. When confronted with an unfamiliar grid-world, models repeatedly map it onto something they have seen before, like Tetris, Sokoban, or Frogger, and then test affordances that do not apply.
The third is what the authors call success without understanding. A model can stumble through a level on a wrong theory and then carry that misconception forward into later levels, where it predictably breaks. As the analysis puts it, beating a level is not the same as understanding it.
Different failure styles
The sharpest line in the post is the contrast between the two models: 'Opus compressed its observations into a confident-but-wrong theory. GPT-5.5 had difficulty compressing at all.' Opus tends to lock in early on a clean — and often incorrect — explanation. GPT-5.5, by contrast, struggles to commit to any single interpretation and oscillates between hypotheses, leaving its reasoning traces scattered.
Both styles cap out at roughly the same place on the leaderboard, but the diagnosis is different, and the implied fixes are different too.
Why this matters beyond a benchmark
ARC-AGI-3 is intentionally novel and long-horizon — the kind of setting where you cannot win by pattern-matching to pretraining data. The foundation's argument is that score-only reporting hides this. Two models can post nearly identical near-zero numbers and yet be failing for completely different reasons, which means the engineering levers needed to push them up are not the same.
The report frames replay analysis as a missing layer in frontier evaluation: without it, labs end up tuning models against a number that does not tell them which capability is actually broken. Expect that argument to land in an already-heated debate over how much of recent frontier progress reflects genuine reasoning versus more elaborate retrieval.
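The point about score-only reporting can be illustrated with a toy tally. The sketch below uses invented replay annotations (the failure-mode labels mirror the report's three categories, but the data and function are hypothetical, not the foundation's tooling) to show how two models with near-identical scores can have very different failure-mode distributions:

```python
from collections import Counter

# Hypothetical replay annotations: (model, failure_mode) pairs.
# Labels echo the report's categories; the counts are illustrative.
replays = [
    ("gpt-5.5", "no_global_rule"), ("gpt-5.5", "no_global_rule"),
    ("gpt-5.5", "faulty_prior"),
    ("opus-4.7", "faulty_prior"), ("opus-4.7", "faulty_prior"),
    ("opus-4.7", "success_without_understanding"),
]

def failure_profile(replays, model):
    """Return each failure mode's share of one model's failed replays."""
    modes = Counter(mode for m, mode in replays if m == model)
    total = sum(modes.values())
    return {mode: count / total for mode, count in modes.items()}

for model in ("gpt-5.5", "opus-4.7"):
    print(model, failure_profile(replays, model))
```

Both models fail every replay here, so a score-only leaderboard would rank them identically, yet the profiles point at different engineering levers.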
For builders, the practical takeaway is narrower. If you are deploying GPT-5.5 or Opus 4.7 into agentic workflows that require building a model of an unfamiliar environment from first principles, the ARC Prize data is a reminder that confident outputs in those settings can be exactly the wrong signal to trust.