Google DeepMind's AlphaProof Nexus has generated machine-verified proofs for 9 of 353 open problems in the Erdős catalog, two of which had stood unsolved for 56 years, alongside 44 of 492 open conjectures from the On-Line Encyclopedia of Integer Sequences (OEIS). According to the team's arXiv preprint (2605.22763), posted May 21, each solution cost only a few hundred dollars in inference, and every proof was checked in the Lean formal proof assistant rather than left to human referees.
Gemini 3.1 Pro inside a Lean verification loop
The system pairs Gemini 3.1 Pro with Lean, compiling each candidate argument into a machine-checkable formal proof. That design choice is the headline for practitioners: instead of emitting natural-language arguments that need expensive expert review (and that can hide subtle errors), the agent counts a problem as solved only once Lean confirms that the proof type-checks. DeepMind has published the Lean artifacts and selected natural-language write-ups in a public GitHub repository, so anyone can re-run the checker and verify the results independently.
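To make "type-checks" concrete, here is a toy Lean 4 statement and proof. It is far simpler than anything in the DeepMind repository and is not taken from it; the theorem name toy_add_comm is hypothetical, while Nat.add_comm is a lemma in Lean's core library.

```lean
-- Toy illustration only; not from the AlphaProof Nexus artifacts.
-- The statement after the colon is a type; the term after := must inhabit it.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the term after := did not actually prove the statement, Lean would reject the file with an error. That pass-or-fail outcome is the signal a verification loop of this kind can build on.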
DeepMind veteran David Silver, who led the lab's reinforcement-learning research through January 2026, has framed mathematics as an ideal proving ground for this approach because it is fully digital, self-verifying, and amenable to experience-driven improvement loops. The formal checker supplies an unambiguous reward signal that most agentic domains lack.
A pointed contrast with OpenAI
The release landed roughly a day after OpenAI publicized its own Erdős result, and the framing was deliberate. OpenAI's earlier claim drew criticism that its model had surfaced existing references to already-solved problems rather than constructing anything new; Demis Hassabis reportedly called that episode "embarrassing." DeepMind's pitch is that Lean-checked novelty removes the ambiguity — a proof either compiles or it does not.
Why the verification loop matters for builders
The technical lesson generalizes well beyond number theory. The bottleneck in agentic reasoning is rarely generating plausible output; it is trusting it. AlphaProof Nexus shows what happens when you wrap a frontier model in a hard, automated verifier: the system can grind through hundreds of attempts, discard the ones that fail the checker, and ship only what is provably correct — at a cost low enough, a few hundred dollars per result, to run at scale.
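The generate-then-verify pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepMind's implementation: first_verified, propose_factor_pairs, and check_factorization are hypothetical names, the "model" is a stand-in that enumerates guesses, and the "verifier" is a trivial arithmetic check rather than Lean.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def first_verified(
    candidates: Iterable[T],
    verify: Callable[[T], bool],
    max_attempts: int,
) -> Optional[T]:
    """Return the first candidate the verifier accepts, or None if the budget runs out."""
    for attempt, candidate in enumerate(candidates):
        if attempt >= max_attempts:
            break  # attempt budget exhausted; report failure instead of guessing
        if verify(candidate):
            return candidate  # only verified output is ever returned
    return None


def propose_factor_pairs(n: int) -> Iterator[Tuple[int, int]]:
    """Hypothetical stand-in for a model: emits guesses, most of them wrong."""
    for a in range(2, n):
        yield (a, n // a)


def check_factorization(n: int) -> Callable[[Tuple[int, int]], bool]:
    """Hypothetical stand-in for a formal checker: accepts only exact nontrivial factorizations."""
    def verify(pair: Tuple[int, int]) -> bool:
        a, b = pair
        return a > 1 and b > 1 and a * b == n
    return verify


result = first_verified(propose_factor_pairs(91), check_factorization(91), max_attempts=100)
print(result)
```

The design point is that the generator is never trusted. A wrong guess costs one attempt, and when the budget runs out the loop returns None rather than an unverified answer.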
For teams building agents in any domain with a formal oracle — theorem provers, type systems, test suites, constraint solvers, query planners — the takeaway is that verifier-in-the-loop architectures are now producing research-grade output, not toy demos. The economics reframe inference spend, too: a few hundred dollars to settle a 56-year-old open problem is a strong argument for pointing compute at verifiable reasoning rather than open-ended generation.



