A large-scale randomized controlled trial led by Stanford University's Graduate School of Education has produced the strongest evidence yet that AI tutoring systems can dramatically improve student outcomes. Students who used an AI tutor for 30 minutes daily over six months scored 2.1 times higher on standardized math assessments than a control group receiving only traditional instruction.
Study Design
The trial enrolled 4,200 middle school students across 38 schools in California, Texas, and Georgia. Half were randomly assigned to use Khanmigo, Khan Academy's AI tutor powered by GPT-4, for 30 minutes daily during a dedicated study period. The other half spent the same time on traditional practice worksheets and textbook exercises. Both groups attended identical regular math classes.
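As an illustration of the assignment step, randomization stratified within each school would keep both arms balanced across campuses. A minimal sketch in Python (the study's actual randomization procedure is not detailed here):

```python
import random
from collections import defaultdict

def assign_students(students, seed=42):
    """Randomly split students into treatment and control within each school.

    `students` is a list of (student_id, school_id) pairs. Stratifying by
    school ensures every campus contributes students to both arms.
    """
    rng = random.Random(seed)
    by_school = defaultdict(list)
    for student_id, school_id in students:
        by_school[school_id].append(student_id)

    treatment, control = [], []
    for roster in by_school.values():
        rng.shuffle(roster)
        half = len(roster) // 2
        treatment.extend(roster[:half])
        control.extend(roster[half:])
    return treatment, control

# Example: 4,200 students spread across 38 schools
students = [(f"s{i}", f"school{i % 38}") for i in range(4200)]
treatment, control = assign_students(students)
print(len(treatment), len(control))  # roughly 2,100 in each arm
```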
The study ran for the full 2025-2026 academic year, with assessments at baseline, three months, and six months. Researchers controlled for socioeconomic status, prior academic performance, school quality, and teacher experience.
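Controlling for factors like these is typically done by regressing outcome scores on a treatment indicator plus the covariates. A hedged sketch using pandas and statsmodels, with synthetic data and hypothetical column names (the paper's actual model specification is not given here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; column names are illustrative, not the study's.
rng = np.random.default_rng(0)
n = 4200
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),         # 1 = AI tutoring arm
    "baseline_score": rng.normal(50, 10, n),  # prior academic performance
    "ses": rng.normal(0, 1, n),               # socioeconomic status index
    "school_quality": rng.normal(0, 1, n),
    "teacher_years": rng.integers(1, 30, n),
})
df["score_6mo"] = df["baseline_score"] + 8 * df["treated"] + rng.normal(0, 5, n)

# The coefficient on `treated` estimates the effect of AI tutoring while
# holding the listed covariates fixed.
model = smf.ols(
    "score_6mo ~ treated + baseline_score + ses + school_quality + teacher_years",
    data=df,
).fit()
print(model.params["treated"])
```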
Key Findings
The headline result is a 2.1x improvement in standardized math scores for the AI tutoring group compared to the control group after six months. But the details are more nuanced and arguably more interesting.
The effect was strongest among students who entered the study with below-average math scores. These students showed a 2.8x improvement, suggesting that AI tutoring is particularly effective for students who are behind. Students who started at or above grade level showed a more modest 1.4x improvement.
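Those subgroup numbers square with the headline figure: assuming the two subgroups were roughly equal in size (an assumption of mine, not stated in the study), the average of 2.8x and 1.4x works out to exactly 2.1x:

```python
# Hypothetical equal split between below-grade-level and at/above-grade-level
# students; the study does not report exact subgroup sizes.
below_share, below_effect = 0.5, 2.8
above_share, above_effect = 0.5, 1.4

overall = below_share * below_effect + above_share * above_effect
print(round(overall, 2))  # 2.1, matching the headline result
```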
Engagement was another surprise. Students in the AI group completed 40% more practice problems than the control group, even though they were given the same amount of time. The researchers attribute this to the AI tutor's adaptive difficulty — it kept students in a productive challenge zone rather than giving them problems that were too easy or too hard.
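The study does not describe the difficulty logic itself, but the general mechanism can be sketched with a simple up-down staircase: harder after a correct answer, easier after a miss, which keeps accuracy (and thus challenge) hovering near a steady middle band rather than at either extreme:

```python
def next_difficulty(current, was_correct, step=1, lo=1, hi=10):
    """Simple up-down staircase: raise difficulty after a correct answer,
    lower it after a miss. This rule converges toward ~50% accuracy,
    holding the student in a challenge zone."""
    if was_correct:
        return min(current + step, hi)
    return max(current - step, lo)

# Example session: difficulty tracks the student's recent performance
difficulty = 5
for correct in [True, True, False, True, False, False, True]:
    difficulty = next_difficulty(difficulty, correct)
    print(difficulty)  # 6, 7, 6, 7, 6, 5, 6
```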
How the AI Tutor Works
Khanmigo does not simply present problems and check answers. It uses Socratic questioning — asking students to explain their reasoning, identify errors, and work through misconceptions step by step. When a student is stuck, it provides hints rather than answers. When a student makes a systematic error, it identifies the underlying misconception and addresses it directly.
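As a rough sketch of that interaction pattern (not Khanmigo's actual implementation), a hint ladder escalates from open reflective questions toward increasingly specific hints while stopping short of the answer:

```python
# Hypothetical hint ladder illustrating the Socratic pattern described above:
# escalate from open questions toward targeted hints, never reveal the answer.
HINT_LADDER = [
    "Can you explain what the problem is asking in your own words?",
    "What have you tried so far, and where does it stop working?",
    "Look at your last step: does the operation you used match the one the problem needs?",
    "Hint: try isolating the variable before combining like terms.",
]

def next_prompt(attempts_stuck):
    """Return the next Socratic prompt; deeper stuckness earns a more
    specific hint, but the final answer is never given outright."""
    index = min(attempts_stuck, len(HINT_LADDER) - 1)
    return HINT_LADDER[index]

for stuck in range(5):
    print(next_prompt(stuck))
```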
The system adapts in real time. If a student masters a concept quickly, it accelerates. If they struggle, it breaks the material into smaller steps and provides additional scaffolding.
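One well-established technique for this kind of real-time mastery tracking is Bayesian knowledge tracing, which updates the probability that a student has learned a skill after each response. Whether Khanmigo uses it is not stated, so the following is a generic sketch:

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """Bayesian knowledge tracing: compute the posterior probability that
    the student knows the skill given one observed response, then apply
    the chance they learned it during this step."""
    if correct:
        evidence = p_know * (1 - slip)
        posterior = evidence / (evidence + (1 - p_know) * guess)
    else:
        evidence = p_know * slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - slip))
    # Learning transition: a student who didn't know it yet may learn now.
    return posterior + (1 - posterior) * learn

# A struggling-then-recovering sequence: the mastery estimate dips, then climbs.
p = 0.3
for correct in [False, False, True, True, True]:
    p = bkt_update(p, correct)
    print(round(p, 3))

# A tutor could accelerate once p crosses a mastery threshold (say 0.95)
# and insert smaller scaffolded steps whenever it falls below a floor.
```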
Limitations
The researchers are careful to note limitations. The study measured math performance only and did not assess other subjects, where AI tutoring may be less effective. The 30-minute daily commitment required dedicated school time that many districts may struggle to provide. And six months of data cannot show whether the gains persist long-term.
Implications
The findings arrive as school districts across the country debate AI adoption. Several states have banned AI tools in classrooms, while others are piloting them aggressively. This study provides the most rigorous evidence to date that AI tutoring, when implemented as a supplement to human teaching, can produce meaningful improvements in student outcomes.
Khan Academy announced that it will make the study's full dataset available for independent replication.