Research

Stanford Study: AI Tutoring Doubled Student Test Scores in Six Months

Michael Ouroumis · 3 min read

A large-scale randomized controlled trial led by Stanford University's Graduate School of Education has produced the strongest evidence yet that AI tutoring systems can dramatically improve student outcomes. Students who used an AI tutor for 30 minutes daily over six months scored 2.1 times higher on standardized math assessments than a control group receiving only traditional instruction.

Study Design

The trial enrolled 4,200 middle school students across 38 schools in California, Texas, and Georgia. Half were randomly assigned to use Khanmigo, Khan Academy's AI tutor powered by GPT-4, for 30 minutes daily during a dedicated study period. The other half spent the same time on traditional practice worksheets and textbook exercises. Both groups attended identical regular math classes.

The study ran for the full 2025-2026 academic year, with assessments at baseline, three months, and six months. Researchers controlled for socioeconomic status, prior academic performance, school quality, and teacher experience.

Key Findings

The headline result is a 2.1x improvement in standardized math scores for the AI tutoring group compared to the control group after six months. But the details are more nuanced and arguably more interesting.

The effect was strongest among students who entered the study with below-average math scores. These students showed a 2.8x improvement, suggesting that AI tutoring is particularly effective for students who are behind. Students who started at or above grade level showed a more modest 1.4x improvement.

Engagement was another surprise. Students in the AI group completed 40% more practice problems than the control group, even though they were given the same amount of time. The researchers attribute this to the AI tutor's adaptive difficulty — it kept students in a productive challenge zone rather than giving them problems that were too easy or too hard.

How the AI Tutor Works

Khanmigo does not simply present problems and check answers. It uses Socratic questioning — asking students to explain their reasoning, identify errors, and work through misconceptions step by step. When a student is stuck, it provides hints rather than answers. When a student makes a systematic error, it identifies the underlying misconception and addresses it directly.

The system adapts in real time. If a student masters a concept quickly, it accelerates. If they struggle, it breaks the material into smaller steps and provides additional scaffolding.
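The study does not publish Khanmigo's internal logic, but the real-time adaptation described above can be illustrated with a minimal, hypothetical sketch: the tutor tracks a student's recent accuracy and moves them up or down a difficulty ladder to keep them in a productive challenge zone (the function name, level scale, and 60-80% target band are assumptions for illustration, not details from the study).

```python
def update_difficulty(level, recent_results, target_low=0.6, target_high=0.8):
    """Return the next difficulty level given recent answer correctness.

    recent_results is a list of 1s (correct) and 0s (incorrect) for the
    student's last few problems. The target band approximates a
    "productive challenge zone": hard enough to stretch, easy enough
    to sustain progress.
    """
    if not recent_results:
        return level
    accuracy = sum(recent_results) / len(recent_results)
    if accuracy > target_high:
        # Mastering the concept quickly: accelerate to harder material.
        return level + 1
    if accuracy < target_low:
        # Struggling: step back and break material into smaller pieces.
        return max(0, level - 1)
    # In the challenge zone: hold steady.
    return level
```

A real system would combine signals like hint usage, response time, and error patterns rather than raw accuracy alone, but the core loop (measure, compare to a target band, adjust) is the same idea.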

Limitations

The researchers are careful to note limitations. The study measured math performance only; whether AI tutoring works as well in other subjects remains untested. The 30-minute daily commitment required dedicated school time that many districts may struggle to provide. And the six-month study period does not show whether the gains persist long-term.

Implications

The findings arrive as school districts across the country debate AI adoption. Several states have banned AI tools in classrooms, while others are piloting them aggressively. This study provides the most rigorous evidence to date that AI tutoring, when implemented as a supplement to human teaching, can produce meaningful improvements in student outcomes.

Khan Academy announced that it will make the study's full dataset available for independent replication.

