
AI Chatbots Fail Over 80% of Early Medical Diagnoses, JAMA Study Finds

Michael Ouroumis · 2 min read

A sweeping new study published in JAMA Network Open has delivered a sobering verdict on AI's readiness for the exam room: all 21 frontier large language models tested failed to produce appropriate differential diagnoses more than 80% of the time when working with incomplete patient information.

The research, led by Mass General Brigham's MESH Incubator, exposes a critical gap between AI's pattern-matching prowess and the nuanced clinical reasoning that physicians perform every day.

How the Study Worked

Researchers led by Arya Rao, an MD-PhD student at Harvard Medical School, presented 29 published clinical vignettes to 21 leading AI models — including ChatGPT, DeepSeek, Claude, Gemini, and Grok. Rather than handing the models complete case files, the team fed information in the sequence a real doctor would encounter it: first the patient's age, gender, and symptoms, then physical examination findings, and finally laboratory results and imaging.

This stepwise approach mirrors how diagnoses actually unfold in clinical settings, where physicians must generate a working list of candidate conditions — known as a differential diagnosis — long before all test results arrive.
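The staged protocol described above can be sketched in a few lines of Python. Everything here is an illustrative assumption, not the study's actual harness: the stage names, the toy vignette, and the `stage_accuracy` scoring (a simple "is the reference diagnosis in the differential?" check) stand in for the study's PrIME-LLM rubric, and `model` is any callable mapping accumulated case text to a ranked list of candidate diagnoses.

```python
# Minimal sketch of a staged differential-diagnosis evaluation.
# Stage names, scoring, and the toy vignette are hypothetical.

STAGES = ["history", "physical_exam", "labs_imaging"]

def run_vignette(vignette, model):
    """Feed case information to the model one stage at a time,
    collecting a differential diagnosis after each stage."""
    context = []
    differentials = {}
    for stage in STAGES:
        context.append(vignette[stage])
        differentials[stage] = model("\n".join(context))
    return differentials

def stage_accuracy(results, truth):
    """Fraction of vignettes whose differential at each stage
    contains the reference diagnosis."""
    return {
        stage: sum(truth[i] in diffs[stage]
                   for i, diffs in enumerate(results)) / len(results)
        for stage in STAGES
    }

# Toy example: a "model" that only lands on the answer once
# lab data arrives, mirroring the early-stage failure pattern.
vignette = {
    "history": "45-year-old woman, fatigue and weight gain",
    "physical_exam": "dry skin, bradycardia",
    "labs_imaging": "TSH elevated, free T4 low",
}
toy_model = lambda text: (["hypothyroidism"] if "TSH" in text
                          else ["anemia", "depression"])
results = [run_vignette(vignette, toy_model)]
print(stage_accuracy(results, ["hypothyroidism"]))
```

Scoring per stage rather than only on the final answer is what exposes the paradox the study reports: a model can be nearly perfect once all data is in while still missing the diagnosis at the history-and-exam stages that drive real-world test ordering.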

The Diagnostic Paradox

The results revealed a striking paradox. When given complete clinical information, the best-performing models achieved over 90% accuracy on final diagnoses. But during the critical early stages, when a differential diagnosis matters most for guiding the right tests and treatments, every model tested failed more than 80% of the time.

PrIME-LLM scores across models ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5, indicating that while some models perform better than others, none have cracked the challenge of reasoning under uncertainty.

"Differential diagnoses are central to clinical reasoning and underlie the 'art of medicine' that AI cannot currently replicate," said Dr. Marc Succi, Executive Director of the MESH Incubator and the study's corresponding author.

Why This Matters

The findings carry significant implications as healthcare systems worldwide race to integrate AI tools into clinical workflows. AI-powered diagnostic assistants are already being marketed to hospitals and clinics, and patients increasingly turn to chatbots like ChatGPT for medical advice.

"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said lead author Rao. That early-stage reasoning is precisely where diagnostic errors are most consequential — missed or delayed diagnoses remain a leading cause of patient harm.

The Path Forward

The study's authors stopped short of dismissing AI's role in medicine entirely. They noted that recent models show incremental improvement over older versions, suggesting the technology is advancing. However, they concluded that current large language models require a "human in the loop" — physician oversight remains essential for safe clinical deployment.

The message is clear: AI should augment physician reasoning, not replace it. Until models can navigate the ambiguity inherent in early-stage diagnosis, the stethoscope stays firmly in human hands.

