A sweeping new study published in JAMA Network Open has delivered a sobering verdict on AI's readiness for the exam room: all 21 frontier large language models tested failed to produce appropriate differential diagnoses in more than 80% of cases when working with incomplete patient information.
The research, led by Mass General Brigham's MESH Incubator, exposes a critical gap between AI's pattern-matching prowess and the nuanced clinical reasoning that physicians perform every day.
How the Study Worked
Researchers led by Arya Rao, an MD-PhD student at Harvard Medical School, presented 29 published clinical vignettes to 21 leading AI models — including ChatGPT, DeepSeek, Claude, Gemini, and Grok. Rather than handing the models complete case files, the team fed information in the sequence a real doctor would encounter it: first the patient's age, gender, and symptoms, then physical examination findings, and finally laboratory results and imaging.
This stepwise approach mirrors how diagnoses actually unfold in clinical settings, where physicians must generate a working list of candidate conditions — known as a differential diagnosis — long before all test results arrive.
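To make the protocol concrete, here is a minimal sketch of what a staged evaluation loop of this kind might look like in Python. It is illustrative only: the stage labels, prompt wording, and `query_model` helper are assumptions for the sake of the example, not the study's actual pipeline.

```python
# Minimal sketch of a staged diagnostic evaluation loop (illustrative only;
# the stage names, prompt wording, and query_model helper are assumptions,
# not the study's actual benchmark code).

STAGES = ["history", "physical_exam", "labs_and_imaging"]

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to the LLM under test."""
    raise NotImplementedError("wire this to a real model API")

def evaluate_vignette(model_name: str, vignette: dict[str, str]) -> dict[str, str]:
    """Reveal the case one stage at a time, asking for an updated
    differential diagnosis after each new piece of information."""
    revealed: list[str] = []
    answers: dict[str, str] = {}
    for stage in STAGES:
        revealed.append(vignette[stage])
        prompt = (
            "Based only on the information so far, list a ranked "
            "differential diagnosis:\n\n" + "\n\n".join(revealed)
        )
        answers[stage] = query_model(model_name, prompt)
    return answers
```

The key design point is that each prompt contains only the information revealed so far, forcing the model to reason under the same uncertainty a physician faces at that stage of a workup.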
The Diagnostic Paradox
The results revealed a striking paradox. When given complete clinical information, the best-performing models achieved over 90% accuracy on final diagnoses. But during the critical early stages, when a differential diagnosis matters most for guiding the right tests and treatments, every model tested failed to produce an appropriate differential more than 80% of the time.
PrIME-LLM scores across models ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5, indicating that while some models perform better than others, none have cracked the challenge of reasoning under uncertainty.
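The composite PrIME-LLM score is not broken down here, but the general shape of staged scoring is easy to illustrate. The toy aggregation below is an assumption for illustration, not the paper's formula: it supposes each stage's differential is graded from 0.0 to 1.0 against an expert reference and then averaged, so a strong final diagnosis cannot hide weak early-stage reasoning.

```python
# Toy staged-scoring aggregation (an assumption for illustration;
# NOT the PrIME-LLM formula, which is not specified here).

def staged_score(per_stage_grades: dict[str, float]) -> float:
    """Average per-stage grades (each 0.0-1.0) into a 0-100 score."""
    grades = list(per_stage_grades.values())
    return 100 * sum(grades) / len(grades)

# A model that nails the final diagnosis but fumbles the early stages
# still lands in the low 70s, echoing the paradox in the results.
print(staged_score({
    "history": 0.50,
    "physical_exam": 0.70,
    "labs_and_imaging": 0.95,
}))  # -> 71.66...
```

Under any aggregation of this general shape, near-perfect final-stage accuracy and sub-80% composite scores can coexist, which is exactly the pattern the study reports.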
"Differential diagnoses are central to clinical reasoning and underlie the 'art of medicine' that AI cannot currently replicate," said Dr. Marc Succi, Executive Director of the MESH Incubator and the study's corresponding author.
Why This Matters
The findings carry significant implications as healthcare systems worldwide race to integrate AI tools into clinical workflows. AI-powered diagnostic assistants are already being marketed to hospitals and clinics, and patients increasingly turn to chatbots like ChatGPT for medical advice.
"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said lead author Rao. That early-stage reasoning is precisely where diagnostic errors are most consequential — missed or delayed diagnoses remain a leading cause of patient harm.
The Path Forward
The study's authors stopped short of dismissing AI's role in medicine entirely. They noted that recent models show incremental improvement over older versions, suggesting the technology is advancing. However, they concluded that current large language models require a "human in the loop" — physician oversight remains essential for safe clinical deployment.
The message is clear: AI should augment physician reasoning, not replace it. Until models can navigate the ambiguity inherent in early-stage diagnosis, the stethoscope stays firmly in human hands.



