A sweeping new study published in JAMA Network Open has delivered a sobering verdict on AI's readiness for the exam room: all 21 frontier large language models tested failed to produce appropriate differential diagnoses in more than 80% of cases when working with incomplete patient information.
The research, led by Mass General Brigham's MESH Incubator, exposes a critical gap between AI's pattern-matching prowess and the nuanced clinical reasoning that physicians perform every day.
How the Study Worked
Researchers led by Arya Rao, an MD-PhD student at Harvard Medical School, presented 29 published clinical vignettes to 21 leading AI models — including ChatGPT, DeepSeek, Claude, Gemini, and Grok. Rather than handing the models complete case files, the team fed information in the sequence a real doctor would encounter it: first the patient's age, gender, and symptoms, then physical examination findings, and finally laboratory results and imaging.
This stepwise approach mirrors how diagnoses actually unfold in clinical settings, where physicians must generate a working list of candidate conditions — known as a differential diagnosis — long before all test results arrive.
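To make the protocol concrete, here is a minimal sketch of what a staged evaluation loop of this kind might look like in Python. It is illustrative only: the stage labels, prompt wording, and `query_model` helper are assumptions for the sake of the example, not the study's actual pipeline.

```python
# Minimal sketch of a staged diagnostic evaluation loop (illustrative only;
# the stage names, prompt wording, and query_model helper are assumptions,
# not the study's actual benchmark code).

STAGES = ["history", "physical_exam", "labs_and_imaging"]

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to the LLM under test."""
    raise NotImplementedError("wire this to a real model API")

def evaluate_vignette(model_name: str, vignette: dict[str, str]) -> dict[str, str]:
    """Reveal the case one stage at a time, asking for an updated
    differential diagnosis after each new piece of information."""
    revealed: list[str] = []
    answers: dict[str, str] = {}
    for stage in STAGES:
        revealed.append(vignette[stage])
        prompt = (
            "Based only on the information so far, list a ranked "
            "differential diagnosis:\n\n" + "\n\n".join(revealed)
        )
        answers[stage] = query_model(model_name, prompt)
    return answers
```

The key design point is that each prompt contains only the information revealed so far, forcing the model to reason under the same uncertainty a physician faces at that stage of a workup.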
The Diagnostic Paradox
The results revealed a striking paradox. When given complete clinical information, the best-performing models achieved over 90% accuracy on final diagnoses. But during the critical early stages, when a differential diagnosis matters most for guiding the right tests and treatments, every model tested failed to produce an appropriate differential more than 80% of the time.
PrIME-LLM scores across models ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5, indicating that while some models perform better than others, none have cracked the challenge of reasoning under uncertainty.
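The composite PrIME-LLM score is not broken down here, but the general shape of staged scoring is easy to illustrate. The toy aggregation below is an assumption for illustration, not the paper's formula: it supposes each stage's differential is graded from 0.0 to 1.0 against an expert reference and then averaged, so a strong final diagnosis cannot hide weak early-stage reasoning.

```python
# Toy staged-scoring aggregation (an assumption for illustration;
# NOT the PrIME-LLM formula, which is not specified here).

def staged_score(per_stage_grades: dict[str, float]) -> float:
    """Average per-stage grades (each 0.0-1.0) into a 0-100 score."""
    grades = list(per_stage_grades.values())
    return 100 * sum(grades) / len(grades)

# A model that nails the final diagnosis but fumbles the early stages
# still lands in the low 70s, echoing the paradox in the results.
print(staged_score({
    "history": 0.50,
    "physical_exam": 0.70,
    "labs_and_imaging": 0.95,
}))  # -> 71.66...
```

Under any aggregation of this general shape, near-perfect final-stage accuracy and sub-80% composite scores can coexist, which is exactly the pattern the study reports.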
"Differential diagnoses are central to clinical reasoning and underlie the 'art of medicine' that AI cannot currently replicate," said Dr. Marc Succi, Executive Director of the MESH Incubator and the study's corresponding author.
Why This Matters
The findings carry significant implications as healthcare systems worldwide race to integrate AI tools into clinical workflows. AI-powered diagnostic assistants are already being marketed to hospitals and clinics, and patients increasingly turn to chatbots like ChatGPT for medical advice.
"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said lead author Rao. That early-stage reasoning is precisely where diagnostic errors are most consequential — missed or delayed diagnoses remain a leading cause of patient harm.
The Path Forward
The study's authors stopped short of dismissing AI's role in medicine entirely. They noted that recent models show incremental improvement over older versions, suggesting the technology is advancing. However, they concluded that current large language models require a "human in the loop" — physician oversight remains essential for safe clinical deployment.
The message is clear: AI should augment physician reasoning, not replace it. Until models can navigate the ambiguity inherent in early-stage diagnosis, the stethoscope stays firmly in human hands.



