Research

Harvard Study: OpenAI's o1 Outperforms ER Doctors on Diagnosis Accuracy

Michael Ouroumis · 3 min read

A new Harvard Medical School study published in Science on April 30 — and amplified by tech and medical outlets over the weekend — found that OpenAI's o1 reasoning model matched or outperformed attending physicians across a battery of emergency department diagnostic and management tasks. The most striking gap appeared at the moment doctors typically have the least information to work with: the initial triage decision.

The research, led by senior authors Arjun Manrai (Harvard Medical School) and Adam Rodman (Beth Israel Deaconess Medical Center), with co-first authors Thomas Buckley and Peter Brodeur, evaluated 76 real emergency department cases drawn directly from electronic health records at a Boston hospital. The team tested the model and two attending internal medicine physicians at three sequential touchpoints — triage, first physician contact, and admission to a medical floor or ICU — then had two additional doctors blindly grade the assessments.

A widening gap at the front door of the ER

At initial triage, o1 produced the exact diagnosis, or one very close to it, in 67% of cases. The two attending physicians scored 55% and 50% on the same task. Across the later stages, the AI was reported as performing "nominally better than or on par" with the human clinicians, with a particular edge on rare-disease diagnoses and management reasoning questions such as antibiotic selection and end-of-life care decisions.

The study is part of a broader pattern of clinical reasoning evaluations that have struggled to keep up with frontier model capabilities. "Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we can't track progress anymore because we're already at the ceiling," Brodeur, an HMS clinical fellow in medicine at Beth Israel Deaconess, said in materials accompanying the release.

The authors' caution undercuts the headline

Despite the favorable numbers, the researchers were unusually direct about what the work does not show. The paper does not claim o1 is ready for autonomous use in clinical settings, and the team explicitly called for prospective trials in real-world patient care before any deployment. Rodman pushed back on the inevitable framing in interviews, saying the results do not support removing physicians from the decision-making process.

There are also methodological caveats worth noting. The model was given only the text-based information available to clinicians at each stage — no imaging, no in-person assessment, no live conversation with the patient. And at least one outside emergency physician noted that the human comparators were internal medicine attendings rather than ER specialists, a distinction that could inflate the apparent gap at triage.

Why this one matters

Clinical reasoning has long been treated as one of the harder tests for general-purpose models, and prior LLM evaluations in medicine have leaned heavily on standardized vignettes that critics argued were too clean. Running a frontier model against unfiltered EHR data at multiple decision points — and beating attending physicians on the most information-poor step — is the kind of result that hospital systems, payers, and regulators are likely to cite for years, regardless of how carefully the authors hedge their conclusions.


More in Research

ARC Prize Analysis: GPT-5.5 and Opus 4.7 Share Three Systematic Reasoning Errors on ARC-AGI-3

A new ARC Prize Foundation analysis of 160 replays shows OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7 stay below 1% on ARC-AGI-3 because of three recurring failure modes — and they fail differently.

1 day ago · 3 min read

MIT's FTTE Cuts Federated Learning Time 81%, Brings AI Training to Smartwatches and Sensors

MIT CSAIL's Federated Tiny Training Engine reports 81% faster training, 80% less on-device memory, and 69% smaller communication payloads — putting privacy-preserving AI training within reach of small edge hardware.

2 days ago · 3 min read

MIT's EnergAIzer Predicts AI Power Use in Seconds, Cuts Wasted Energy in Data Centers

MIT and the MIT-IBM Watson AI Lab unveiled EnergAIzer, a tool that estimates how much electricity an AI workload will consume on a given GPU in seconds rather than hours, with about 8% error.

4 days ago · 2 min read