Research

Harvard Study: OpenAI's o1 Outperforms ER Doctors on Diagnosis Accuracy

Michael Ouroumis · 3 min read

A new Harvard Medical School study published in Science on April 30 — and amplified by tech and medical outlets over the weekend — found that OpenAI's o1 reasoning model matched or outperformed attending physicians across a battery of emergency department diagnostic and management tasks. The most striking gap appeared at the moment doctors typically have the least information to work with: the initial triage decision.

The research, led by senior authors Arjun Manrai (Harvard Medical School) and Adam Rodman (Beth Israel Deaconess Medical Center), with co-first authors Thomas Buckley and Peter Brodeur, evaluated 76 real emergency department cases drawn directly from electronic health records at a Boston hospital. The team tested the model and two attending internal medicine physicians at three sequential touchpoints — triage, first physician contact, and admission to a medical floor or ICU — then had two additional doctors blindly grade the assessments.

A widening gap at the front door of the ER

At initial triage, o1 produced the exact diagnosis, or one very close to it, in 67% of cases. The two attending physicians scored 55% and 50% on the same task. Across the later stages, the AI was reported as performing "nominally better than or on par" with the human clinicians, with a particular edge on rare-disease diagnoses and management reasoning questions such as antibiotic selection and end-of-life care decisions.

The study is part of a broader pattern of clinical reasoning evaluations that have struggled to keep up with frontier model capabilities. "Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we can't track progress anymore because we're already at the ceiling," Brodeur, an HMS clinical fellow in medicine at Beth Israel Deaconess, said in materials accompanying the release.

The authors' caution undercuts the headline

Despite the favorable numbers, the researchers were unusually direct about what the work does not show. The paper does not claim o1 is ready for autonomous use in clinical settings, and the team explicitly called for prospective trials in real-world patient care before any deployment. Rodman pushed back on the inevitable framing in interviews, saying the results do not support removing physicians from the decision-making process.

There are also methodological caveats worth noting. The model was given only the text-based information available to clinicians at each stage — no imaging, no in-person assessment, no live conversation with the patient. And at least one outside emergency physician noted that the human comparators were internal medicine attendings rather than ER specialists, a distinction that could inflate the apparent gap at triage.

Why this one matters

Clinical reasoning has long been treated as one of the harder tests for general-purpose models, and prior LLM evaluations in medicine have leaned heavily on standardized vignettes that critics argued were too clean. Running a frontier model against unfiltered EHR data at multiple decision points — and beating attending physicians on the most information-poor step — is the kind of result that hospital systems, payers, and regulators are likely to cite for years, regardless of how carefully the authors hedge their conclusions.


More in Research

ARC Prize Analysis: GPT-5.5 and Opus 4.7 Share Three Systematic Reasoning Errors on ARC-AGI-3

A new ARC Prize Foundation analysis of 160 replays shows OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7 stay below 1% on ARC-AGI-3 because of three recurring failure modes — and they fail differently.

1 day ago · 3 min read

MIT's FTTE Cuts Federated Learning Time 81%, Brings AI Training to Smartwatches and Sensors

MIT CSAIL's Federated Tiny Training Engine reports 81% faster training, 80% less on-device memory, and 69% smaller communication payloads — putting privacy-preserving AI training within reach of small edge hardware.

2 days ago · 3 min read

MIT's EnergAIzer Predicts AI Power Use in Seconds, Cuts Wasted Energy in Data Centers

MIT and the MIT-IBM Watson AI Lab unveiled EnergAIzer, a tool that estimates how much electricity an AI workload will consume on a given GPU in seconds rather than hours, with about 8% error.

4 days ago · 2 min read