A new Harvard Medical School study published in Science on April 30 — and amplified by tech and medical outlets over the weekend — found that OpenAI's o1 reasoning model matched or outperformed attending physicians across a battery of emergency department diagnostic and management tasks. The most striking gap appeared at the moment doctors typically have the least information to work with: the initial triage decision.
The research, led by senior authors Arjun Manrai (Harvard Medical School) and Adam Rodman (Beth Israel Deaconess Medical Center), with co-first authors Thomas Buckley and Peter Brodeur, evaluated 76 real emergency department cases drawn directly from electronic health records at a Boston hospital. The team tested the model and two attending internal medicine physicians at three sequential touchpoints: triage, first physician contact, and admission to a medical floor or ICU. Two additional doctors then graded the assessments, blinded to whether each came from the model or a physician.
A widening gap at the front door of the ER
At initial triage, o1 produced the exact diagnosis or a very close match in 67% of cases; the two attending physicians scored 55% and 50% on the same task. Across the later stages, the model performed "nominally better than or on par" with the human clinicians, per the paper, with a particular edge on rare-disease diagnoses and on management reasoning questions such as antibiotic selection and end-of-life care decisions.
The study arrives amid a broader problem: evaluations of clinical reasoning have struggled to keep pace with frontier model capabilities. "Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100%, and we can't track progress anymore because we're already at the ceiling," Brodeur, an HMS clinical fellow in medicine at Beth Israel Deaconess, said in materials accompanying the release.
The authors' caution undercuts the headline
Despite the favorable numbers, the researchers were unusually direct about what the work does not show. The paper does not claim o1 is ready for autonomous use in clinical settings, and the team explicitly called for prospective trials in real-world patient care before any deployment. In interviews, Rodman pushed back on the inevitable replacement framing, saying the results do not support removing physicians from the decision-making process.
There are also methodological caveats worth noting. The model was given only the text-based information available to clinicians at each stage — no imaging, no in-person assessment, no live conversation with the patient. And at least one outside emergency physician noted that the human comparators were internal medicine attendings rather than ER specialists, a distinction that could inflate the apparent gap at triage.
Why this one matters
Clinical reasoning has long been treated as one of the harder tests for general-purpose models, and prior LLM evaluations in medicine have leaned heavily on standardized vignettes that critics argued were too clean. Running a frontier model against unfiltered EHR data at multiple decision points — and beating attending physicians on the most information-poor step — is the kind of result that hospital systems, payers, and regulators are likely to cite for years, regardless of how carefully the authors hedge their conclusions.