Harvard Science Study Finds AI Model Outperforms Human Doctors in Emergency Diagnosis
A Harvard‑led study published in Science evaluated OpenAI’s o1‑preview model across six rigorous clinical benchmarks and real‑world emergency cases and found that it surpassed seasoned physicians in diagnostic accuracy: its differentials included the correct diagnosis in 78.3% of cases, accuracy rose to 97.9% under a broader criterion, and it outperformed GPT‑4 by a large margin.
Historical benchmark for clinical reasoning
Since 1959 scientists have called for case‑based benchmarks to evaluate expert‑level medical computing systems. The New England Journal of Medicine’s clinicopathological conferences have long served as the ultimate test, presenting rare, trap‑laden cases that even top specialists find difficult. Prior diagnostic generators—Bayesian systems, rule‑based engines, and symptom checkers—generally failed on these benchmarks.
Evaluation of o1‑preview on classic NEJM cases
Researchers assembled 80 high‑difficulty cases from the NEJM clinicopathological conference series and scored the model’s clinical‑reasoning documentation with the validated 10‑point Revised‑IDEA rubric. o1‑preview earned a perfect score on 78 of the 80 cases, far outpacing GPT‑4, attending physicians, and residents.
In differential‑diagnosis lists, the correct diagnosis appeared in 78.3% of cases and was ranked first in 52% of cases. When the criterion was broadened to “potentially helpful or very close,” accuracy rose to 97.9%.
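To make the reported figures concrete, the sketch below shows one way such differential‑diagnosis metrics (ranked‑first rate, inclusion rate, and the broadened criterion) could be computed from per‑case results. The data structure and field names are illustrative assumptions, not the study’s actual dataset or code.

```python
# Illustrative sketch (not the study's code) of how the reported
# differential-diagnosis metrics could be computed from per-case results.
# The data structure and field names are hypothetical.

def differential_metrics(cases):
    """Each case is a dict with:
       'differential': ranked list of candidate diagnoses (strings),
       'truth':        the final NEJM diagnosis,
       'close_match':  True if judges rated the list "potentially helpful
                       or very close" even without an exact match."""
    n = len(cases)
    ranked_first = sum(1 for c in cases
                       if c['differential'] and c['differential'][0] == c['truth'])
    included = sum(1 for c in cases if c['truth'] in c['differential'])
    broadened = sum(1 for c in cases
                    if c['truth'] in c['differential'] or c['close_match'])
    return {
        'ranked_first': ranked_first / n,   # reported as 52%
        'included': included / n,           # reported as 78.3%
        'broadened': broadened / n,         # reported as 97.9%
    }

# Toy example with two made-up cases:
print(differential_metrics([
    {'differential': ['sarcoidosis', 'lymphoma'], 'truth': 'sarcoidosis', 'close_match': True},
    {'differential': ['lupus'], 'truth': 'adult-onset Still disease', 'close_match': True},
]))
```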
On a separate set of 70 historical cases, GPT‑4 achieved 72.9% accuracy, whereas o1‑preview reached 88.6%.
For next‑step diagnostic‑test selection, o1‑preview chose the correct investigation in 87.5% of 136 challenging cases; a further 11% of its suggestions were judged helpful, and only 1.5% were considered unhelpful.
Blind test in a real emergency department
Seventy‑six randomly selected emergency‑room records from Beth Israel Deaconess Medical Center were evaluated by o1‑preview, GPT‑4o, and two senior internal‑medicine physicians. Two additional senior physicians acted as blind judges, scoring the reports without knowing their source.
Judges could not distinguish AI‑generated from human‑generated reports in 94.4% of cases.
Accuracy at three clinical touchpoints for o1‑preview was 67.1% (triage), 72.4% (physician assessment), and 81.6% (admission decision), consistently higher than the two human physicians, especially in the information‑scarce triage stage.
Probability estimation study
A nationwide sample of 553 clinicians estimated disease probability before and after diagnostic testing. Individual estimates varied wildly, revealing large inconsistency. AI‑generated probability estimates were markedly more stable and aligned closely with literature‑derived reference values, particularly for myocardial‑ischemia testing.
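As a rough illustration of the consistency comparison described above, the sketch below contrasts the spread of several clinicians’ estimates with repeated model estimates against a literature‑derived reference value. All numbers and variable names are made up for illustration; this is not the study’s data or analysis code.

```python
# Hypothetical illustration of comparing estimate consistency:
# many clinicians' pre-test probability estimates versus repeated
# model estimates, each measured against a reference value.
from statistics import mean, stdev

reference = 0.10                                      # assumed literature-derived pre-test probability
clinicians = [0.02, 0.05, 0.10, 0.25, 0.40, 0.60]     # made-up survey responses
model_runs = [0.09, 0.10, 0.11, 0.10, 0.12, 0.10]     # made-up repeated model estimates

for label, estimates in [("clinicians", clinicians), ("model", model_runs)]:
    print(f"{label:10s} mean={mean(estimates):.2f} "
          f"spread(sd)={stdev(estimates):.2f} "
          f"abs error vs reference={abs(mean(estimates) - reference):.2f}")
```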
Implications
The extensive testing demonstrates that modern large‑language models can robustly handle complex, unstructured clinical text and outperform human experts on multiple diagnostic and management tasks. The experiments were, however, limited to textual data; real‑world encounters involve multimodal cues such as patient vocalizations, breathing patterns, and imaging nuances that current models cannot yet perceive.