Twitter/X

Author @kimmonismus asserts it is unethical for a doctor not to consult an AI…

Brief

Author @kimmonismus argues that a Science paper (published 2026-05-10) shows OpenAI's o1 model, already over a year old, reached the correct or near-correct diagnosis 67% of the time versus 50–55% for ER doctors, with the widest gap in early triage. The model also scored near-perfect on clinical reasoning in structured cases, and the diagnostic comparison used real, messy ER data. Limitations: imaging was not tested, only short ER encounters were covered, and no patient-outcome benefit has been demonstrated.

Why it matters

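Author @kimmonismus draws an ethical conclusion from the result: if a model that is over a year old already beats ER physicians at diagnosis (67% versus 50–55% in the Science study published 2026-05-10), then it is unethical for a doctor not to consult an AI. The post stresses that the gap was widest where mistakes are most dangerous: early in the ER process, when doctors have incomplete information and are under time pressure.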

Key details

  • The post states o1 scored near-perfect on clinical reasoning in structured cases, far ahead of attending physicians, and that its diagnostic lead over ER physicians was largest during early triage, when information is limited; it describes the study as one of the first to test an LLM on real, messy ER data rather than curated textbook cases.
  • Limitations noted: the study covered only short ER encounters, did not evaluate imaging (scans, X-rays), and did not show improved patient outcomes; the author emphasizes that o1 is over a year old, argues that current models are likely even better, and bets they will also outperform doctors on the untested cases.

Source evidence

It is unethical for a doctor not to consult an AI!

A new study published in Science shows OpenAI's o1 model (not 5.5, but the over 1 year old o1!) outperformed ER physicians at diagnosing patients, identifying the correct or near-correct diagnosis 67% of the time versus 50–55% for doctors, especially in early triage when information is limited.

The model also scored near-perfect on clinical reasoning in structured cases, far ahead of attending physicians.

Again: a model over 1 year old, which is ages in the times of AI.

This is one of the first studies testing an LLM against real, messy ER data rather than curated textbook cases. The performance gap was widest exactly where mistakes are most dangerous, early in the ER process when doctors have incomplete information and are under time pressure.

And the model tested (o1) is already outdated by AI standards, meaning current models are likely even better.

The study only covered short ER encounters, not longer hospitalizations with days of accumulating data. It also didn't test the model on imaging (scans, X-rays), which is central to many real diagnoses. The next step is proving these systems actually improve patient outcomes in practice, not just in controlled comparisons. But I bet the models will also outperform human doctors on such cases.