JAMA study: all 21 AI models fail at early clinical reasoning more than 80% of the time
Researchers at Mass General Brigham published a JAMA Network Open study evaluating 21 large language models - including ChatGPT, Claude, Gemini, Grok, and DeepSeek - across 29 standardized clinical cases using a new evaluation tool called PrIME-LLM. Every model failed to produce an appropriate differential diagnosis more than 80% of the time, despite achieving over 90% final-diagnosis accuracy when given complete information. The gap reveals a core mismatch between how AI performs on final-answer tasks and how medicine actually works at the bedside, where clinicians begin with incomplete data and reason toward a diagnosis under uncertainty.
There's a version of this study result that sounds reassuring. Twenty-one AI models - the latest from OpenAI, Google, Anthropic, xAI, DeepSeek, and others - were tested on clinical cases. They achieved correct final diagnoses more than 90% of the time. Nine out of ten. Better than a lot of medical school exam scores. See? AI is getting good at medicine.
The study, published in JAMA Network Open on April 14, 2026 by researchers at Mass General Brigham, is actually about why that framing is dangerous.
The 90% figure is real. So is the other number: every single one of those 21 models failed to produce an appropriate differential diagnosis more than 80% of the time. Those two statistics describe the same models on the same cases, and understanding the gap between them is the point of the whole paper.
Two Different Tasks
Medicine is not a single task. It's a sequence of tasks, and they have very different requirements.
A final diagnosis question looks like this: here is a patient, age 52, presenting with chest pain radiating to the left arm, diaphoresis, and shortness of breath, with an ECG showing ST elevation in leads II, III, and aVF. What is the diagnosis? An AI model given all of that information will, 90% of the time, correctly identify ST-elevation myocardial infarction. So will a first-year medical student. That case is a pattern recognition task with all the relevant information laid out.
A differential diagnosis question looks like this: here is a patient, age 52, who came in complaining of fatigue and some shortness of breath with exertion. What are the possible diagnoses, and what should you investigate? That case is a reasoning task with incomplete information. The answer requires generating a list of plausible explanations, ranking them by likelihood and clinical significance, and identifying which tests would distinguish between them. The ECG comes later. The troponin comes later. Right now, you have fatigue and exertional dyspnea, and you need to build a framework for figuring out what's happening.
The differential diagnosis task is where medicine actually starts. It's the first step in almost every clinical encounter. And it's the step where AI falls apart.
PrIME-LLM and Why the Benchmark Matters
Lead author Arya Rao, an MD-PhD candidate at Harvard Medical School, and co-author Marc Succi, the executive director of Mass General Brigham's MESH Incubator, developed a benchmarking tool called PrIME-LLM specifically to assess models across the whole clinical reasoning pipeline rather than just the final answer.
The acronym reflects the stages: Primary (initial differential diagnosis), Investigation (appropriate test selection), Management (treatment decisions), and Evaluation (final diagnosis). Most existing AI medical benchmarks evaluate only the final diagnosis stage. That is why they report high accuracy, and it is also why those high accuracy numbers don't tell you much about clinical utility.
PrIME-LLM scores across all 21 models ranged from 64% (Gemini 1.5 Flash, the weakest performer) to 78% (Grok 4 and GPT-5, the strongest). For context: a score of 78% on a composite of clinical reasoning tasks is, by most rubrics, not a passing grade for anything you'd want making unsupervised decisions about patient care.
The study used 29 standardized clinical cases across a range of presentations. Models were given information progressively, as it would arrive in a real clinical encounter: initial complaint first, then exam findings, then lab results. The differential diagnosis failure rate - over 80% for all models - reflects performance at the earliest stage, when information is most incomplete and clinical reasoning is most necessary.
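The progressive-disclosure setup described above can be sketched in a few lines. This is a minimal illustration, not the PrIME-LLM implementation: the stage names, the `Stage` and `StageResult` classes, `run_case`, and `stub_model` are all hypothetical, and the exam and lab entries are placeholders rather than real case data. The point is only that the model is queried once per stage and never sees information that has not yet "arrived."

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str      # hypothetical label, e.g. "differential"
    new_info: str  # information released at this stage
    question: str  # what the model is asked at this point


@dataclass
class StageResult:
    stage: str
    prompt: str
    answer: str


def run_case(stages: List[Stage], model: Callable[[str], str]) -> List[StageResult]:
    """Query the model once per stage, revealing information cumulatively."""
    seen: List[str] = []
    results: List[StageResult] = []
    for stage in stages:
        seen.append(stage.new_info)
        prompt = "\n".join(seen) + "\n\nQuestion: " + stage.question
        results.append(StageResult(stage.name, prompt, model(prompt)))
    return results


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; it only reports how much it was shown."""
    return f"(stub answer based on {len(prompt)} characters of context)"


# Toy case: the complaint comes from the article's example; the later
# findings are placeholders, not real clinical data.
case = [
    Stage("differential",
          "Initial complaint: 52-year-old with fatigue and shortness of breath on exertion.",
          "What are the possible diagnoses, ranked?"),
    Stage("investigation",
          "Exam findings: [placeholder]",
          "Which tests should be ordered next?"),
    Stage("final_diagnosis",
          "Lab results: [placeholder]",
          "What is the diagnosis?"),
]

results = run_case(case, stub_model)
for r in results:
    print(r.stage, "->", r.answer)
```

A benchmark that only scores the last stage would grade just the final prompt, where everything has already been revealed. Scoring the first prompt is what exposes early-stage reasoning.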
What 80% Failure at Differential Diagnosis Means in Practice
To be precise: failing to produce an appropriate differential diagnosis doesn't mean the model gave a completely wrong answer every time. It means the model's early-stage reasoning was rated as inappropriate by the study's clinical reviewers more than 80% of the time across 29 standardized cases. The failure could be generating a list that misses the most likely or most dangerous diagnosis, omitting diagnoses that should be urgently ruled out, prioritizing rare conditions when common ones fit the presentation, or generating a framework that doesn't match the clinical picture in ways that would delay appropriate workup.
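To see how the same set of cases can yield both a high failure rate at the differential stage and a low one at the final-diagnosis stage, it helps to tally reviewer ratings per stage. The sketch below assumes each rating is a simple appropriate/inappropriate flag. The record layout, the function name `failure_rate_by_stage`, and the toy ratings are invented for illustration and are not the study's data or scoring rubric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (case_id, stage, rated_appropriate): a hypothetical record layout
Rating = Tuple[str, str, bool]


def failure_rate_by_stage(ratings: Iterable[Rating]) -> Dict[str, float]:
    """Return the fraction of ratings marked inappropriate, per stage."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for _case_id, stage, appropriate in ratings:
        totals[stage] += 1
        if not appropriate:
            failures[stage] += 1
    return {stage: failures[stage] / totals[stage] for stage in totals}


# Made-up ratings for five toy cases; NOT results from the study.
toy_ratings = [
    ("case-1", "differential", False), ("case-1", "final_diagnosis", True),
    ("case-2", "differential", False), ("case-2", "final_diagnosis", True),
    ("case-3", "differential", False), ("case-3", "final_diagnosis", True),
    ("case-4", "differential", False), ("case-4", "final_diagnosis", True),
    ("case-5", "differential", True),  ("case-5", "final_diagnosis", False),
]

rates = failure_rate_by_stage(toy_ratings)
print(rates)
```

With these toy ratings, the differential stage fails in four of five cases while the final-diagnosis stage fails in one of five. The two figures come from the same cases and differ only in which stage is being counted.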
In a supervised clinical setting, this level of error gets caught. A physician reviews the model's suggestions, identifies what's missing, and corrects course. The model is a tool; the physician is the clinician; the 80% error rate is an inconvenience rather than a patient safety event.
In an unsupervised setting - which is how many healthcare systems are quietly deploying AI - the early-stage errors propagate. If the differential diagnosis is wrong, the investigations it generates will miss important diagnoses. If the investigations miss important diagnoses, the treatment plan won't address the underlying condition. The final diagnosis step may still produce a correct answer if the model is eventually given complete information, but the path to getting there will have been longer and more expensive, and some patients will not make it through that path without harm.
The Supervision Gap
Co-author Marc Succi put it plainly: "Large language models in healthcare continue to require a 'human in the loop' and very close oversight." The paper concludes that current models "are not ready for unsupervised clinical-grade deployment."
The problem is that "human in the loop" and "very close oversight" describe a deployment posture that requires significant infrastructure, training, and institutional commitment - things that are often absent when healthcare systems acquire AI tools for efficiency reasons. The sales pitch is generally "AI reduces physician time spent on documentation/review/research." The fine print is "provided physicians supervise the AI output carefully, which is not exactly less work."
This creates a structural tension: the use cases where AI provides the most efficiency gain tend to be the ones where human review is most likely to be cursory, because the whole point is to save the clinician's time. And cursory review of a system that fails 80% of the time on early differential diagnosis is where patient safety events happen.
Relating This to What's Already Known
This is not the first study to find that AI medical advice is unreliable. Vibe Graveyard has documented several related incidents and research findings. A UCLA-led study published in BMJ Open found that nearly half of health chatbot answers were rated problematic by medical experts (BMJ Open medical chatbot audit). An Oxford RCT found that using AI chatbots for medical questions did not improve triage accuracy compared to controls (Oxford AI medical chatbots study). The ECRI Institute listed AI-enabled clinical decision support as a top health technology hazard for 2026 (ECRI AI chatbot hazard report).
What the JAMA Network Open study adds is methodological specificity about which part of clinical reasoning breaks down. The prior studies documented that AI health advice is often wrong; this one documents where in the clinical reasoning process the wrongness originates. Differential diagnosis - the first step, the reasoning-under-uncertainty step, the step that structures all subsequent care - is specifically where the models underperform most severely.
For any health system evaluating whether to deploy AI clinical decision support, that precision matters. It matters most before deployment; it is considerably less useful once the tool is already running across thousands of patient encounters.
The Gap Is Not an Accident
The 90% final-diagnosis accuracy figure is the result of testing models on cases where all clinically relevant information is provided upfront. That's how most AI medical benchmarks work, because it's easier to design and score. You give the model everything a physician would know at the end of a workup, and you see if it can name the diagnosis.
Real clinical encounters do not work this way. The information arrives over time. The physician - or in increasingly common deployments, the AI - has to work with what's available at each step, generate the right questions to ask and tests to order, and update the working hypothesis as new information arrives. This is the core skill of clinical medicine, and it's the skill the JAMA Network Open study found AI cannot yet perform reliably.
PrIME-LLM is a better benchmark specifically because it tests this process rather than just the final answer. The scores it produces (64-78% across 21 models) are a more honest assessment of clinical AI capability than the final-diagnosis accuracy numbers that dominate marketing materials.
Both AI vendors and the health systems deploying their tools have financial incentives to emphasize the 90% number. PrIME-LLM exists to make that harder to do honestly.