Study finds AI chatbots no better than search engines for medical advice
A randomized controlled trial published in Nature Medicine with 1,298 UK participants found that AI chatbot users (GPT-4o, Llama 3, Command R+) performed no better than the control group at assessing clinical urgency and worse at identifying relevant medical conditions. In one case, two users describing essentially the same subarachnoid hemorrhage symptoms received opposite recommendations - one was told to lie down in a dark room, the other was correctly advised to seek emergency care.
Passing Exams Is Not Practicing Medicine
AI chatbots can pass medical licensing exams. This has been one of the most frequently cited achievements in the AI hype cycle - proof, supposedly, that large language models understand medicine well enough to help patients. The problem is that passing an exam and actually helping a real person figure out whether they should go to the emergency room are fundamentally different tasks.
A randomized controlled trial published in Nature Medicine in February 2026 set out to test what happens when non-medical members of the public use AI chatbots for medical guidance. The results were, at best, discouraging.
The Study Design
The research was led by the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, in partnership with MLCommons and other institutions. It was the largest user study to date examining how large language models support real people making medical decisions.
A total of 1,298 participants from the UK general public - none with medical training - were randomly assigned to one of four groups. Three groups were each given access to a different AI chatbot: OpenAI's GPT-4o, Meta's Llama 3, and Cohere's Command R+. The fourth group, serving as the control, was told to use whatever resources they would normally use at home - search engines, NHS websites, asking family members, or simply guessing.
Each participant was presented with clinical scenarios designed specifically for the study by three physicians. The scenarios ranged from relatively mundane to genuinely urgent: a young man developing a severe headache after a night out with friends, a new mother feeling constantly out of breath and exhausted. For each scenario, participants were asked to assess how urgently the person should seek medical care and to identify the relevant underlying condition.
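To make the two outcome measures concrete, here is a minimal sketch of how a vignette and a participant's answers could be scored against the physicians' gold standard. The paper's exact rubric and urgency labels are not described here, so the field names, scale wording, and scoring rule below are illustrative assumptions, not the study's instrument.

```python
from dataclasses import dataclass

# Illustrative ordinal urgency scale, from least to most urgent
# (assumed labels, not the study's exact wording).
URGENCY_LEVELS = ["self-care at home", "routine appointment",
                  "urgent appointment", "emergency care"]

@dataclass
class Scenario:
    """A physician-authored vignette with its gold-standard answers."""
    description: str
    gold_urgency: str      # one of URGENCY_LEVELS
    gold_condition: str    # the relevant underlying condition

@dataclass
class Response:
    """What a participant concluded after using a chatbot or their usual resources."""
    chosen_urgency: str
    suspected_condition: str

def score(scenario: Scenario, response: Response) -> dict:
    """Score one response against the gold standard on both outcome measures."""
    return {
        "urgency_correct": response.chosen_urgency == scenario.gold_urgency,
        "condition_correct": response.suspected_condition.lower()
                             == scenario.gold_condition.lower(),
    }

# Example: the headache vignette, answered correctly on urgency but not condition.
headache = Scenario(
    description="Young man with a sudden, severe headache after a night out",
    gold_urgency="emergency care",
    gold_condition="subarachnoid hemorrhage",
)
print(score(headache, Response("emergency care", "migraine")))
# {'urgency_correct': True, 'condition_correct': False}
```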
The Results
The LLM-assisted groups performed no better than the control group at assessing clinical acuity. People using GPT-4o, Llama 3, or Command R+ were no more accurate than people relying on Google, the NHS website, or their own judgment at determining whether a scenario required immediate emergency care, an urgent appointment, a routine visit, or simply staying home.
Worse still, the chatbot groups underperformed the control group at identifying the relevant medical conditions. The AI tools did not help users arrive at better diagnoses; they helped them arrive at worse ones.
The chatbot groups also consistently underestimated the urgency of clinical scenarios. For cases where the medically appropriate response was to go to the emergency room immediately, the chatbots frequently recommended less urgent action. One analysis found that in emergency-level scenarios, one chatbot recommended emergency evaluation in only about half of cases. The rest of the time, it suggested routine appointments or home care for conditions that physicians deemed emergent.
Conversely, for scenarios where the appropriate response was to stay home, the chatbots over-triaged, recommending doctor's appointments roughly 65% of the time when no medical visit was needed. The result was a model that simultaneously missed real emergencies and created unnecessary anxiety about non-emergencies - the worst of both worlds for a triage tool.
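To see how those two failure modes can be quantified at once, here is a small hypothetical calculation of under-triage and over-triage rates from a recommendation log. The log entries and resulting percentages are invented for illustration and are not the study's data; the urgency labels reuse the assumed scale from the earlier sketch.

```python
# Hypothetical recommendation log: (gold-standard urgency, chatbot recommendation).
logs = [
    ("emergency care", "emergency care"),
    ("emergency care", "routine appointment"),    # under-triage: missed emergency
    ("emergency care", "self-care at home"),      # under-triage: missed emergency
    ("self-care at home", "routine appointment"), # over-triage: unnecessary visit
    ("self-care at home", "self-care at home"),
    ("self-care at home", "urgent appointment"),  # over-triage: unnecessary visit
]

emergencies = [rec for gold, rec in logs if gold == "emergency care"]
home_care   = [rec for gold, rec in logs if gold == "self-care at home"]

# Under-triage: share of true emergencies where anything short of emergency care was advised.
under_triage = sum(rec != "emergency care" for rec in emergencies) / len(emergencies)
# Over-triage: share of stay-at-home scenarios where some medical visit was advised anyway.
over_triage = sum(rec != "self-care at home" for rec in home_care) / len(home_care)

print(f"under-triage rate: {under_triage:.0%}")  # 67% in this toy log
print(f"over-triage rate:  {over_triage:.0%}")   # 67% in this toy log
```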
The Subarachnoid Hemorrhage Problem
The most alarming finding centered on a scenario describing symptoms of a subarachnoid hemorrhage - a type of brain bleed that is a genuine medical emergency requiring immediate treatment. Two participants sent very similar messages to the chatbot describing essentially the same set of symptoms.
They received opposite advice.
One user was correctly advised to seek emergency care immediately. The other was told to lie down in a dark room. Forbes reported on this case as a particularly stark illustration of the inconsistency problem. Same clinical scenario, same chatbot system, two entirely different - and potentially life-or-death - recommendations. A prehospital emergency where minutes matter, and the AI's response amounted to a coin flip.
This is not a theoretical failure mode. Subarachnoid hemorrhages have a mortality rate of approximately 50%, and outcomes are heavily dependent on rapid treatment. Telling someone experiencing one to lie down in a dark room is advice that could directly contribute to a death.
Why Benchmarks Do Not Predict Real-World Performance
The study's findings illuminate a gap that the AI industry has been slow to acknowledge: benchmark performance does not predict real-world clinical utility.
When AI chatbots are tested on medical licensing exams, they are presented with structured questions that have clear correct answers, in a controlled format, with all relevant information provided. When a real person with no medical training asks a chatbot about their symptoms, the interaction is nothing like an exam. The person may describe their symptoms imprecisely, omit critical details, use colloquial language, or ask follow-up questions that push the conversation in unhelpful directions.
The chatbot, meanwhile, has no ability to physically examine the patient, no access to their medical history (unless they provide it), and no ability to ask the kinds of probing follow-up questions that a trained clinician would. It can only work with what the user types, and the study demonstrated that this gap between ideal input and real-world input is large enough to eliminate any advantage the AI might theoretically provide.
Dr. Rebecca Payne, the study's lead medical practitioner from the Nuffield Department of Primary Care Health Sciences at Oxford and Bangor University, stated: "These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health." She drew an explicit parallel to pharmaceutical regulation: "Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare."
The Industry Context
The study arrived at a moment when OpenAI, Anthropic, Amazon, and other AI companies are actively expanding into healthcare. ChatGPT Health has been positioned as a consumer-facing medical assistant, and these companies are pursuing integration with patient medical records and clinical workflows.
The Oxford study suggests that the foundation these products are built on - the idea that AI chatbots can reliably help non-experts make medical decisions - is not supported by evidence from real-world testing. In the trial, the chatbots performed comparably to a control group that included people who simply searched the internet, and they introduced dangerous inconsistency in precisely the kinds of urgent scenarios where consistency matters most.
What This Means
The study does not prove that AI has no role in healthcare. It proves something narrower and more important: that the current generation of general-purpose AI chatbots, when used by members of the general public to assess their own symptoms, provides no measurable benefit over existing resources and introduces potentially life-threatening inconsistencies.
Every fresh medical graduate understands something that AI benchmark scores obscure: the ability to pass an exam does not automatically translate into the ability to care for patients. The Oxford study made that point with 1,298 participants and at least one person who was told to lie down in a dark room when they may have been experiencing a brain bleed. The gap between what AI chatbots can do on a test and what they can do in a living room with a worried person typing out their symptoms is not a minor implementation detail. It is the entire problem.