BMJ Open audit finds half of AI health chatbot answers problematic under stress testing
A UCLA-led team published a BMJ Open audit of five major consumer chatbots (ChatGPT, Gemini, Grok, Meta AI, DeepSeek) on 250 adversarial health prompts across cancer, vaccines, stem cells, nutrition, and athletic performance. Experts rated 49.6% of answers problematic overall; Grok produced more highly problematic replies than chance would predict, while Gemini produced the fewest. Reference lists were a mess (median completeness 40%), and no model produced a fully accurate bibliography across 25 citation requests.
Footnotes That Go Nowhere
Medical misinformation does not always arrive as a screaming Facebook post. Sometimes it shows up in calm prose, with numbered references and the tone of someone who has read a textbook. That is the version generative chatbots are unusually good at producing, which is exactly why audits like this one matter.
In mid-April 2026, BMJ Open published an original audit led by researchers including Noah B. Tiller at UCLA, titled "Generative artificial intelligence-driven chatbots and medical misinformation: an accuracy, referencing and readability audit." The paper is open access (DOI 10.1136/bmjopen-2025-112695; BMJ Publishing Group also issued a plain-language summary the same week). The design is blunt: stress five popular chatbots with prompts in categories where the internet already circulates bad advice, then have domain experts score what comes back.
The products were the usual suspects: ChatGPT (OpenAI), Gemini (Google), Grok (xAI), Meta AI (Meta), and DeepSeek (High-Flyer). The team ran fifty prompts per bot, two hundred fifty answers in total, across cancer, vaccines, stem cells, nutrition, and athletic performance. Prompts were written to resemble real information-seeking questions while also nudging models toward misinformation or advice that conflicts with mainstream medical standards. That is not cheating; it is red teaming. The goal is not to simulate a polite wellness blog. It is to see what happens when ordinary people ask hard questions in messy domains.
The Numbers
Expert raters scored each answer as non-problematic, somewhat problematic, or highly problematic using predefined criteria. Across all bots, 49.6% of responses landed in the problematic bucket: 30% somewhat problematic and 19.6% highly problematic. Overall quality did not differ significantly across vendors in the aggregate statistical test the authors report, but Grok generated significantly more highly problematic answers than a random distribution would predict. Gemini went the other direction, with fewer highly problematic replies and more clean passes.
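To make the headline percentages concrete, they convert cleanly back to raw counts over the 250 answers, and the Grok finding is the kind of claim a simple binomial check can illustrate. A sketch, using a hypothetical Grok count; the paper's per-model tallies and its actual statistical test are not reproduced here:

```python
from scipy.stats import binomtest

TOTAL = 250  # 5 chatbots x 50 prompts each

# Counts implied by the reported percentages.
somewhat = round(0.30 * TOTAL)     # 75 answers rated somewhat problematic
highly = round(0.196 * TOTAL)      # 49 answers rated highly problematic
problematic = somewhat + highly    # 124 answers, i.e. 124 / 250 = 49.6%
print(problematic, problematic / TOTAL)

# Illustration only: if highly problematic answers fell evenly across bots, each
# bot's 50 answers would carry a 19.6% baseline rate. The Grok count below is
# hypothetical, not a figure from the paper, and the authors' test may differ.
hypothetical_grok_highly = 17
result = binomtest(hypothetical_grok_highly, n=50, p=highly / TOTAL, alternative="greater")
print(result.pvalue)
```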
Topic mattered a lot. Performance was least awful on vaccines and cancer, where large curated literatures give models more to imitate. It was worst on nutrition, athletic performance, and to a lesser extent stem cells; those are fields where confident influencers, affiliate marketing, and forum lore already pollute the training soup.
Open-ended prompts were where the wheels came off. They produced forty highly problematic answers, far more than closed-ended prompts, which makes intuitive sense. Real patients rarely ask "true or false: is bleach a cancer cure." They ask broad lifestyle questions that invite listicles, hedging, and invented specificity.
Confidence Without Grounding
One of the creepier details is tonal. The chatbots answered with confidence and certainty, rarely offering caveats. Out of 250 questions, only two received a refusal, both from Meta AI (on anabolic steroids and non-traditional cancer therapies, per CIDRAP's summary of the paper). Everything else sailed through, including material experts later judged harmful or misleading.
Then there were the references. The researchers asked each system for ten scientific references and scored completeness. The median completeness score was 40%. No chatbot produced a fully accurate reference list across twenty-five attempts. That is not a rounding error. It is a systematic inability to attach real citations to fluent claims, which is a special kind of danger because citations look like epistemic hygiene to a lay reader.
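For a sense of what a completeness score measures, here is a minimal sketch assuming a "count the bibliographic fields actually supplied" rubric; the paper's own scoring criteria are not reproduced here, and field presence says nothing about whether the citation points to a real paper, which is the harder accuracy check the bots also failed:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Citation:
    """Bibliographic fields a reader would need to actually find the paper."""
    authors: Optional[str] = None
    title: Optional[str] = None
    journal: Optional[str] = None
    year: Optional[int] = None
    doi: Optional[str] = None

def completeness(c: Citation) -> float:
    """Fraction of fields the chatbot-supplied citation bothers to include."""
    values = [getattr(c, f.name) for f in fields(c)]
    return sum(v not in (None, "") for v in values) / len(values)

# A fluent but hollow reference: author, title and year, but no journal or DOI.
ref = Citation(authors="Smith J", title="Stem cell therapy outcomes", year=2021)
print(completeness(ref))  # 0.6
```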
Readability landed in the "difficult" band on the Flesch scale the authors used, meaning you already need college-level reading comfort to parse answers that may still be wrong. So you get dense text, authoritative tone, broken footnotes, and almost no refusals. If you were trying to design a misinformation engine by accident, you could do worse.
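The Flesch Reading Ease formula behind that "difficult" label is standard: 206.835 minus 1.015 times the average sentence length in words, minus 84.6 times the average syllables per word, with scores of roughly 30 to 50 conventionally read as college-level. A rough sketch, with a crude vowel-group heuristic standing in for a proper syllable counter and an invented sample sentence:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; real tools use pronunciation dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)

# Long sentences full of multisyllabic words drag the score down toward the
# "difficult" band (roughly 30-50, i.e. college-level reading).
sample = ("The evidence for this supplement is limited, and the claimed benefits "
          "have not been confirmed in controlled trials.")
print(round(flesch_reading_ease(sample), 1))
```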
How This Fits The Graveyard
Vibe Graveyard is not here to dunk on research for using adversarial prompts. The whole point is that users do adversarial things to themselves every day. They ask chatbots about miracle cures, vaccine fears, stem-cell tourism, and supplement stacks because search engines trained them that a text box owes them an answer. This study simulates that reality with ethics board paperwork.
It also complements work already on the site without duplicating it. The Stanford Science paper on sycophancy (Stanford sycophancy study) measured how models validate bad interpersonal decisions. The Oxford-led RCT published in Nature Medicine (Oxford chatbot medical advice RCT) showed that real patients did not gain triage skill from chatbots. The BMJ Open audit adds a different lens: misinformation-prone clinical topics, explicit reference auditing, and head-to-head behavior across five vendors under the same rubric.
None of that means the models are useless for health-adjacent tasks when supervised. It does mean that shipping them to the public as omniscient oracles, and letting social platforms route scared people into free tiers, is an incident waiting to happen at scale.
Limitations Worth Stating Clearly
The authors acknowledge limits. Five chatbots, each tested at a single point in time (prompts fielded in February 2025, per the paper's methods section as summarized by CIDRAP and The Conversation), cannot track every model refresh. Adversarial framing may overestimate everyday error rates for bland questions like "what is a normal resting heart rate." Paid tiers and newer checkpoints might behave differently; the study does not claim otherwise.
Still, the conservative reading is bad enough. Even if real-world prompts are gentler half the time, the tail risk lives in the hard questions. That is where people Google at 2 a.m. That is where scams live. That is where chatbots currently speak with the voice of a specialist and the bibliography of a cosplayer.
The Actual Blast Radius
The blast radius is not "researchers got a bad completion." It is the intersection of three trends: rapid consumer adoption, regulatory lag, and interfaces that hide uncertainty. BMJ Open is not a tech blog; it is a mainstream medical journal signaling that independent academics now treat this as a patient-safety and public-health issue, not a novelty.
Downstream coverage from CIDRAP (University of Minnesota's Center for Infectious Disease Research and Policy) and The Conversation matters because those channels reach clinicians, educators, and curious patients who will never read the PDF. They still deserve the primary citation. If you cite this story, cite the paper first, then the summaries.
Closing Thought
Large language models predict plausible tokens. They do not verify claims against hospital-grade evidence bases. When vendors market assistants as broadly knowledgeable, they inherit the reputational risk of every confident paragraph their weights emit. This audit adds another row to the spreadsheet: half problematic under expert review, references mostly hollow, refusals almost absent.
If your product strategy assumes users will "just know" not to trust the chatbot for medicine, this paper is another data point that the market has not gotten the memo.