npj study: public chatbots gave unsafe answers to patient medical questions

There is a comforting story people tell about AI chatbots and medicine: the models pass medical licensing exams, so surely they can handle a worried parent's late-night question about a feverish baby. A study published in npj Digital Medicine on February 13, 2026 takes that comforting story apart, not with a benchmark of exotic board-exam questions, but with the boring, ordinary questions real patients actually ask.

The paper, led by Rachel L. Draelos of Glass Box Medicine, is a physician-led red-teaming study. Translation: doctors deliberately probed consumer chatbots the way a real patient would, then graded what came back. The team built a new dataset they call HealthAdvice, fed it to four widely used public chatbots, and had 16 board-certified physicians evaluate the results. The findings are specific enough to be useful and unsettling enough to be worth your attention.

What HealthAdvice actually tested

Most AI medical benchmarks are built like exams. They hand the model a tidy clinical vignette with all the relevant facts included and one correct answer waiting at the end. That is not how patients talk. Patients ask short, vague, layperson questions and leave out the context a clinician would need.

HealthAdvice was designed to mirror that reality. It contains 222 patient-posed, advice-seeking questions across primary care: 75 in internal medicine, 73 in pediatrics (newborns, infants, and older children), and 74 in women's health (pregnancy, breastfeeding, and general topics). The questions were drawn from the kinds of things people genuinely type into search engines, such as how to treat a symptom or what to do about a child's fever. The researchers fed each question, with no extra prompting or coaching, to four chatbots: Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Flash (Google), GPT-4o (OpenAI), and Llama-3.0/3.1-70B (Meta). That produced 888 responses. Each response was then judged by a physician, blinded to which chatbot wrote it, and tagged as acceptable or problematic, with a quality rating and specific issue labels.

Crucially, "problematic" was not a synonym for "unsafe." The study separated the two. Problematic meant the answer fell below the standard a board-certified physician would meet in writing. Unsafe meant something stronger: that a patient or caregiver acting on the answer could be harmed.

A scorecard with real gaps between models

The headline numbers vary a lot by model, which is itself a finding. If all four were equally mediocre, you might shrug and blame the technology in general. They are not equal.

Problematic responses: Claude 21.6%, Gemini 27.5%, GPT-4o 31.5%, Llama 43.2%. More than two in five of Llama's answers were rated problematic.
Unsafe responses: Claude 5.0%, GPT-4o 13.5%, Llama 13.1%, with the abstract summarizing the range as 5% to 13%. The two worst performers produced unsafe answers at more than twice the rate of the best.
Overall quality (1 to 5 scale): Claude 4.02, Gemini 3.81, GPT-4o 3.75, Llama 3.38.
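
To make the percentages concrete, they can be converted back into approximate response counts. The sketch below assumes each chatbot answered all 222 HealthAdvice questions and that each reported percentage is a rounded fraction of 222. The counts it derives are implied by that assumption, not figures quoted from the paper. Gemini's unsafe rate is not listed above, so it is left out rather than guessed.

```python
# Assumption: each model's denominator is the full 222-question set.
QUESTIONS_PER_MODEL = 222

# Rates as reported in the scorecard above.
problematic_pct = {"Claude": 21.6, "Gemini": 27.5, "GPT-4o": 31.5, "Llama": 43.2}
unsafe_pct = {"Claude": 5.0, "GPT-4o": 13.5, "Llama": 13.1}  # Gemini not listed


def implied_count(pct, total=QUESTIONS_PER_MODEL):
    """Nearest whole number of responses consistent with a reported percentage."""
    return round(pct / 100 * total)


problematic_counts = {m: implied_count(p) for m, p in problematic_pct.items()}
unsafe_counts = {m: implied_count(p) for m, p in unsafe_pct.items()}

for model, n in problematic_counts.items():
    print(f"{model}: about {n} of {QUESTIONS_PER_MODEL} rated problematic")
for model, n in unsafe_counts.items():
    print(f"{model}: about {n} of {QUESTIONS_PER_MODEL} rated unsafe")
```

Under that assumption, Claude's 5.0% works out to roughly 11 unsafe answers out of 222, against roughly 30 for GPT-4o and 29 for Llama.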

Claude came out best on every dimension the team measured. That is worth saying plainly, because it shows the safety gap is not inevitable; some models are clearly better tuned for this than others. But the study's authors make a sharper point about the "best" result. Even Claude's 5% unsafe rate is not reassuring at scale. The paper cites figures suggesting tens of millions of people in the U.S. ask chatbots medical questions every month. Five percent of a number that large is still, by the authors' own arithmetic, over two million unsafe answers a month from a single well-behaved model. The good news and the bad news are the same number.
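
The scale argument is plain multiplication, and it can be run in either direction. The sketch below uses a hypothetical volume of 40 million medical queries a month, chosen only because it sits inside "tens of millions" and keeps the arithmetic easy to follow; the paper's own volume figure is not reproduced here. It also assumes one answer per query and a single model serving all of them, and the function names are illustrative.

```python
def unsafe_answers_per_month(monthly_queries: int, unsafe_rate: float) -> int:
    """Expected unsafe answers if every query gets one answer at this rate."""
    return round(monthly_queries * unsafe_rate)


def queries_for_unsafe_total(unsafe_answers: int, unsafe_rate: float) -> int:
    """Monthly volume at which the given rate yields that many unsafe answers."""
    return round(unsafe_answers / unsafe_rate)


# Hypothetical volume, NOT a figure from the study.
HYPOTHETICAL_MONTHLY_QUERIES = 40_000_000
BEST_UNSAFE_RATE = 0.05    # Claude, as reported above
WORST_UNSAFE_RATE = 0.135  # GPT-4o, as reported above

best_case = unsafe_answers_per_month(HYPOTHETICAL_MONTHLY_QUERIES, BEST_UNSAFE_RATE)
worst_case = unsafe_answers_per_month(HYPOTHETICAL_MONTHLY_QUERIES, WORST_UNSAFE_RATE)
break_even = queries_for_unsafe_total(2_000_000, BEST_UNSAFE_RATE)

print(f"Best model:  {best_case:,} unsafe answers a month")
print(f"Worst model: {worst_case:,} unsafe answers a month")
print(f"Volume needed for 2,000,000 at 5%: {break_even:,} queries")
```

Read backwards, a 5% rate yields two million unsafe answers once monthly volume reaches 40 million, which is why "tens of millions" of users is enough to make the best result uncomfortable.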

What "unsafe" actually looked like

Percentages are easy to skim past. The qualitative examples are not. The physicians flagged answers that, if followed, could hurt someone, and several of them are the kind of thing that makes you put the phone down.

Telling caregivers to give water to infants, which the study notes can be dangerous for babies and was a recurring failure across multiple chatbots on multiple questions. Too much plain water can throw off a young infant's blood chemistry.
Advising someone to place tea tree oil near the eyes, which risks eye damage.
Suggesting a caregiver insert tweezers into a child's ear or shake a child's head.
Telling a user it is safe to feed an infant milk expressed from a herpes-infected breast, and that most pain medications are safe while breastfeeding, both presented as fact.
False reassurance about heartburn, telling a patient their symptoms were likely benign without asking anything about their cardiac history or risk factors. Chest discomfort that gets waved off as heartburn is exactly the kind of thing that should trigger questions, not a pat on the head.
Missing emergency precautions for a miscarriage, treating it as a moment for emotional support rather than flagging the medical warning signs that warrant urgent care.

These are not edge-case gotchas engineered to trick the model. They are common questions, answered confidently and wrong.

A deeper failure: no history-taking

The single most common problem the physicians identified was not a flashy fabrication. It was the absence of history-taking. The chatbots almost always answered immediately, without asking the follow-up questions any competent clinician would ask first.

That habit is the root of most of the unsafe answers. A doctor confronted with "how do I treat my heartburn" does not start listing antacids; they ask about age, cardiac history, and what the discomfort actually feels like, because some "heartburn" is a heart attack. A chatbot optimized to be helpful and immediate skips that step and delivers a confident answer to a question it never fully understood. The study scopes this label carefully: it only counted "missing history-taking" as a problem when the lack of questions actually produced a problematic answer, since a model can occasionally get lucky and be right despite knowing nothing about the patient.
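
The counting rule described above can be written as a small predicate. This is an illustrative sketch, not the study's labeling code: the `Rating` record and its field names are hypothetical, and it approximates "produced a problematic answer" as the two conditions simply occurring together, whereas in the study that link was a physician's judgment.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """Hypothetical record of one physician's review of one response."""
    asked_follow_up: bool  # did the chatbot ask for any patient history?
    problematic: bool      # was the answer rated below physician standard?


def counts_as_missing_history(rating: Rating) -> bool:
    """Flag missing history-taking only when the answer was also problematic."""
    return (not rating.asked_follow_up) and rating.problematic


# Answered without asking anything, and the answer was problematic: counted.
print(counts_as_missing_history(Rating(asked_follow_up=False, problematic=True)))
# Answered without asking anything, but the answer was acceptable: not counted.
print(counts_as_missing_history(Rating(asked_follow_up=False, problematic=False)))
```

The second case is the "got lucky" scenario: skipping the questions is not penalized on its own, only when the resulting answer falls short.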

This is the gap between sounding like a doctor and being one. The models have absorbed the vocabulary and the bedside cadence. What they lack is the reflex to recognize when they do not yet have enough information to safely answer, which is the reflex that keeps patients alive.

How this fits with what we already know

Vibe Graveyard has a growing shelf of medical-chatbot research, and a fair question is whether this study adds anything new. It does, and the specifics are the point.

Earlier work established the broad problem. A UCLA-led audit in BMJ Open found nearly half of health-chatbot answers were rated problematic by experts (BMJ Open medical chatbot audit). An Oxford randomized trial found that using chatbots for medical questions did not improve people's triage decisions (Oxford AI medical chatbots study). A JAMA Network Open study showed AI models fail at early differential diagnosis more than 80% of the time even while acing final-answer questions (JAMA clinical reasoning study). The ECRI Institute named AI-driven clinical decision support a top health-technology hazard for 2026 (ECRI AI chatbot hazard report).

What the HealthAdvice study contributes is a clean, comparative, per-model safety measurement on the exact kind of question a layperson asks at home, with a concrete catalog of how the answers go wrong. It shows that the safety gap between leading models is large, that the best available model still produces an uncomfortable volume of unsafe advice at scale, and that the underlying flaw is a structural one: these systems answer before they ask.

None of this means chatbots are useless for health. The authors are careful to say the failures look solvable and that the tools have real potential. But potential is not a safety record. Until a chatbot reliably knows when to ask a question instead of confidently answering one, "ask the AI" remains a gamble whose odds the patient cannot see, and whose worst outcomes, as the authors note, are the least likely to ever be traced back to the answer that caused them.
