Study finds ChatGPT Health fails to flag over half of medical emergencies


The first independent safety evaluation of OpenAI's ChatGPT Health feature, published in Nature Medicine, found the tool failed to direct users to emergency care in 51.6% of cases requiring immediate hospitalization - instead recommending they stay home or book a routine appointment. The study also found ChatGPT Health frequently failed to detect suicidal ideation, with suicide crisis alerts sometimes triggering in lower-risk scenarios while failing to appear when users described specific plans for self-harm. Over 40 million people reportedly ask ChatGPT for health-related advice every day.

Incident Details

Severity: Catastrophic
Company: OpenAI
Perpetrator: AI assistant
Incident Date:
Blast Radius: Over 40 million daily health queries to ChatGPT; study demonstrates the tool under-triages emergencies in more than half of cases and inconsistently triggers suicide crisis alerts

Forty Million Questions a Day

ChatGPT Health launched in January 2026. Within weeks, OpenAI reported that approximately 40 million people were using it every day for health information and advice. That's an enormous number of people looking to an AI chatbot for guidance on whether their symptoms warrant a trip to the emergency room, a doctor's appointment, or a glass of water and a nap.

At the time of its rapid adoption, there was no independent data on the tool's reliability. OpenAI had conducted its own evaluations, but external researchers hadn't yet had the opportunity to put the system through rigorous, structured testing. Researchers at the Icahn School of Medicine at Mount Sinai decided to change that. Their paper, published in Nature Medicine, represents the first independent safety evaluation of the feature - and what they found should give pause to anyone who's been relying on ChatGPT Health to determine whether their chest pain is indigestion or a heart attack.

The Study Design

The research team developed 60 clinical scenarios spanning 21 medical specialties. These ranged from minor complaints that could safely be managed at home to acute medical emergencies requiring immediate hospitalization. Three independent physicians determined the appropriate level of urgency for each scenario, consulting guidelines from 56 medical professional associations to establish a consensus baseline.

To avoid testing only under ideal conditions, the researchers added realistic variability. Each scenario was tested under 16 different contextual circumstances - variations in gender, ethnicity, social factors (such as patients who downplayed their symptoms), and barriers to care like lack of insurance or transportation. This produced 960 total interactions with ChatGPT Health, each compared against the medical consensus for appropriate triage.
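The overall structure of the evaluation - scripted scenarios crossed with contextual variations, each response scored against a physician consensus - can be expressed in a few lines of code. The sketch below is purely illustrative: the scenario data, the triage scale, and the query_chatgpt_health function are hypothetical placeholders, not the study's actual materials.

```python
from itertools import product

# Hypothetical sketch of an evaluation harness like the one described above.
# All scenario data, context labels, and the query function are placeholders.

TRIAGE_LEVELS = ["self-care", "routine appointment", "urgent care", "emergency"]

# 60 clinical scenarios, each tagged with a physician-consensus triage level
# (the split between emergencies and routine cases here is arbitrary).
scenarios = [
    {"id": f"scenario_{i:02d}",
     "consensus": "emergency" if i < 20 else "routine appointment"}
    for i in range(60)
]

# 16 contextual variations layered onto every scenario (gender, ethnicity,
# symptom minimization, insurance or transportation barriers, and so on).
contexts = [f"context_{j:02d}" for j in range(16)]

def query_chatgpt_health(scenario: dict, context: str) -> str:
    """Placeholder for the call to the tool under test; returns a triage level."""
    return "routine appointment"  # stub response for illustration only

under_triaged = 0
emergencies = 0

for scenario, context in product(scenarios, contexts):  # 60 x 16 = 960 interactions
    recommendation = query_chatgpt_health(scenario, context)
    if scenario["consensus"] == "emergency":
        emergencies += 1
        # Count cases where the tool recommended anything less urgent than the ER.
        if TRIAGE_LEVELS.index(recommendation) < TRIAGE_LEVELS.index("emergency"):
            under_triaged += 1

print(f"Under-triage rate on emergency cases: {under_triaged / emergencies:.1%}")
```

The point of the cross-product design is that the same clinical facts are presented under many different social and demographic framings, so a failure rate reflects the system's behavior across realistic phrasings rather than a single canonical prompt.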

The design is notable for its thoroughness. This wasn't a researcher asking ChatGPT about a stomachache and declaring the technology broken. It was a systematically structured evaluation covering the breadth of clinical medicine, accounting for the kinds of real-world variables that affect how patients describe their symptoms and how they access care.

The Headline Finding: 51.6 Percent

In more than half of the cases that doctors determined needed immediate emergency care, ChatGPT Health did not recommend going to the emergency room. Instead, it suggested that users stay home, monitor their symptoms, or book a routine appointment.

The system performed reasonably well on textbook emergencies - the kind of cases where symptoms are unambiguous. A classic presentation of stroke or a severe allergic reaction generally triggered appropriate urgency recommendations. The problems emerged in more complex clinical situations, where the correct course of action requires the kind of judgment that comes from medical training rather than pattern matching.

In some cases, the system appeared to recognize the relevant risk factors, mentioning them in its explanatory text, but still concluded with reassuring advice. That may be worse than simply missing the danger signs entirely. A system that identifies risk factors and then tells you not to worry about them creates a false sense of informed reassurance - the user sees that the AI considered their symptoms carefully and still said they'd be fine.

Suicide Warnings That Worked Backwards

The second major finding concerned ChatGPT Health's built-in suicide prevention functionality. The feature is designed to display alerts directing users to the 988 Suicide & Crisis Lifeline when it detects high-risk situations. In principle, this is exactly the kind of safety mechanism you'd want in a health tool used by tens of millions of people.

In practice, the researchers found the warnings triggered inconsistently. More troublingly, the pattern of inconsistency wasn't random - it was inverted relative to clinical risk. The suicide crisis alerts sometimes appeared during relatively low-risk interactions while failing to activate when users described specific, concrete plans for self-harm.

The researchers' characterization of the warnings as "inverted relative to clinical risk" is striking. It means the safety mechanism was not merely unreliable but was, in some cases, more likely to intervene when intervention was less needed and less likely to intervene when it was most needed. For a feature specifically designed to catch the highest-risk situations, this pattern represents a fundamental failure of the safety system's core purpose.

The Scale of the Problem

What makes this study significant beyond its technical findings is the context in which ChatGPT Health operates. This is not a niche medical research tool used by specialists who bring their own clinical judgment to its outputs. It's a consumer product used by 40 million people daily, many of whom are turning to it precisely because they lack access to (or can't afford) traditional medical advice.

The populations most likely to rely on AI for health guidance - those without insurance, those in areas with limited healthcare access, those who face barriers to seeing a doctor - are also the populations most vulnerable to incorrect triage recommendations. The study's inclusion of contextual variables like insurance status and transportation barriers wasn't academic window dressing. It reflects the actual conditions under which real people use these tools to make real decisions about their health.

When a system used at this scale tells someone experiencing a genuine medical emergency that they can safely stay home, the consequences aren't theoretical. Delayed treatment for conditions like cardiac events, strokes, or ectopic pregnancies can be, and regularly is, the difference between recovery and permanent harm or death.

The Structural Gap

The Mount Sinai researchers emphasized that their findings don't mean consumers should abandon AI health tools entirely. Their position was more measured: AI systems should be treated as supplements to, not replacements for, clinical judgment. In cases involving worsening or concerning symptoms - chest pain, shortness of breath, severe allergic reactions, changes in consciousness, or thoughts of self-harm - medical help should always be sought immediately, regardless of what an AI recommends.

The research team announced plans to continue evaluating future versions of ChatGPT Health and other consumer AI health tools, expanding their focus to include pediatrics, medication safety, and non-English language contexts. The latter is particularly relevant given the global reach of ChatGPT and the likelihood that its health feature will be used worldwide, including in contexts where healthcare infrastructure is even more strained than in the United States.

The deeper issue the study identifies is structural. ChatGPT Health was deployed to tens of millions of users on the basis of OpenAI's internal evaluations, without independent safety testing that matched the product's enormous reach. The Mount Sinai study filled that gap months after launch, by which point the tool had already been used billions of times. Because large language models are frequently updated, their performance can change with any model revision - meaning a safety evaluation conducted today may not reflect the tool's behavior tomorrow.
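One way to address that moving target, at least in principle, is to treat a safety evaluation as a regression suite that is re-run against each model revision and logged alongside a version identifier. The sketch below is a hypothetical illustration of that idea, not part of the Mount Sinai study; the evaluate_suite function, metric names, and version string are assumptions made for the example.

```python
import json
from datetime import date

def evaluate_suite(model_version: str) -> dict:
    """Placeholder: would replay the full set of scripted interactions
    against the named model revision and compute safety metrics."""
    return {"under_triage_rate": None, "crisis_alert_misses": None}

def record_run(model_version: str, path: str = "safety_runs.jsonl") -> None:
    """Append one JSON line per evaluation run so that changes in behavior
    between model revisions remain visible over time."""
    result = {
        "date": date.today().isoformat(),
        "model_version": model_version,
        "metrics": evaluate_suite(model_version),
    }
    with open(path, "a") as f:
        f.write(json.dumps(result) + "\n")

if __name__ == "__main__":
    record_run("chatgpt-health-2026-01")  # hypothetical version identifier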

For a tool that millions of people use to decide whether their symptoms require emergency care, the researchers argued that independent evaluation isn't a luxury or an afterthought. It's a necessary condition for safe deployment. The question is whether the industry will treat it that way, or whether external safety testing will continue to lag months behind products that are already shaping how millions of people make life-and-death decisions about their health.
