Babylon chatbot 'beats GPs' claim collapsed

Babylon unveiled its AI symptom checker at the Royal College of Physicians and bragged that it had scored 81% on the MRCGP exam, but the claim could not be verified, and critics warned that no chatbot can replace human judgment. Independent clinicians who later dissected Babylon's marketing study in The Lancet told Undark that the tiny, non-peer-reviewed test offered no proof the tool outperforms doctors and that it might even perform worse.

Incident Details

Severity: Facepalm
Company: Babylon Health
Perpetrator: Startup
Incident Date:
Blast Radius: Patient harm, eroded trust, and regulators forced to demand real clinical trials.

Babylon Health was a London-based startup with a pitch that scaled to cosmic proportions. Founded by Ali Parsa, the company promised to "do with healthcare what Google did with information," making medical care "accessible and affordable to every human being on Earth." The flagship product was a symptom-checking chatbot that would triage patients, assess their symptoms through a conversational interface, and either recommend self-care, suggest a GP visit, or direct them to accident and emergency services. The chatbot was part of a larger platform called GP at Hand, which partnered with the NHS and eventually accumulated over 2.3 million UK users who could book medical appointments and have video consultations with doctors through a smartphone app.
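
To make the triage flow concrete, here is a minimal sketch, in Python, of the routing decision such a system performs. Everything in it is hypothetical - the condition names, urgency levels, and thresholds are illustrative placeholders, not Babylon's actual logic.

    from enum import Enum

    class Disposition(Enum):
        SELF_CARE = "self-care advice"
        SEE_GP = "book a GP appointment"
        EMERGENCY = "go to A&E"

    # Hypothetical sketch: a symptom checker maps the conversation to a
    # probability distribution over conditions, then routes on the
    # urgency of the most likely one.
    def triage(condition_probs, urgency_levels):
        top = max(condition_probs, key=condition_probs.get)
        level = urgency_levels.get(top, 0)  # 0 benign, 1 see a GP, 2 emergency
        if level >= 2:
            return Disposition.EMERGENCY
        if level == 1:
            return Disposition.SEE_GP
        return Disposition.SELF_CARE

    # Chest pain scoring high on a cardiac condition must route to A&E -
    # exactly the failure mode clinicians later documented.
    print(triage({"myocardial infarction": 0.6, "indigestion": 0.4},
                 {"myocardial infarction": 2, "indigestion": 0}))

The hard part, of course, is the probability model feeding that function; the routing itself is trivial, which is why an exam score says so little about it.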

On June 27, 2018, Babylon presented its AI system at the Royal College of Physicians in London and made a headline-ready claim: the chatbot had scored 81 percent on the MRCGP exam - the Membership of the Royal College of General Practitioners exam, the final qualification for trainee general practitioners in the UK. The average pass mark for human candidates over the preceding five years had been 72 percent. Babylon's message was unmistakable: its AI was outperforming doctors on their own licensing exam.

The claim traveled fast. BBC News covered it. Tech publications amplified it. Babylon's profile rose accordingly.

What was actually tested

The 81 percent figure came from a study that Babylon had conducted internally. The company presented the AI with questions drawn from the MRCGP's applied knowledge test and measured how many it answered correctly. This is less impressive than it sounds.
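
In scoring terms, that evaluation reduces to marking multiple-choice answers and reporting a percentage. A hedged sketch of the arithmetic, with invented placeholder items since the real AKT question bank is not public:

    # Illustrative only: scoring a system on multiple-choice exam items.
    # The questions and answers below are placeholders, not AKT content.
    items = [
        {"id": "q1", "correct": "B", "model_answer": "B"},
        {"id": "q2", "correct": "D", "model_answer": "A"},
        {"id": "q3", "correct": "C", "model_answer": "C"},
    ]

    right = sum(q["model_answer"] == q["correct"] for q in items)
    score = 100 * right / len(items)

    PASS_MARK = 72  # the average human pass mark Babylon cited
    verdict = "above" if score > PASS_MARK else "below"
    print(f"score: {score:.0f}% ({verdict} the {PASS_MARK}% pass mark)")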

The MRCGP exam is a multi-component assessment. The applied knowledge test is one part of it - a written exam covering clinical medicine, evidence interpretation, and organizational questions. It does not involve examining patients, interpreting ambiguous symptoms in real time, managing uncertainty, or any of the other tasks that constitute the actual job of a GP. Passing a written multiple-choice test is a constrained information-retrieval task. Diagnosing a patient who walks into a surgery with vague complaints is not.

More critically, the study was not peer-reviewed. It was published on arXiv, a preprint server where papers appear without formal academic vetting. The methodology was not described in detail, and the study's authors were not independent of Babylon Health. The company was, in effect, testing its own product, reporting the results through its own paper, and announcing the conclusions at its own event.

In July 2018, Babylon released a broader evaluation study claiming that its AI diagnostic system performed "on-par with human doctors." This study used simulated clinical vignettes - written descriptions of hypothetical patient presentations - rather than real patients presenting real symptoms in real clinical environments.
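
The vignette protocol itself is easy to sketch, and the sketch shows exactly where it falls short. The records below are invented; real studies use clinician-authored cases with an agreed gold-standard diagnosis:

    from dataclasses import dataclass

    @dataclass
    class Vignette:
        presentation: str    # the written case handed to the system
        gold_diagnosis: str  # the diagnosis the study authors agreed on

    def top1_accuracy(vignettes, predict):
        hits = sum(predict(v.presentation) == v.gold_diagnosis for v in vignettes)
        return hits / len(vignettes)

    # The catch is visible in the inputs: `presentation` is already a
    # clean, complete summary. Real patients give noisy, partial answers
    # to follow-up questions, and this protocol never measures that.
    cases = [Vignette("crushing chest pain radiating to the left arm",
                      "myocardial infarction")]
    print(top1_accuracy(cases, lambda text: "myocardial infarction"))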

The Lancet response

The pushback arrived in November 2018 through a letter published in The Lancet, one of the world's most respected medical journals. Hamish Fraser, an associate professor of medical science at Harvard; Enrico Coiera, a professor of medical informatics; and David Wong, a health informatics lecturer, reviewed Babylon's claims and methodological approach, and their conclusions were not gentle.

The study, they wrote, "does not offer convincing evidence that the Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation." They went further: "there is a possibility that it might perform significantly worse."

Their objections were specific. The sample size was small. The test conditions were artificial. The vignettes were simulated, not based on real patient encounters. The methodology lacked transparency. And the research was conducted by people with a financial interest in the outcome.

Babylon's response to the letter was revealing. The company stated: "As we indicated in our original study, our intention was not to demonstrate or claim that our AI system is capable of performing better than doctors in natural settings." This was a significant retreat from the June press event, where the 81 percent headline had plainly implied to a general audience that the AI was superior to human GPs. The company was, in its formal response to academic critics, walking back the very claim that had generated its media coverage.

Safety concerns from the front lines

The academic debate ran in parallel with practical safety concerns raised by clinicians who actually used the chatbot. David Watkins, an NHS consultant oncologist, recorded examples of the symptom checker failing to recognize serious conditions from obvious symptom descriptions. In one documented case, the chatbot appeared to fail to identify symptoms consistent with a heart attack.

Watkins reported his concerns to the MHRA (Medicines and Healthcare products Regulatory Agency), the UK's medical devices regulator. In response, Dr. Duncan McPherson, the MHRA's clinical director for devices, wrote to Watkins in terms that were unusually direct for a regulator: "Your concerns are all valid and ones that we share."

The MHRA began reviewing the complaints. Safety regulators examining a symptom-checking tool used by millions of NHS patients is not a routine regulatory footnote. It reflected genuine uncertainty, at the regulator level, about whether the tool was safe for the purpose it was being marketed for.

The marketing-to-evidence gap

The core issue with Babylon's MRCGP claim was not that the chatbot scored poorly on a test. It scored well. The issue was the inferential leap from "scored 81 percent on a written exam" to "matches or beats GPs at clinical medicine." These are different capabilities. A medical student who aces their written finals but freezes in their first clinical rotation is not unusual. Written knowledge and clinical judgment are related but distinct skills, and the MRCGP applied knowledge test evaluates only one of them.

Babylon's marketing blurred that distinction. The presentation at the Royal College of Physicians, the media coverage, the comparison to human pass rates - all of it created an impression that the chatbot could do what a GP does, just faster and cheaper. The formal study, when scrutinized by independent researchers, could not support that impression.

This is a familiar pattern in health tech. The incentive structure rewards bold public claims that attract funding, media attention, and user adoption. The scientific process, with its insistence on independent validation, transparent methodology, and peer review, runs on a different timeline and to different standards. When the two collide, the claims get there first and the corrections arrive later.

The eventual trajectory

Babylon's story did not end with the MRCGP controversy, but the controversy was an early indicator of the gap between the company's public statements and its evidence base. In June 2020, a data breach in the GP at Hand app allowed one patient to access video recordings of another patient's consultation - a different kind of failure, but one that fit the broader picture of a company moving faster than its quality controls.

Babylon went public on the New York Stock Exchange in October 2021, reaching a valuation of 4.2 billion dollars. By the summer of 2023, the company had filed for bankruptcy. A former employee, quoted in The Week, described the operation as "smoke and mirrors."

Ali Parsa's original pitch - expanding access to healthcare through AI - was not inherently absurd. Symptom-checking tools can play a useful role in triage. The problem was that Babylon sold its chatbot as something it had not demonstrated it could be: a clinical-grade diagnostic system on par with trained physicians. The MRCGP claim was the purest example of that gap - an exam score presented as proof of clinical competence, accepted by a press corps eager for another AI-beats-the-doctors narrative, and dismantled by independent researchers who applied the standards the company's own study had skipped... ಠ_ಠ
