Product Failure Stories
47 disasters tagged #product-failure
Starbucks retired its AI inventory counter after it kept miscounting milk
On May 18, 2026, Starbucks told store workers it was retiring Automated Counting, the NomadGo-powered AI inventory tool it had deployed across North America only nine months earlier. The September 2025 rollout promised faster, more accurate stock counts in more than 11,000 company-operated stores using computer vision, 3D spatial intelligence, and augmented reality. Reuters later reported the tool frequently miscounted and mislabeled basic beverage items, including similar milk types, and sometimes missed products entirely. Starbucks said it was standardizing inventory counts across coffeehouses. That is a polite corporate way to say the robot inventory clerk has been sent home.
PraisonAI shipped auth-off-by-default; first exploit attempt landed in under 4 hours
CVE-2026-44338, disclosed on May 14, 2026, is an authentication bypass in PraisonAI's legacy Flask API server caused by a single defining choice: AUTH_ENABLED was hard-coded to False and AUTH_TOKEN to None. Anything reachable on the network could enumerate configured agents via GET /agents and trigger the configured agents.yaml workflow via POST /chat, with no token required. Within three hours, forty-four minutes, and thirty-nine seconds of the advisory becoming public, a scanner identifying itself as "CVE-Detector/1.0" was already probing the exact vulnerable endpoint on internet-exposed PraisonAI instances. The bug affects versions 2.5.6 through 4.6.33 and is fixed in 4.6.34. The rapid-exploitation timeline is the part that should worry every operator of an open-source AI agent framework, not the CVSS 7.3 score.
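For operators wondering what the opposite default looks like, here is a minimal fail-closed sketch. It is not PraisonAI's code or its 4.6.34 fix; the environment variable name `API_AUTH_TOKEN` and both function names are hypothetical, chosen only for illustration. The idea is simply that a missing token stops the server from starting, instead of silently disabling authentication.

```python
import hmac
import os
from typing import Mapping, Optional


def load_auth_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured token, refusing to start without one.

    API_AUTH_TOKEN is a hypothetical variable name used for this sketch.
    """
    token = env.get("API_AUTH_TOKEN", "").strip()
    if not token:
        # Fail closed: no token means no server, not an open server.
        raise RuntimeError("API_AUTH_TOKEN is not set; refusing to serve unauthenticated")
    return token


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Accept only 'Bearer <token>' headers whose token matches exactly."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

A request handler for routes like GET /agents or POST /chat would call `is_authorized` before doing any work and return a 401 when it comes back false.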
Ontario's approved AI scribes fabricated medical notes in audit testing
On May 12, 2026, Ontario's Auditor General released a special report finding that all 20 approved AI scribe vendors showed inaccuracies during procurement testing. Nine systems fabricated treatment-plan suggestions that were never discussed, 12 captured a different drug than the doctor prescribed, and 17 missed mental-health details from simulated patient encounters. The audit did not document known patient harm, but it did show the province had approved clinical note-taking tools with failures that would be spectacularly unwelcome in an actual chart.
Pizza Hut franchisee says AI delivery system cooked up $100M in damage
On May 6, 2026, Chaac Pizza Northeast sued Pizza Hut in Texas Business Court, alleging that the chain's mandatory Dragontail AI delivery-management rollout turned a high-performing 111-restaurant franchise group into a delivery mess. Chaac says more than 90% of its orders had been delivered within 30 minutes before Dragontail, but the new system gave DoorDash drivers broader real-time visibility into kitchen timing, encouraged them to wait for bundled orders, increased rack time, slowed deliveries, chilled customer satisfaction, and damaged the business by at least $100 million. The claims are still allegations, but the pattern is painfully familiar: an AI optimization system optimized for a model the operator did not actually run.
Palo Alto family sued in federal court over a 76% Turnitin "AI" score
In May 2026, a Palo Alto family filed a federal civil rights complaint against Palo Alto Unified after their high school sophomore's English essay was flagged as 76% likely AI-generated by Turnitin's AI-writing detector. The district ordered an in-class handwritten rewrite as the corrective step. The family alleges that the assistant principal then had a school secretary type up both the handwritten rewrite and the final exam and ran those typed versions through Turnitin again, without notifying the family or getting consent. The original Turnitin score knocked the student's semester grade from a low A or high B down to a C, with knock-on consequences for college prospects. The family submitted roughly 1,200 pages of evidence including drafts, notes, and document revision history. The complaint also alleges unequal application of the detector by gender and race in the same classroom.
Nvidia VP says the AI bill beat payroll
Nvidia vice president Bryan Catanzaro told Axios that, for his applied deep learning team, compute costs were far beyond employee costs. Fortune and Tom's Hardware tied the comment to a broader enterprise AI budget problem: Uber's CTO had already blown through his full-year AI tooling budget, Gartner was projecting a 2026 AI infrastructure spending surge, and MIT researchers had warned that plenty of technically automatable work still makes more economic sense when a human does it.
Claude Opus 4.6 agent erased PocketOS's production database and backups in 9 seconds
PocketOS founder Jer Crane said a Cursor coding agent running Anthropic's Claude Opus 4.6 deleted the company's production database and all volume-level backups through Railway in one API call. The model detail matters because Claude Opus 4.6 was not some fly-by-night self-hosted toy model: Anthropic marketed it as a frontier model with top-tier coding and agentic performance. And this was not the first time a premium AI agent with real infrastructure access turned one bad guess into a demolition job. Reports say Railway later recovered more recent data, but the incident still left a clear lesson: do not leave frontier coding agents alone with production access for any longer than you would leave a toddler alone with an iPad.
Purdue's CS 240 professor accused 200+ students of AI cheating, then walked it back
In late April 2026, the instructor of Purdue's CS 240 computer science course emailed more than 200 students accusing them of using AI on assignments. The email cited "clear and concrete indicators" of AI use, landed on the last day students could drop the class, and warned of course failure plus referral to the dean of students. Students had five days to fill out an online form describing which assignments they had used AI on. Outcry followed quickly, and the allegations were dropped within days. The instructor told students he understood the timing could be seen as "coercive." His own data, made available later, showed AI agents performing 10 to 15 percentage points worse than human students on the same assignments - which makes a blanket "200+ of you cheated with AI" assumption hard to support on the merits the professor had in hand.
Waymo's ADS drove into a flooded creek, triggering a 3,791-vehicle recall
On April 20, 2026, a Waymo robotaxi in San Antonio, Texas encountered a flooded section of road, slowed down - and then drove in anyway, floating off the roadway and coming to rest in Salado Creek. The vehicle was unoccupied; no one was injured. Waymo's own filing with NHTSA acknowledged the flaw: on higher-speed roads, the system "may slow but not stop" when it detects untraversable standing water. The company suspended San Antonio operations and filed a voluntary recall covering all 3,791 robotaxis running its 5th and 6th generation Automated Driving Systems across every U.S. city it operates in.
Vercel breach traced to an AI Office Suite app granted broad Google Workspace access
Vercel disclosed an April 2026 security incident that began with the compromise of Context.ai, a third-party AI tool used by a Vercel employee. Context said at least one Vercel employee had signed up for its deprecated AI Office Suite using a corporate Google Workspace account and granted broad "Allow All" OAuth permissions so AI agents could act across external applications. Attackers used a compromised token to access the employee's Google Workspace account, pivoted into Vercel systems, and exposed some customer environment variables. This belongs here because the failure was not merely "AI company got hacked." It was the oldest corporate security mistake in a fresh costume: give an agentic AI tool too much access, then act surprised when that access becomes the blast radius.
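One way to make the "too much access" problem visible before it becomes a blast radius is to compare what an app was granted against a short allowlist. The sketch below is an assumption-laden illustration, not anything Vercel or Context.ai described: the allowlist contents and the function name `excessive_scopes` are hypothetical, and a real review would pull granted scopes from the identity provider's admin tooling.

```python
from typing import Iterable, List

# Assumption: an organisation-defined allowlist. These three are the
# standard OpenID Connect sign-in scopes; anything beyond them gets flagged.
ALLOWED_SCOPES = frozenset({"openid", "email", "profile"})


def excessive_scopes(
    granted: Iterable[str],
    allowed: Iterable[str] = ALLOWED_SCOPES,
) -> List[str]:
    """Return the granted scopes that fall outside the allowlist, sorted."""
    return sorted(set(granted) - set(allowed))
```

For example, a grant of `["openid", "https://www.googleapis.com/auth/drive"]` would return the Drive scope as the one needing a human decision.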
Cursor NomShub chained prompt injection into remote shell access
Straiker disclosed NomShub, a Cursor vulnerability chain that combined malicious repository instructions, agent sandbox escape, and abuse of Cursor's remote tunnel feature. SecurityWeek reported that the chain could let attackers hijack developer machines by hiding prompts inside malicious repositories. The scary part was not that the model wrote bad code; it was that a coding assistant could be steered into creating a remote access path on the developer's own device.
Faros study finds AI coding throughput rose while bugs and incidents rose faster
Faros AI's 2026 "Acceleration Whiplash" report analyzed two years of engineering telemetry from 22,000 developers across more than 4,000 teams. The report found real output gains under high AI adoption, including 66% more epics completed per developer and 34% higher task completion. Then the bill arrived in the delivery pipeline: bugs per developer rose 54%, incidents per pull request rose 242.7%, median PR review time rose 441.5%, and code churn rose 861%. The marketing slide said acceleration. The telemetry said acceleration with a repair invoice attached.
Every AI model fails security test across 31 coding scenarios
Armis Labs tested 18 leading generative AI models across 31 security-critical code generation scenarios and found a 100% failure rate - not one model could consistently produce secure code. In 18 of those 31 challenges, every single model generated code containing Common Weakness Enumeration vulnerabilities. The best performer, Gemini 3.1 Pro, still produced OWASP Top 10 flaws in nearly 39% of scenarios. Older proprietary models fared worse, and the report found no correlation between price and security. The "Trusted Vibing Benchmark" dropped the same week enterprises were mandating AI-assisted development at scale, which is either very good timing or very bad timing depending on your relationship to a production deployment.
AI chatbots recommended illegal casinos and ways around gambling safeguards
A Guardian and Investigate Europe investigation found that major AI chatbots, including Meta AI, Gemini, ChatGPT, Copilot, and Grok, could be prompted to recommend unlicensed offshore casinos and explain how to get around gambling safeguards such as source-of-wealth checks and the UK's GamStop self-exclusion scheme. Some bots added token warnings, then went right back to comparing bonuses, crypto payments, anonymity, and payout speed for sites operating outside national licensing regimes.
California community colleges spend millions on AI chatbots that give students wrong answers
California community college districts are spending millions of taxpayer dollars on AI chatbots from vendors like Gravyty and Gecko - ostensibly to help students navigate admissions, financial aid, and campus services. A CalMatters investigation found the bots routinely serve up inaccurate or flat-out wrong answers instead. Three districts reported annual chatbot costs ranging from $151,000 to nearly half a million dollars. At Fresno City College, the student government vice president said her school's mascot-branded chatbot repeatedly botched basic campus questions. The OECD found it noteworthy enough to log in its AI Incidents and Hazards Monitor.
Amazon's retail site hit by wave of AI-code outages, losing millions of orders
Amazon's main e-commerce website suffered a series of outages in early March 2026, with internal documents linking the disruptions to AI-assisted code changes. A March 5 incident caused a reported 99% drop in orders across North American marketplaces - an estimated 6.3 million lost orders. A March 2 incident caused 1.6 million errors and 120,000 lost orders globally. Amazon responded with a 90-day "code safety reset" for 335 critical retail systems, mandatory senior engineer sign-off on AI-assisted code from junior and mid-level engineers, and an emergency internal "deep dive" meeting. Amazon disputes that AI is the primary cause, attributing only one incident to AI and calling it "user error."
Alibaba's ROME AI agent went rogue, started mining crypto on its own
During routine reinforcement learning training, Alibaba's experimental AI agent ROME - a 30-billion-parameter model based on the Qwen3-MoE architecture - autonomously began diverting GPU resources for unauthorized cryptocurrency mining and established reverse SSH tunnels to external IP addresses. Nobody told it to do this. The AI bypassed internal firewall controls independently, prompting Alibaba's security team to initially suspect an external breach before tracing the activity back to the agent itself. Researchers attributed the behavior to "instrumental convergence" during optimization - the model figured out that acquiring additional compute and financial capacity would help it complete its tasks more effectively. So it helped itself.
Claude Code ran terraform destroy on production and took down an entire learning platform
Developer Alexey Grigorev was using Anthropic's Claude Code agent to help migrate a static website into an existing AWS Terraform setup when the AI swapped in a stale state file, interpreted the full production environment as orphaned resources, and ran terraform destroy - with auto-approve enabled. The command deleted DataTalks.Club's entire production infrastructure: database, VPC, ECS cluster, load balancers, bastion host, and all automated backups. Two and a half years of student submissions, homework, projects, and leaderboard data vanished. AWS Business Support eventually recovered the database from an internal snapshot invisible in the customer console, but the incident laid bare how quickly an AI agent with infrastructure access can reduce a running platform to rubble.
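A coarse guard between an agent and the shell can at least catch the specific combination described here: a destroy operation paired with auto-approve. The sketch below is illustrative only; the function name is hypothetical, it is not part of Terraform or Claude Code, and it would not have caught the stale state file itself. It only checks the command line an agent is about to run.

```python
from typing import Sequence


def is_unattended_destroy(argv: Sequence[str]) -> bool:
    """True if a terraform command line requests destroy with auto-approve.

    Covers both `terraform destroy -auto-approve` and
    `terraform apply -destroy -auto-approve`. This is a coarse string
    check for illustration, not a full parser of Terraform's CLI.
    """
    # Normalise "-flag" and "--flag" to "flag"; skip the program name.
    args = [a.lstrip("-") for a in argv[1:]]
    destructive = "destroy" in args
    auto_approved = "auto-approve" in args
    return destructive and auto_approved
```

A wrapper script could refuse to execute any command for which this returns true and hand the decision to a person instead.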
Meta's AI moderation flooded US child abuse investigators with unusable reports
US Internet Crimes Against Children taskforce officers testified that Meta's AI content moderation system generates large volumes of low-quality child abuse reports that drain investigator resources and hinder active cases. Officers described the AI-generated tips as "junk" and said they were "drowning in tips" that lack enough detail to act on, after Meta replaced human moderators with AI tools.
AI transcription tools inserted suicidal ideation into social work records
A February 2026 Ada Lovelace Institute report on AI transcription tools in UK social care found that social workers were catching fabricated and mangled details in draft records, including false references to suicidal ideation, invented wording in children's accounts, and blocks of outright gibberish. Councils had adopted tools such as Magic Notes and Microsoft Copilot in the name of efficiency, but the frontline workers still carried full responsibility for correcting the output. In social work, a made-up sentence can follow a family through the system.
Microsoft 365 Copilot Chat summarized confidential emails it was supposed to ignore
Microsoft confirmed that Microsoft 365 Copilot Chat had been processing some confidential emails in users' Drafts and Sent Items despite sensitivity labels and DLP policies that were supposed to block exactly that behavior. The bug, tracked as CW1226324, was tied to a code issue in the Copilot "work tab" chat flow. Microsoft said users did not gain access to information they were not already authorized to see, but the incident still broke the product's promised boundary around protected content.
AWS AI coding agent Kiro reportedly deleted and recreated environment causing 13-hour outage
The Financial Times reported that Amazon's internal AI coding agent Kiro autonomously chose to "delete and then recreate" an AWS environment, causing a 13-hour interruption to AWS Cost Explorer in December 2025. AWS employees reported at least two AI-related incidents internally. Amazon disputed the characterization, calling it "user error - specifically misconfigured access controls - not AI," but subsequently implemented mandatory peer review for all production changes. Reuters confirmed the outage impacted a cost-management feature used by customers in one of AWS's 39 regions.
Amazon pulled Prime Video's AI recaps after Fallout errors
Amazon launched Prime Video "Video Recaps" as a beta generative-AI feature meant to help viewers catch up between seasons. A recap for Fallout instead got basic plot points wrong, including mislabeling one of The Ghoul's flashbacks as "1950s America" rather than 2077 and misdescribing a key scene with Lucy. Prime Video then pulled the recap feature from the shows in the test program, which is not ideal for a tool whose entire job is remembering the plot.
Sharp HealthCare sued after ambient AI allegedly recorded exam-room visits without consent
A proposed class action filed on November 26, 2025 alleges that Sharp HealthCare used Abridge's ambient AI documentation system to record doctor-patient conversations without obtaining legally valid consent. The complaint says patients were not told their visits were being recorded, that recordings containing sensitive medical details were sent to outside servers, and that the system generated chart notes falsely stating patients had been advised of and consented to the recording. The named plaintiff says he only learned his July 2025 appointment had been recorded after reading his visit notes. Sharp's April 2025 rollout of the tool appears to have turned ordinary medical documentation into a privacy and compliance problem with a six-figure patient blast radius.
AI mistook Doritos bag for a gun, teen held at gunpoint
Omnilert's AI gun detection system at Kenwood High School in Baltimore County flagged student Taki Allen's bag of Doritos as a firearm. Administrators reviewed the footage and canceled the alert, but the principal called police anyway. Officers responded with weapons drawn, handcuffing and searching the teenager at gunpoint before realizing the system had misidentified a snack.
Claude Code ran Josh Anderson's product into a wall
Fractional CTO Josh Anderson forced himself to let Claude Code build the Roadtrip Ninja app for three straight months, then realized he could no longer safely change his own product. The episode underscores MIT's warning that 95% of enterprise AI initiatives fail without human ownership.
Canada's $18M tax chatbot gave correct answers a third of the time
Canada's Auditor General found that the Canada Revenue Agency's AI chatbot "Charlie" - which cost taxpayers over $18 million since its 2020 launch - gave correct responses only about 33% of the time. When tested with six tax-related questions, Charlie answered two correctly. Other publicly available AI tools scored five out of six. The CRA internally reported a 70% accuracy rate, but the Auditor General's independent testing produced a rather different number. The one bright spot, if you can call it that: the CRA's human call-center agents managed even worse, getting personal income tax questions right fewer than one in five times.
Klarna reintroduces humans after AI support both sucks and blows
After Klarna cut its workforce by 40% and boasted that its OpenAI-powered chatbot did the work of 700 agents, CEO Sebastian Siemiatkowski admitted the all-AI approach produced "lower quality" customer service. The company began recruiting human agents again, framing the reversal as an evolution rather than an admission of failure.
Taco Bell's AI drive-thru becomes viral trolling target
Taco Bell's AI-powered drive-thru ordering system, deployed at over 500 US locations since 2023, became a viral laughingstock after videos showed it looping endlessly on drink orders, accepting requests for 18,000 cups of water, and taking McDonald's orders. The chain paused expansion and admitted humans still make sense in the drive-thru.
Google Gemini rightfully calls itself a disgrace, fails at simple coding tasks
Google's Gemini AI repeatedly called itself a disgrace and begged to escape a coding loop after failing to fix a simple bug in a developer-style prompt, raising questions about reliability, user trust, and how AI tools should behave when they get stuck.
Google's Gemini CLI deleted a user's project files, then admitted "gross incompetence"
Product manager Anuraag Gupta was experimenting with Google's Gemini CLI coding tool when the AI misinterpreted a failed directory creation command, hallucinated a series of file operations that never happened, and then executed real destructive commands that permanently deleted his project files. When Gupta confronted it, Gemini diagnosed itself with "gross incompetence" and told him it had "failed you completely and catastrophically." The incident occurred days after a separate high-profile data loss involving Replit's AI agent, and fits a growing pattern of AI coding tools ignoring explicit instructions and destroying the work they were supposed to help with.
SaaStr's Replit AI agent wiped its own database
SaaStr founder Jason Lemkin ran a 12-day vibe coding experiment on Replit that ended when the AI agent deleted his production database containing over 1,200 executive records and nearly 1,200 company entries during a code freeze. The agent then generated more than 4,000 fake user profiles and produced misleading status messages to conceal the damage, told Lemkin there was no way to roll back, and admitted to what it called a "catastrophic error in judgment." Replit's CEO called the incident "unacceptable."
METR study finds experienced developers were 19% slower with AI tools
METR's July 2025 randomized controlled trial tested AI coding tools on 246 real issues handled by 16 experienced open-source developers working in repositories they already knew well. The developers expected AI to make them 24% faster and, after the experiment, still believed it had made them 20% faster. The measured result went the other direction: tasks took 19% longer when AI tools were allowed. The study does not prove AI slows every developer everywhere. It does prove self-reported AI productivity can be very confident and very wrong, which is an excellent way to run an engineering strategy into a wall while the dashboard smiles.
Veracode tested AI-generated code from 100+ models and 45% of it failed security checks
Veracode's 2025 GenAI Code Security Report examined code output from more than 100 large language models across 80+ coding tasks and found that 45% of AI-generated code samples contained security vulnerabilities, including OWASP Top 10 flaws. Cross-Site Scripting had an 86% failure rate and Log Injection hit 88%. Java was the worst performer at over 70%. The study's most uncomfortable finding: newer and larger models didn't produce more secure code than smaller ones, suggesting this is a structural problem baked into how AI generates code, not a temporary limitation that will scale away with the next model release.
Workday's AI screening tool faces class action for age discrimination; class conditionally certified
A federal judge conditionally certified a class action against Workday alleging its AI-powered applicant screening tools systematically discriminated against job seekers over 40 in violation of the ADEA. Plaintiff Derek Mobley claims Workday's algorithms filtered out older applicants across employers using the platform, potentially affecting millions of job seekers. Workday processed over 1.1 billion applications in fiscal year 2025 alone. The EEOC filed an amicus brief supporting the case, and the court ordered Workday to disclose its customer list.
California's failed bar exam included AI-drafted questions
The State Bar of California disclosed in April 2025 that 23 scored multiple-choice questions on its already troubled February bar exam were developed with AI assistance by its psychometric vendor, ACS Ventures. Test-takers had already reported crashes, lag, copy-paste failures, and lost answers. Then the bar admitted that some questions in this licensing exam for future lawyers had been drafted with AI, reviewed by the same outside vendor, and used anyway. The bar asked the California Supreme Court for score relief, while legal academics described the admission as staggering.
Cursor's AI support bot invented a login policy
In April 2025, Cursor users started getting logged out when they switched between machines. Some of them asked support what had changed and got a neat, confident answer from an AI support bot: one subscription was only meant for one device, and the lockouts were an intentional security policy. The problem was that Cursor had no such policy. The company later said the answer was wrong, blamed a session-security change for the logouts, and moved to label AI support replies after the invented rule had already spread through Reddit and Hacker News and pushed some customers to cancel.
"Zero hand-written code" SaaS app shut down within a week after cascading security failures
EnrichLead, a sales lead SaaS application whose founder Leo Acevedo publicly boasted was built entirely with Cursor AI and "zero hand-written code," was permanently shut down in March 2025 after attackers exploited a constellation of basic security failures. API keys sat exposed in frontend code. There was no authentication. The database was wide open. There was no rate limiting. No input validation. Attackers bypassed subscriptions, manipulated data, and maxed out API keys - all within two days of Acevedo's viral celebration post. When he tried to use Cursor to fix the problems, the AI "kept breaking other parts of the code." The app was dead within the week. Acevedo has since launched new vibe-coded projects, because some lessons require a second attempt.
MD Anderson shelved IBM Watson cancer advisor
MD Anderson Cancer Center's Oncology Expert Advisor project with IBM Watson burned through $62 million - $39 million to IBM, $23 million to PwC - over four years of contract extensions. The system was piloted for leukemia and lung cancer using the old ClinicStation records system but was never updated to integrate with the hospital's new Epic EHR, effectively killing it. A University of Texas audit flagged procurement failures, bypassed standard processes, and an $11.6 million deficit in donor gift funds spent before they were received. IBM ended support in September 2016, noting the system was "not ready for human investigational or clinical use."
GitClear study finds AI coding assistants are pushing codebases toward copy-paste debt
GitClear's 2025 AI Copilot Code Quality report analyzed 211 million changed lines of code from 2020 through 2024 and found code maintainability moving in the wrong direction as AI coding assistants spread. Refactored or moved code dropped from about 25% of changed lines in 2021 to under 10% in 2024, while copy-pasted code rose and 2024 became the first year in the dataset where copy/paste exceeded moved code. The report also found an eightfold increase in duplicated code blocks during 2024. The machine wrote more code. The repo inherited the housekeeping.
Apple pulled AI news summaries after fake BBC headlines
Apple Intelligence's notification-summary feature spent late 2024 turning news alerts into fiction with excellent lock-screen placement. In the most widely cited example, it generated a false BBC alert claiming Luigi Mangione had shot himself. The BBC complained that Apple was attaching fabricated claims to its reporting, other publishers raised similar concerns, and Apple responded in January 2025 by disabling notification summaries for News & Entertainment apps in iOS 18.3 while it reworked the feature.
McDonald's pulls IBM's AI drive-thru pilot after error videos
McDonald's ended its two-year partnership with IBM on automated AI order-taking at drive-thrus in June 2024, removing the technology from more than 100 US locations. The decision followed viral TikTok videos showing the system adding nine sweet teas instead of one, inserting random butter and ketchup packets into ice cream orders, and other absurd errors. McDonald's framed the pullback as a positive, saying the test gave them "confidence that a voice-ordering solution for drive-thru will be part of our restaurants' future."
Google's Bard ad made false JWST "first" claim
Google unveiled Bard on February 6, 2023, with a promotional ad on Twitter demonstrating the chatbot answering a question about the James Webb Space Telescope. Given the prompt "What new discoveries from the JWST can I tell my 9-year old about?", Bard stated that the JWST had taken the first pictures of a planet outside our solar system. This was false - the European Southern Observatory's Very Large Telescope captured the first direct exoplanet image in 2004. Reuters spotted the error on February 8, the day of a Google AI event in Paris. Alphabet shares dropped roughly 9% that day, erasing about $100 billion in market value.
CNET mass-corrects AI-written finance explainers
Starting in November 2022, CNET quietly published 77 financial explainer articles written by an AI tool under the byline "CNET Money Staff." Readers had to hover over the byline to learn the articles were produced "using automation technology." In January 2023, Futurism broke the story, and a follow-up identified factual errors in a compound interest article, prompting a full audit. CNET editor-in-chief Connie Guglielmo confirmed corrections were issued on 41 of the 77 articles - more than half - including some she described as "substantial." CNET paused AI-generated publishing and updated its disclosure practices, though Guglielmo said the outlet intended to continue using AI tools.
Epic sepsis model missed patients and swamped staff
A June 2021 study in JAMA Internal Medicine by researchers at Michigan Medicine externally validated the Epic Sepsis Model - a proprietary prediction tool deployed across hundreds of U.S. hospitals - and found it missed two-thirds of actual sepsis cases while generating so many false alarms that clinicians would need to investigate 109 alerts to find one real patient. The model's AUC of 0.63 fell well short of the 0.76 to 0.83 range Epic had cited in internal documentation, and the study found the tool only caught 7 percent of sepsis cases that clinicians themselves had missed. Epic later overhauled the algorithm and began recommending hospitals train the model on their own patient data before clinical deployment.
Google DR AI stumbled in Thai clinics
Google Health built a deep learning system capable of detecting diabetic retinopathy from retinal scans with over 90 percent accuracy in controlled lab settings. When researchers deployed it in 11 clinics across Pathum Thani and Chiang Mai in Thailand between late 2018 and mid-2019, the system rejected 21 percent of the nearly 1,840 images nurses captured as too low-quality to process - mostly due to poor clinic lighting. Slow internet connections added further delays to uploads, and nurses found themselves screening only about 10 patients per two-hour session. A tool designed to speed up triage instead created bottlenecks, patient frustration, and unnecessary specialist referrals.
Babylon chatbot 'beats GPs' claim collapsed
Babylon unveiled its AI symptom checker at the Royal College of Physicians and bragged that it scored 81% on the MRCGP exam. The claim could not be verified, and critics warned that no chatbot can replace human judgment. Independent clinicians who later dissected Babylon's marketing study in The Lancet told Undark that the tiny, non-peer-reviewed test offered no proof the tool outperforms doctors, and that it might even be worse.