Google's diabetic retinopathy AI stumbled in Thai clinics
Google Health built a deep learning system capable of detecting diabetic retinopathy from retinal scans with over 90 percent accuracy in controlled lab settings. When researchers deployed it in 11 clinics across Pathum Thani and Chiang Mai in Thailand between late 2018 and mid-2019, the system rejected 21 percent of the 1,838 images nurses captured as too low-quality to process - mostly due to poor clinic lighting. Slow internet connections added further delays to uploads, and nurses found themselves screening only about 10 patients per two-hour session. A tool designed to speed up triage instead created bottlenecks, patient frustration, and unnecessary specialist referrals.
Diabetic retinopathy is the fastest-growing cause of blindness worldwide. The condition occurs when high blood sugar damages blood vessels in the retina, leading to bleeding and swelling that can permanently destroy vision if left untreated. Roughly 415 million people with diabetes are at risk. Detection is straightforward in principle - a trained specialist examines photographs of the retina - but the bottleneck is access to that specialist, particularly in rural or under-resourced areas where the wait between photo and diagnosis can stretch from days to weeks.
Google Health set out to close that gap. Its deep learning system could analyze retinal scans and flag signs of diabetic retinopathy with over 90 percent accuracy in lab conditions, performing at a level the company described as equivalent to a medical specialist. The system could return a result in under 10 minutes. On paper, it was a straightforward improvement: replace the multi-day specialist review with an instant algorithmic assessment, letting nurses triage patients on the spot.
The field trial told a different story.
The deployment
In partnership with Thailand's Ministry of Public Health, Google Health researchers deployed the deep learning system at 11 clinics across the provinces of Pathum Thani and Chiang Mai between November 2018 and August 2019. The study, led by Emma Beede and colleagues at Google and published at the CHI 2020 conference in April 2020, was one of the first published evaluations examining how a deep learning health tool actually performs when real nurses use it on real patients in real clinical settings.
Researchers made regular visits to each clinic over the eight-month period, observing how diabetes nurses handled eye screenings and interviewing them about their experiences. Patients who agreed to participate were medically supervised during the study.
The clinics were typical of community-level healthcare in Thailand. Nurses, not ophthalmologists, performed the retinal photography. The photos were then fed into the AI system, which returned one of three results: positive (signs of diabetic retinopathy detected), negative, or ungradable (image quality too poor to assess).
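The three-way output and its routing under the initial protocol can be sketched as follows. The names and function are hypothetical illustrations; only the mapping itself - positive and ungradable results both triggering a specialist referral, as described later in this account - comes from the study:

```python
from enum import Enum

class ScreeningResult(Enum):
    POSITIVE = "positive"      # signs of diabetic retinopathy detected
    NEGATIVE = "negative"      # no signs detected
    UNGRADABLE = "ungradable"  # image quality too poor to assess

def triage(result: ScreeningResult) -> str:
    """Map a screening result to the next step under the original protocol."""
    if result is ScreeningResult.NEGATIVE:
        return "routine annual follow-up"
    # Positive results - and, under the original protocol, ungradable
    # images as well - sent the patient to an ophthalmologist.
    return "refer to ophthalmologist"

print(triage(ScreeningResult.UNGRADABLE))  # refer to ophthalmologist
```

Note that under this scheme an unreadable photo and a genuine positive finding produce the same downstream action, which is what drove the referral cascade described below.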
The lighting problem
The most immediate failure point was image quality. Of the 1,838 retinal photographs captured during the study, 21 percent were graded by the system as too low-quality to process. The primary culprit was lighting. Clinic examination rooms varied widely in their setup, and many lacked the controlled lighting conditions that the algorithm expected.
Poor lighting had always been a background factor for nurses taking retinal photos. Under the old workflow, a specialist reviewing the images could often interpret a slightly dim or blurred photograph using clinical judgment and context from the patient's records. The algorithm could not. It applied a strict quality threshold, and anything below that threshold was rejected with an "ungradable" result.
"Poor lighting conditions had always been a factor for nurses taking photos, but only through using the deep learning system did it present a real problem, leading to ungradable images and user frustration," the research team wrote in their findings.
The frustration was tangible. Nurses would photograph a patient's retina, wait for the system to process it, receive an ungradable result, adjust the lighting or angle, retake the photo, and try again. Each cycle consumed time that the system was supposed to save.
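That cycle behaves like a retry loop, which a minimal sketch makes concrete. The camera, the grading call, and the simulated dim-room failures below are all stand-ins, not the actual system:

```python
def capture_with_retries(capture, grade, max_attempts=3):
    """Photograph the retina, submit for grading, retake on 'ungradable'.

    `capture` and `grade` stand in for the camera and the cloud model;
    every failed attempt costs a retake plus another upload round trip.
    """
    for attempt in range(1, max_attempts + 1):
        image = capture()
        result = grade(image)
        if result != "ungradable":
            return result, attempt
    return "ungradable", max_attempts

# Simulate a dim exam room where the first two photos fall below the
# model's quality threshold.
shots = iter(["dim", "dim", "ok"])
result, attempts = capture_with_retries(
    capture=lambda: next(shots),
    grade=lambda img: "negative" if img == "ok" else "ungradable",
)
print(result, attempts)  # negative 3
```

Each extra pass through the loop is time the system was supposed to save, and a patient who never produces a gradable image still exits with "ungradable" after the final attempt.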
The connectivity problem
The second failure point was infrastructure. The AI model ran in the cloud, meaning each retinal image had to be uploaded over the internet for processing. Internet speeds at the clinics were often slow, turning what should have been a near-instant assessment into a drawn-out wait. One clinic worker estimated they could screen only about 10 patients in a two-hour window - a pace that made the AI-assisted workflow slower, in practice, than what some clinics achieved without AI assistance.
This was not a failure of the algorithm itself. The model's accuracy in ideal conditions was genuine. But the deployment assumed reliable connectivity that the deployment sites could not provide. A system designed for bandwidth-rich environments was placed in bandwidth-constrained ones in rural Thailand, and nobody had solved the gap before going live.
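A back-of-envelope model shows how the uplink came to dominate throughput. All timings below are invented for illustration - the paper reports the observed pace (roughly 10 patients per two hours), not these inputs:

```python
def patients_per_session(session_min, capture_min, upload_min,
                         inference_min, retakes=0):
    """Patients screenable in one session; each retake repeats
    the capture and upload steps before a final inference result."""
    per_patient = (capture_min + upload_min) * (1 + retakes) + inference_min
    return int(session_min // per_patient)

# Hypothetical fast clinic link vs slow rural link with one retake.
print(patients_per_session(120, 2, 1, 1))             # 30
print(patients_per_session(120, 2, 8, 1, retakes=1))  # 5
```

The point of the sketch is that upload time multiplies with retakes: a slower link combined with a single ungradable-and-retake cycle can cut throughput by a factor of several, even though the model's inference step itself is unchanged.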
The referral cascade
When the system returned an ungradable result, the original protocol called for an automatic referral to an ophthalmologist. This made a certain kind of sense from a safety perspective: if the AI cannot determine whether a patient has diabetic retinopathy, err on the side of caution and send them to a specialist.
In practice, it meant that patients with nothing wrong were being told they needed to see an eye doctor. In rural Thailand, that often meant traveling to a distant facility, missing work, and incurring costs that many families could not easily absorb. The anxiety of receiving what felt like a preliminary diagnosis - even though the system was simply saying "I can't tell from this image" - added a psychological burden on top of the logistical one.
The researchers recognized this problem partway through the study and amended the protocol. Instead of automatic referrals for ungradable images, an eye specialist would review the ungradable photograph alongside the patient's medical records and determine whether a referral was actually warranted. The change reduced unnecessary travel and false positive anxiety, but it also reintroduced the specialist review step that the AI was supposed to eliminate.
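The amended routing can be sketched as follows, with the specialist review modeled as a hypothetical callback (all names are illustrative):

```python
def next_step(result, specialist_review):
    """Amended protocol: ungradable images go to a human reviewer
    rather than straight to a referral. `specialist_review` stands in
    for an eye specialist examining the photo plus the patient's
    medical records, returning True if a referral is warranted."""
    if result == "positive":
        return "refer to ophthalmologist"
    if result == "negative":
        return "routine annual follow-up"
    # Ungradable: a specialist decides, using the photo and records.
    if specialist_review():
        return "refer to ophthalmologist"
    return "routine annual follow-up"

print(next_step("ungradable", specialist_review=lambda: False))
# routine annual follow-up
```

The structural trade-off is visible in the code: the ungradable branch no longer auto-refers, but it now blocks on exactly the kind of human review step the system was meant to remove.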
The workflow mismatch
The broader issue was one of workflow design. The deep learning system was built under assumptions about the clinical environment that did not hold in the field. It assumed consistent image quality (which required consistent lighting and camera handling). It assumed reliable internet (which required infrastructure upgrades that had not been made). It assumed that an ungradable result was a rare exception rather than a one-in-five occurrence.
Each of these assumptions was reasonable in a lab setting. None of them were validated in the deployment environment before the system went live. The result was a tool that introduced new failure modes into a workflow that, while slow, was at least predictable. Nurses had been taking retinal photos and sending them to specialists for years. That process had its own problems - primarily the delay in getting a specialist opinion - but it did not produce ungradable results, unnecessary referrals, or connectivity-dependent bottlenecks.
"Despite being designed to reduce the time needed for patients to receive care, the deployment of the system occasionally caused unnecessary delays for patients," the team acknowledged.
The response
Google's research team did not try to bury the findings. The CHI 2020 paper was refreshingly candid about what went wrong and why. The team recommended that AI health tools be evaluated in real clinical environments before deployment, that environmental variables like lighting be explicitly planned for, and that system design account for the full range of conditions the tool would encounter.
Hamid Tizhoosh, an AI-in-medicine researcher at the University of Waterloo, praised the study's honesty. "This is a crucial study for anybody interested in getting their hands dirty and actually implementing AI solutions in real-world settings," he said. He noted that the study was a timely reminder that lab accuracy is just the first step.
The study also identified a genuine benefit: when the system did work - when the image was graded and a clear result returned - nurses reported feeling more confident in their assessments. Real-time positive screenings allowed for quicker referrals than the old send-and-wait process. The AI was not useless. It was useful under conditions that the deployment sites frequently could not provide.
The broader pattern
The Thailand deployment is a textbook case of an AI system that performs well on the metric it was optimized for (accuracy on high-quality images) while failing on the operational context it was placed into. The model did not get dumber in the field. The field was simply different from the lab in ways the development process had not accounted for.
This gap between controlled evaluation and real-world deployment is one of the most consistent patterns in clinical AI. Accuracy on curated datasets, measured under standardized conditions against expert-labeled ground truth, does not predict how a system will behave when the lighting is bad, the internet is slow, the camera operator is a nurse with twelve other responsibilities, and the patient is anxious. Each of those variables introduces friction that pure accuracy metrics do not capture.
Google's team published the friction rather than suppressing it. The paper became a widely cited cautionary reference for teams working on healthcare AI deployment - a rare instance where the gap between the lab and the clinic was documented by the same organization that built the tool.