JAMA study: FDA-cleared AI medical devices get recalled fast, and most were never clinically tested

What the study found

Most of the AI incidents that end up in a graveyard like this one are single events: one chatbot, one wrong answer, one hospital, one bad day. This one is different. It is a study, and the thing it documents is not a single failure but a structural pattern in how AI gets into the hands of doctors and onto patients in the first place.

In August 2025, the journal JAMA Health Forum published "Early Recalls and Clinical Validation Gaps in Artificial Intelligence-Enabled Medical Devices," a paper led by Tinglong Dai of Johns Hopkins, with co-authors spanning Johns Hopkins (including the Bloomberg School of Public Health) and Yale School of Medicine. The researchers pulled the full set of AI-enabled medical devices the U.S. Food and Drug Administration had authorized through November 2024 - 950 of them - and then went looking at what happened after authorization.

The headline numbers:

60 of the 950 devices were tied to 182 recall events. That is a device-level recall rate of about 6.3% (60 of 950), with roughly three recall events per recalled device.
About 43% of those recalls happened within the first year of the device being authorized. For comparison, that is roughly double the early-recall rate of conventional 510(k)-cleared devices. The problems were not slow-burning; they showed up fast.
The recall events were tied to over 1.7 million units in the field, so this is not a rounding-error volume of affected hardware and software.
The most common reason for recall was diagnostic or measurement error - the device producing incorrect or inconsistent results. That category alone accounted for more than a hundred of the recalls and the large majority of affected units.
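
The arithmetic behind those headline figures is easy to check. The sketch below is a minimal illustration using only the numbers quoted above; the variable and function names are made up for this example, and the first-year count is an approximation derived from the rounded "about 43%" figure, not a number taken from the paper.

```python
# Figures quoted in the article; names here are illustrative only.
devices_authorized = 950
devices_recalled = 60
recall_events = 182
early_recall_share = 0.43  # "about 43%", so anything derived from it is approximate


def recall_rate(recalled: int, total: int) -> float:
    """Share of authorized devices tied to at least one recall."""
    return recalled / total


device_recall_rate = recall_rate(devices_recalled, devices_authorized)
events_per_recalled_device = recall_events / devices_recalled
# Approximation from a rounded percentage, not a figure reported by the study.
approx_early_recalls = round(early_recall_share * recall_events)

print(f"Device-level recall rate: {device_recall_rate:.1%}")
print(f"Recall events per recalled device: {events_per_recalled_device:.1f}")
print(f"Approximate recalls within the first year: {approx_early_recalls}")
```

Note that the 6.3% is counted per device, not per recall event: the 182 events are spread across the 60 recalled devices, so a single device can account for several recalls.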

And the finding the authors put at the center of the paper: the recalled devices had, overwhelmingly, never been put through a clinical validation study before reaching the market. Devices that lacked clinical validation were recalled significantly more often than devices that had been validated. Put plainly, the devices that nobody tested on actual patients before shipping were the ones most likely to be pulled back for being wrong.

Why "diagnostic or measurement error" is the scary category

It is worth slowing down on what "diagnostic or measurement error" means in this context, because it is easy to read past as bureaucratic phrasing.

An AI-enabled medical device, in this dataset, is typically something that looks at medical data and tells a clinician something about it. It might flag a suspected stroke or large-vessel occlusion on a CT scan, measure a cardiac output, screen a retinal image for diabetic eye disease, or estimate some quantity that a doctor then acts on. The whole pitch is that the AI sees something, or measures something, more reliably or more quickly than the existing workflow.

When such a device produces an incorrect or inconsistent result, the failure mode is not "the screen looks weird." It is that a clinician may be told a measurement that is wrong, or shown a "nothing to see here" on a scan that actually contains a life-threatening finding, or handed a flag on a scan that is clean. The downstream consequences are the ones medicine spends its entire institutional life trying to avoid: a delayed treatment, a missed diagnosis, an intervention based on a number that was never right. The study frames the recalls precisely this way - errors capable of delaying care or missing serious conditions. That is the difference between a recalled toaster and a recalled diagnostic algorithm.

The validation gap, and why it exists

The most useful thing this study does is point at the mechanism rather than just the body count. The reason so many recalled devices had never been clinically validated is not negligence by any single company. It is the pathway most of them came through.

In the United States, a large share of medical devices reach the market via the FDA's 510(k) clearance route. The core idea of 510(k) is "substantial equivalence": if your device is meaningfully similar to a device already legally on the market (a "predicate"), you can clear it by demonstrating that equivalence, rather than by running fresh prospective clinical trials proving it works on patients. For a lot of traditional hardware, that is a reasonable shortcut - a new blood-pressure cuff does not need to re-prove the concept of blood pressure.

The trouble is that an AI model is not a blood-pressure cuff. Its behavior depends on the data it was trained on, the population it is deployed against, the imaging equipment feeding it, and a dozen other things that can drift between the lab and the clinic. "Substantially equivalent to a predicate" is a weak guarantee that an algorithm will actually perform on the patients in front of it. Yet the 510(k) pathway largely does not require prospective human testing for clearance, which is how a device that was never validated on real patients can still arrive in a hospital with an FDA authorization stamped on it. The AHA's writeup of the study makes the same point: 510(k) clearance does not demand prospective clinical testing, so AI devices can enter the market with thin clinical evidence. The study's data then shows the predictable result - those are the devices that get recalled, and recalled early.

There is also an industry-structure wrinkle the authors surface. Publicly traded companies accounted for only about 53% of the AI devices in the dataset but for the overwhelming majority of recall events and recalled units. The paper does not paint the big public firms as cartoonishly reckless so much as it shows that scale concentrates the blast radius: when a large vendor ships an under-validated device across many sites and many units, a single defect becomes a very large recall.

How confident should you be in this?

This is a study rather than a single smoking-gun incident, so its claim is a statistical one. What the data establishes is a strong association: AI devices that lack clinical validation are recalled more often, and earlier, than those that have it. Association is not the same as a controlled proof that skipping validation directly caused each recall, and the authors are working from FDA recall records, which capture problems serious enough to trigger a recall and not the full distribution of quiet, unrecalled errors. The real-world rate of patients harmed is not something this paper measures; it measures recalls, which are an upstream signal.

But that uncertainty cuts toward more concern, not less. Recalls are the failures visible enough and severe enough to force corrective action. Whatever larger pool of quieter errors exists underneath, these are the ones that broke the surface. A 6.3% recall rate with 43% of recalls inside the first year, concentrated in devices that were never clinically tested, is not a reassuring picture even before you ask about the errors that never rose to the level of a recall.

What it teaches

None of this means AI medical devices are bad. Plenty of them are useful, and a recall is in some sense the system catching a problem. The problem is the order of operations.

The seductive promise of AI in medicine is speed: faster reads, faster screening, faster clearance to market. This study is a measurement of what happens when the speed extends to the part where you are supposed to check that the thing works on real patients. The devices that skipped clinical validation did not, on the evidence, turn out fine and save everyone time. They turned out to be the ones disproportionately yanked back, fast, for producing wrong answers in a domain where wrong answers delay treatment and miss disease.

The authors' recommendation follows directly from the data: more clinical validation before these devices reach patients, and post-market surveillance that watches them in the field the way drug-safety systems watch medications. That is an unglamorous conclusion. It is also the entire point. The graveyard is full of AI systems that were shipped on the assumption that they would be validated later, or in production, or by the users. In medicine, "validated by the users" means "validated on patients," and this study is a count of how often that bill came due.
