Ontario's approved AI scribes fabricated medical notes in audit testing

Procurement testing did the thing testing is for

On May 12, 2026, Ontario's Auditor General released a special report on artificial intelligence use across the provincial government. One section focused on AI scribes: tools meant to listen to clinical encounters and generate structured medical notes for physicians and other health-care professionals.

The pitch is easy to understand. Doctors spend too much time documenting care. AI scribes promise to reduce clerical load by turning conversations into notes. In a sane deployment, the tool saves time while the clinician remains accountable for the chart. In an unsafe deployment, the tool quietly invents or mangles clinical facts and the human reviewer misses the error because the note looks professionally composed. Medicine already has enough ways to go sideways without autocomplete adding imaginary blood tests to the record.

Ontario's procurement testing used two simulated doctor-patient recordings. Vendors generated notes from the same recordings, and medical professionals from OntarioMD and Ontario Health evaluated whether those notes accurately summarized the encounters. The results were bad in a way that is usefully specific.

All 20 approved vendors showed at least one category of inaccuracy. Nine of 20 systems hallucinated information, including treatment-plan suggestions such as referrals for therapy or orders for blood tests that were not discussed in the recordings. Evaluators also saw notes saying things like no masses were found, or that a patient had anxiety, when those facts had not appeared in the encounter. Twelve of 20 systems captured a different drug than the one prescribed by the doctor. Seventeen of 20 missed key details about patients' mental-health issues in at least one of the tests.

This is exactly why simulated testing exists. Better to discover the wrong-drug note in a procurement file than in a patient's chart after everyone has gone home and the portal has done its little "new clinical note available" shuffle.

Accuracy was treated like a side dish

The audit did not only criticize the tools. It criticized how Ontario scored them.

In the second stage of the request-for-bids process, the "accuracy of medical notes generated" criterion was worth 20 points out of 530, or about 3.8 percent of the total score. A bidder could score zero on accuracy, system security controls, or bias controls and still meet the minimum aggregate score required to be approved as a vendor of record.
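The arithmetic is easy to check, and so is the loophole. The sketch below uses the two figures quoted from the report (20 points out of 530). Everything else is a hypothetical placeholder: the audit figures cited here do not include the actual pass mark or the split of the remaining points, so `HYPOTHETICAL_MIN_AGGREGATE` and the `everything_else` bucket are assumptions made only to illustrate how an aggregate-only cutoff behaves.

```python
# Only these two numbers come from the audit figures quoted above.
ACCURACY_POINTS = 20
TOTAL_POINTS = 530

accuracy_share = ACCURACY_POINTS / TOTAL_POINTS
print(f"accuracy weight: {accuracy_share:.1%}")  # accuracy weight: 3.8%

# Assumption: an illustrative pass mark, NOT the threshold Ontario used.
HYPOTHETICAL_MIN_AGGREGATE = 350


def passes_aggregate_only(scores: dict, minimum: float) -> bool:
    """Approve a bid on total points alone, with no per-criterion floor."""
    return sum(scores.values()) >= minimum


# Illustrative bid: zero on accuracy, strong everywhere else (made-up numbers).
bid = {"accuracy": 0, "everything_else": 400}
print(passes_aggregate_only(bid, HYPOTHETICAL_MIN_AGGREGATE))  # True
```

With any pass mark at or below 510, the 20 accuracy points are never strictly needed: a bid can give them all up and still clear the bar on the other 510.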

That is the kind of spreadsheet decision that looks defensible only while everyone agrees not to imagine the product being used by real people. A clinical note is not an optional feature. It is the record physicians use to remember what happened, communicate with other clinicians, support billing, justify referrals, and make future decisions. If the note records the wrong drug or omits mental-health details, the defect is not buried in a harmless admin field. It sits in the medical record, dressed like fact.

The Auditor General warned that inadequate weighting for accuracy, security, privacy, and bias could lead to selected vendors whose tools produce inaccurate or biased records or lack controls for sensitive personal health information. The report also found documentation gaps. Eleven of the 20 approved vendors did not submit SOC reports, HITRUST certification, or ISO 27001 certification. Five did not submit threat-risk assessments or privacy-impact assessments, even though those documents were supposed to be part of the process.

A manual-review escape hatch

OntarioMD issued guidance telling doctors to manually review AI-generated notes for accuracy. That guidance is sensible, but the audit found that the approved AI scribe systems did not require doctors to attest, through a sign-off feature, that they had reviewed the notes.

That is a control gap. "The doctor should check it" is a policy sentence. A required attestation is a workflow control. The first one lives in a PDF. The second one interrupts the person at the point where the risk happens. Anyone who has worked around production systems knows the difference. A warning that is not wired into the workflow becomes office wallpaper, and office wallpaper has a poor record of preventing data-quality incidents.
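To make the policy-versus-control distinction concrete, here is a minimal sketch of what "wired into the workflow" could look like. It is not how any approved vendor's product works; `DraftNote`, `commit_to_chart`, and `UnattestedNoteError` are hypothetical names invented for this illustration, and a real system would also need authentication, audit logging, and so on.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


class UnattestedNoteError(Exception):
    """Raised when a generated note is committed without clinician sign-off."""


@dataclass
class DraftNote:
    """Hypothetical AI-generated draft awaiting review."""
    text: str
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[datetime] = None

    def attest(self, clinician_id: str) -> None:
        """Record that a named clinician reviewed this draft."""
        self.reviewed_by = clinician_id
        self.reviewed_at = datetime.now(timezone.utc)


def commit_to_chart(note: DraftNote, chart: list) -> None:
    """Append the note to the chart only if it carries an attestation."""
    if note.reviewed_by is None:
        raise UnattestedNoteError("note has no clinician sign-off")
    chart.append(note)


chart = []
draft = DraftNote(text="example draft note")

try:
    commit_to_chart(draft, chart)  # blocked: nobody has signed off yet
except UnattestedNoteError as err:
    print(f"blocked: {err}")

draft.attest("dr-example")  # hypothetical clinician identifier
commit_to_chart(draft, chart)  # now allowed
print(len(chart))  # 1
```

The point of the design is that the unreviewed path fails loudly at the moment of writing to the record, instead of relying on someone remembering the guidance document.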

The Register reported that an Ontario Health Ministry spokesperson told CBC more than 5,000 physicians were participating in the AI scribe program and that there had been no known reports of patient harms associated with the technology. That boundary is important. This story should not be inflated into "AI scribes harmed Ontario patients" unless evidence emerges that they did. The verified story is narrower and still plenty serious: procurement testing found approved clinical-note tools fabricating and misrecording medical information, and broad adoption was already underway.

Why this fits the Graveyard

This incident fits under health, product failure, and public-sector automation because the systems were not random chatbots being tested for fun. They were approved tools for a clinical documentation workflow. The audit showed they could fabricate treatment steps, swap medications, and omit mental-health details in the exact task they were being procured to perform.

It also fits as a procurement failure. Accuracy received a tiny share of the scoring weight. Vendors could clear the process despite weak scores or missing risk documents in areas that should matter for health data. The deeper problem was less the AI mistakes themselves than a purchasing process that had room for them and still produced an approved vendor list.

The better version of this program is not mysterious. Put accuracy and safety near the center of scoring. Require minimum thresholds for clinical-note quality, security, privacy, and bias controls. Test with realistic clinical scenarios. Require an in-product sign-off before generated notes enter the record. Monitor outputs after deployment. Treat hallucination as an expected failure mode, not as a shocking exception that everyone discovers only after a watchdog reads the comments.
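The "minimum thresholds" idea can be sketched in a few lines. This is an illustration under loud assumptions, not a proposal for actual numbers: the 70 percent floors, the criterion point values other than accuracy's 20 of 530, the pass mark, and the function names are all hypothetical.

```python
# Hypothetical floors, as a fraction of each criterion's maximum points.
MIN_FRACTION = {"accuracy": 0.7, "security": 0.7, "privacy": 0.7, "bias": 0.7}


def failed_floors(scores, max_points, floors=MIN_FRACTION):
    """Return the criteria where a bid falls below its required fraction."""
    return sorted(
        name
        for name, floor in floors.items()
        if scores.get(name, 0) < floor * max_points[name]
    )


def approve(scores, max_points, min_aggregate):
    """Approve only if every floor is met AND the aggregate clears the bar."""
    no_failures = not failed_floors(scores, max_points)
    return no_failures and sum(scores.values()) >= min_aggregate


# Only accuracy=20 and the 530 total come from the audit; the rest is made up.
max_points = {"accuracy": 20, "security": 30, "privacy": 30, "bias": 20, "other": 430}
bid = {"accuracy": 0, "security": 30, "privacy": 30, "bias": 20, "other": 400}

print(failed_floors(bid, max_points))  # ['accuracy']
print(approve(bid, max_points, min_aggregate=350))  # False
```

Under an aggregate-only rule the same bid totals 480 points and passes comfortably. With floors in place, a zero on accuracy is disqualifying no matter how many points the bid collects elsewhere.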

AI scribes may still be useful. Clinicians are buried in paperwork, and documentation burden is a real problem. But a useful assistant that writes wrong things in a medical chart is also a liability generator with a dictation feature. Ontario's audit caught that tension early enough to be useful. The procurement system now has to act like the warning meant something.

Vibe Graveyard
