Canada's $18M tax chatbot gave correct answers a third of the time


Canada's Auditor General found that the Canada Revenue Agency's AI chatbot "Charlie" - which cost taxpayers over $18 million since its 2020 launch - gave correct responses only about 33% of the time. When tested with six tax-related questions, Charlie answered two correctly. Other publicly available AI tools scored five out of six. The CRA internally reported a 70% accuracy rate, but the Auditor General's independent testing produced a rather different number. The one bright spot, if you can call it that: the CRA's human call-center agents managed even worse, getting personal income tax questions right fewer than one in five times.

Incident Details

Perpetrator: Product Manager
Severity: Facepalm
Blast Radius: Millions of Canadian taxpayers potentially received incorrect tax guidance; $18M+ in taxpayer funds spent on a 33%-accurate chatbot.

Filing taxes isn't anyone's idea of a good time, but most people at least expect that the government's official support channels will give them accurate information about how to do it. The Canada Revenue Agency's AI chatbot, affectionately named "Charlie," had a different interpretation of the assignment.

The audit

In October 2025, Canada's Auditor General Karen Hogan released a report examining how well the CRA was serving taxpayers through its various support channels. The findings on Charlie were, to use a technical term, not great.

The Auditor General's team tested Charlie with six tax-related questions - standard queries that a taxpayer might reasonably ask the CRA's official support system. Charlie answered two of them correctly. That's a 33% accuracy rate on questions about the tax system it was specifically built to handle.

For context: the Auditor General's team also posed the same questions to publicly available AI tools - general-purpose systems with no special access to CRA data or tax rules. Those tools answered five out of six correctly. A chatbot purpose-built for Canadian tax questions, running on $18 million in taxpayer funding, was outperformed by tools that weren't designed for the task at all.

The CRA's internal math

The CRA did not share the Auditor General's assessment of Charlie's performance. The agency internally reported that Charlie met a 70% accuracy threshold - more than twice the figure the Auditor General's testing produced. This gap is telling but not unusual in the world of AI chatbot deployments. Internal testing often uses carefully constructed question sets that match the chatbot's training data, while external testing uses the messier, more varied questions that actual humans ask.

The CRA also pointed to a newer version of Charlie that reportedly achieved approximately 90% accuracy in internal testing. That version hadn't been released to the public, so the accuracy couldn't be independently verified. Whatever Charlie could theoretically do in a lab, the version millions of Canadians were actually interacting with was the 33% one.

The $18 million price tag

Charlie launched in 2020, which means Canadian taxpayers had been funding it for five years by the time the audit landed. The total investment exceeded $18 million. For that budget, the CRA got a chatbot that the Auditor General described as providing responses that were "often brief, lacking sufficient context or additional information."

That's a diplomatically devastating sentence. "Brief" and "lacking sufficient context" are auditor-speak for "gives answers that technically contain words related to the question but don't actually help the person asking."

Tax guidance is not a domain where approximate answers are acceptable. If Charlie tells a taxpayer they don't need to report a certain type of income, or that they qualify for a deduction they don't, the taxpayer follows that guidance in good faith and faces potential penalties later. The CRA's own chatbot potentially creating compliance problems for the taxpayers it was supposed to help is a particular kind of institutional failure.

The human baseline comparison

The audit contained a detail that makes the Charlie situation simultaneously worse and more complicated: the CRA's human call-center agents got personal income tax questions right fewer than one in five times - below 20% accuracy.

This puts Charlie's 33% accuracy rate in a genuinely awkward position. The chatbot was bad. The humans were worse. Canada's entire tax support apparatus - both the AI and the human staff - was providing incorrect information to taxpayers at rates that would be alarming in any context, let alone one involving legal obligations and potential financial penalties.

The CRA's response to the audit acknowledged the accuracy problems and indicated it was exploring "further integration of AI and improved training for call center staff." The plan, in other words, was to fix the bad chatbot by using more AI and to fix the undertrained staff by training them better. Whether either intervention would address the root cause - which appeared to be a systemic disconnect between the support systems and the actual tax rules they were supposed to explain - was left as an exercise for the future.

The broader pattern

Charlie isn't the only government chatbot to struggle with accuracy. New York City's MyCity chatbot advised businesses that they could legally discriminate against tenants and force employees to share tips - both violations of existing law. Several California community colleges deployed AI chatbots that gave students incorrect information about financial aid and enrollment. Government agencies seem particularly susceptible to the "deploy the chatbot, celebrate the launch, check the accuracy later" pattern.

The CRA situation is distinctive because of the scale (millions of taxpayers), the stakes (legal tax obligations), the cost ($18M+), and the duration (five years of operation before an independent accuracy check revealed the problem). Charlie wasn't a pilot program that went wrong. It was a production system that ran for half a decade before anyone outside the CRA systematically tested whether it was telling people the right things.

What happens when the tax chatbot is wrong

Tax advice errors have consequences that extend well beyond a frustrating customer service interaction. A taxpayer who follows incorrect CRA guidance could file their return incorrectly, miss a deduction they were entitled to, claim a credit they weren't eligible for, or fail to report income they were legally required to declare. Any of these could result in assessments, penalties, or interest charges.

The CRA's standard position on taxpayer errors is that the taxpayer is ultimately responsible for the accuracy of their return, regardless of what guidance they received. Whether that position holds up if the incorrect guidance came from the CRA's own official chatbot is an open question, but Canadian tax law doesn't generally provide a defense of "the government's chatbot said it was fine."

The Auditor General's report didn't quantify how many taxpayers may have received incorrect guidance from Charlie over its five-year operational history, or what the downstream effects were. Given that the chatbot handled an unspecified but presumably large volume of interactions over that period, and given that it was wrong more often than it was right, the cumulative impact is hard to estimate but difficult to dismiss.

The accuracy measurement problem

One of the audit's more useful contributions was exposing the gap between the CRA's internal accuracy claims and independent reality-testing. The CRA reported 70% accuracy. The Auditor General measured 33%. That discrepancy isn't just a rounding error. It suggests entirely different approaches to defining and measuring what "correct" means for a chatbot.

Internal testing often evaluates whether the chatbot produced a response that was technically related to the question. External testing evaluates whether the response actually answered what the person was asking, with enough accuracy and context to be useful. The difference between "the chatbot said something about capital gains" and "the chatbot correctly explained how capital gains apply to this specific situation" is enormous in practice.
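The effect of grading standards on a reported accuracy number can be shown with a toy scoring sketch. All of the response data below is hypothetical, invented purely for illustration; neither the CRA's internal rubric nor the Auditor General's is public.

```python
# Toy illustration: the same six chatbot responses graded two ways.
# Hypothetical data -- not the audit's actual questions or rubric.

responses = [
    # (topically_related, actually_answers_question)
    (True, True),
    (True, True),
    (True, False),   # mentions the right topic but misses the question
    (True, False),
    (False, False),
    (True, False),
]

def accuracy(graded):
    """Fraction of responses marked correct."""
    return sum(graded) / len(graded)

# Lenient standard: "the response was on-topic."
lenient = accuracy([related for related, _ in responses])

# Strict standard: "the response correctly answered the question."
strict = accuracy([correct for _, correct in responses])

print(f"lenient grading: {lenient:.0%}")  # 83% on this made-up set
print(f"strict grading:  {strict:.0%}")   # 33% on the same responses
```

The same six answers, graded against two different definitions of "correct," produce wildly different headline numbers; no one need be lying for a 70% internal figure and a 33% external figure to coexist.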

The CRA's Departmental Plan for 2025-26 outlined expanded use of machine learning and AI across its operations, including compliance detection. Whether the agency applies more rigorous accuracy standards to its next round of AI deployments than it did to Charlie remains to be seen, but the Auditor General's report made the baseline clear: the existing approach produced a chatbot that was wrong two-thirds of the time, cost $18 million, and ran for five years before anyone checked.