Washington Post launched an AI podcast that failed its own quality tests at rates up to 84%
The Washington Post launched "Your Personal Podcast," an AI-generated audio news product, in December 2025 despite internal testing showing that between 68% and 84% of AI-generated scripts failed to meet the publication's editorial standards across three rounds of evaluation. The AI fabricated quotes from public figures, misattributed statements, mispronounced names, and inserted its own editorial commentary as if it were the Post's position. The internal review concluded that "further small prompt changes are unlikely to meaningfully improve outcomes without introducing more risk." The product team recommended launching anyway. Post editors revolted, with one writing in Slack that it was "truly astonishing that this was allowed to go forward at all."
The Product
"Your Personal Podcast" was the Washington Post's entry into AI-generated audio content. The product let users customize news podcasts by selecting topics, choosing a host style, and specifying episode length. The AI would then generate a script from Post journalism and produce an audio podcast. It was aimed at younger audiences and framed as an expansion of the Post's digital offerings.
The concept made sense on paper. Podcast listening is growing. Personalization is popular. The Post has a large archive of reporting. An AI that could repackage that reporting into customized audio summaries could, in theory, reach audiences who don't read longform journalism but might listen to a ten-minute podcast on their commute.
The reality was rather different from the theory.
The Testing
Before launch, the Post ran internal quality evaluations on the AI-generated scripts. These tests measured whether the output met the publication's editorial standards: accuracy, attribution, tone, and factual fidelity. The scripts were evaluated in three separate rounds.
Semafor's Max Tani obtained the internal review and reported the findings on December 11, 2025: between 68% and 84% of scripts failed to meet quality standards. Not "needed minor tweaks." Failed. Across three rounds of testing, the best the AI could achieve was a 32% pass rate.
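For concreteness, the arithmetic behind those figures: a 68% failure rate in the best round implies a 32% pass rate. The Post's actual evaluation rubric, tooling, and raw counts are not public, so the sketch below is purely illustrative; the per-round results are placeholders chosen to match the reported 68% and 84% endpoints, and the middle round is an assumption.

```python
# Hypothetical sketch of how per-round pass/fail rates like those in the
# internal review are computed. The numbers below are placeholders: only
# the 68% and 84% endpoints were reported; the middle round is invented
# for illustration.

def failure_rate(results: list[bool]) -> float:
    """Fraction of scripts that failed review (True = failed)."""
    return sum(results) / len(results)

# Each list simulates one evaluation round: True means a script failed
# at least one editorial standard (accuracy, attribution, tone, fidelity).
rounds = {
    "round_1": [True] * 84 + [False] * 16,  # 84% failure (reported upper bound)
    "round_2": [True] * 75 + [False] * 25,  # illustrative middle round (assumption)
    "round_3": [True] * 68 + [False] * 32,  # 68% failure (reported lower bound)
}

for name, results in rounds.items():
    rate = failure_rate(results)
    print(f"{name}: {rate:.0%} failed, {1 - rate:.0%} passed")
# Best observed pass rate: 100% - 68% = 32%, matching the review's figure.
```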
The review's conclusion was blunt: "Further small prompt changes are unlikely to meaningfully improve outcomes without introducing more risk."
This is the kind of internal assessment that typically precedes a project being shelved or substantially reworked. Instead, the product team recommended launching.
The Errors
The AI's failures were not subtle technical hiccups. The system fabricated quotes, attributing invented statements to real public figures and presenting them as genuine Post reporting. It misattributed real quotes, assigning them to the wrong sources. It mispronounced names. And it inserted editorial commentary, sometimes presenting a source's quoted opinion as the Post's own institutional position, a violation of foundational journalism norms at any news organization.
One particularly damaging pattern: the AI would take quotes from sources in Post articles and reframe them as the Post's own editorial stance. For a news organization that depends on the distinction between reporting and opinion, having an AI product blur that line at scale was not a cosmetic problem. It was the kind of error that, if a human journalist made it repeatedly, would result in disciplinary action.
The Post's head of standards, Karen Pensiero, sent a message to staff acknowledging that the errors had been "frustrating for all of us." This was diplomatic phrasing.
The Internal Reaction
Not everyone at the Post was as measured as Pensiero. Semafor obtained Slack messages from Post staff that reflected something closer to outrage.
One editor wrote: "It is truly astonishing that this was allowed to go forward at all. Never would I have imagined that the Washington Post would deliberately warp its own journalism and then push these errors out to our audience at scale."
That message captures two distinct criticisms. The first is about the errors themselves: an AI product that fabricates quotes and misattributes reporting is actively damaging the journalism it's supposed to serve. The second is about the decision to launch anyway. The internal testing showed failure rates between 68% and 84%. The review concluded that prompt engineering couldn't fix the problem. And the product shipped regardless.
The Washington Post Guild, the newsroom's union, also weighed in, warning that the rollout threatened the newspaper's mission and standards. TheWrap, Futurism, The Daily Beast, and NPR all covered the internal revolt.
"This Is How Products Get Built"
The Post's institutional response was to frame the launch as normal product development. A spokesperson told Semafor: "This is how products get built and developed in the digital age." The company described the release as a "Beta" and said that features only graduate to full products "if they prove to be successful for the customer."
The "beta" framing is familiar from the tech industry, where it means "we shipped it knowing it has problems and we'll fix them in production." In software, this is a debatable but accepted practice. For a news organization whose core value proposition is factual accuracy, shipping a product that invents quotes and attributes them to real people while calling it "beta" is a harder sell.
News consumers don't generally check whether the podcast they're listening to under the Washington Post's name is a "beta." They hear Washington Post branding and assume the content meets Washington Post standards. That's what a brand means. When the content fabricates quotes from public figures, the damage accrues to the brand regardless of what Greek letter the product team assigned to the release.
Futurism reported that the Post said it would continue the AI podcast program despite the backlash. The company's position appeared to be that the problems were fixable and that stopping would mean giving up on the potential of the format.
The Timing
The launch came days after the White House created a website attacking individual journalists, with Post reporters among those targeted. Launching a product that generated inaccurate journalism under your own masthead while the administration was already questioning the credibility of your reporting was, from a public relations standpoint, not ideal timing.
Context
The Post is not the first news organization to learn this lesson. CNET's AI-written finance articles required corrections on 53% of the pieces published. Gannett's AI-generated sports articles were incoherent. Sports Illustrated's AI bylines turned out to be fake people. The pattern is consistent: news organizations deploy AI content generation, the AI produces errors at rates that would get human journalists fired, and the organization either pulls the feature or enters a protracted "we're improving it" phase.
What distinguishes the Post's case is the documentation. Most AI content failures are discovered after launch, when readers or outside journalists spot the errors. The Post's failure was discovered before launch, by the Post's own testing, with failure rates between 68% and 84%, and an internal review explicitly stating that the problems were structural rather than fixable through prompting. And the product launched anyway.
The internal review's conclusion bears repeating: "Further small prompt changes are unlikely to meaningfully improve outcomes without introducing more risk." The product team agreed with this assessment and shipped the product. That's not a technology failure. That's a decision.