BBC/EBU study says AI news summaries fail ~half the time
A BBC-led audit of 2,700 news questions asked in 14 languages found that Gemini, Copilot, ChatGPT, and Perplexity mangled 45% of the answers, usually by hallucinating facts or stripping out attribution. The consortium logged serious sourcing lapses in a third of responses, rising to 72% of Gemini replies, along with outdated or fabricated claims about public-policy news, reinforcing fears that AI assistants are siphoning audiences while distorting the journalism they quote.
The Study
The European Broadcasting Union (EBU) coordinated, and the BBC led, the largest study to date on how AI assistants handle news content. Twenty-two public service media organizations from 18 countries participated, including DW (Germany), NPR (United States), and broadcasters across Europe. Journalists from these organizations evaluated 3,000 AI-generated responses to news-related questions posed in 14 languages.
The four AI assistants tested were ChatGPT (OpenAI), Copilot (Microsoft), Gemini (Google), and Perplexity AI. Each was asked the same news questions, and the responses were evaluated for accuracy, sourcing quality, and the ability to distinguish fact from opinion.
NPR's participation required a specific concession: for roughly two weeks, the organization stopped blocking AI bots from accessing its content so the data could be collected, then re-enabled the blocks afterward. That a news organization had to deliberately lower its defenses against AI scraping just to study how badly the AI tools misrepresented its content says something about the state of the relationship between news publishers and AI companies.
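For context, publishers usually implement these blocks through crawler rules in robots.txt. The sketch below checks which AI crawler user agents a site's robots.txt currently disallows. The crawler tokens are published ones (GPTBot, Google-Extended, PerplexityBot, CCBot), but the domain and path are placeholders; this is a generic illustration, not a description of NPR's actual configuration.

```python
# Minimal sketch: which AI crawler user agents does a site's robots.txt
# currently disallow? Uses only the standard library.
from urllib import robotparser

# Published AI crawler user-agent tokens; the list is illustrative, not exhaustive.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "PerplexityBot", "CCBot"]

def check_ai_crawler_access(site: str, path: str = "/") -> dict:
    """Return {crawler_name: allowed?} for the given site and path."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()  # fetch and parse the live robots.txt
    return {bot: parser.can_fetch(bot, f"{site.rstrip('/')}{path}")
            for bot in AI_CRAWLERS}

if __name__ == "__main__":
    # Placeholder domain; substitute any publisher's site.
    for bot, allowed in check_ai_crawler_access("https://example.org").items():
        print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Lifting the block for a study would amount to removing the corresponding Disallow rules for those user agents and restoring them once data collection ended.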
The Numbers
Forty-five percent of all AI-generated answers to news questions contained at least one significant issue with accuracy or sourcing. A significant issue meant a factual error, a fabricated claim, a missing or incorrect attribution, or a meaningful distortion of the source material.
When the researchers expanded the criteria to include less severe issues, 81 percent of responses had some form of factual or sourcing problem. An AI assistant returning a news summary with no issues at all was the exception, not the norm.
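To make the two headline figures concrete, here is a minimal sketch of how per-response reviewer flags would roll up into a "significant issue" rate and an "any issue" rate. The field names are hypothetical and the code is an illustration of the aggregation, not the study's actual evaluation pipeline.

```python
# Illustrative only: aggregate per-response reviewer flags into the two
# headline rates ("significant issue" vs. "any issue"). Field names are
# hypothetical, not the study's actual rubric.
from dataclasses import dataclass

@dataclass
class ResponseReview:
    factual_error: bool = False     # claim contradicted by, or absent from, the sources
    fabricated_claim: bool = False  # detail invented outright
    sourcing_error: bool = False    # missing, wrong, or misleading attribution
    distortion: bool = False        # meaning of the source material altered
    minor_issue: bool = False       # less severe problem (imprecision, style)

    def has_significant_issue(self) -> bool:
        return any([self.factual_error, self.fabricated_claim,
                    self.sourcing_error, self.distortion])

    def has_any_issue(self) -> bool:
        return self.has_significant_issue() or self.minor_issue

def issue_rates(reviews: list[ResponseReview]) -> tuple[float, float]:
    """Return (share with a significant issue, share with any issue)."""
    n = len(reviews)
    significant = sum(r.has_significant_issue() for r in reviews) / n
    any_issue = sum(r.has_any_issue() for r in reviews) / n
    return significant, any_issue
```

Under this framing, the study's 45 percent corresponds to the first rate and the 81 percent to the second.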
Thirty-one percent of responses showed serious sourcing problems - missing, misleading, or incorrect attributions. The AI tools would present information as if it came from specific news organizations, provide links to articles that didn't contain the cited claims, or strip attribution entirely while using a news outlet's reporting as the basis for the response.
Google Gemini was the worst performer, with a 76 percent error rate in its news responses according to detailed reporting on the study. The other three assistants also showed high error rates, but Gemini's was consistently the highest across languages and territories.
The BBC had conducted an earlier standalone study that provided a preview of these results. That study found that more than half of AI answers had significant issues, and nearly one-fifth of responses that cited BBC content as a source introduced factual errors that weren't present in the original BBC reporting. The AI tools were not just selecting and summarizing the BBC's articles. They were adding their own errors to the BBC's journalism and presenting the result under the BBC's name.
What Went Wrong
The errors fell into several categories. The most common was fabrication - the AI assistants generated factual claims that didn't appear in any source material. A chatbot asked about a policy development would include specific details (dates, numbers, quotes) that the original reporting didn't contain. These additions weren't marked as uncertain or unverified. They appeared alongside accurate information in the same tone and format, making them indistinguishable from the real facts without checking the original sources.
Sourcing failures were the second major category. AI assistants would attribute claims to specific news organizations that hadn't made those claims. They would provide links that led to articles about different topics, or to pages that no longer existed. They would present a synthesis of multiple sources as if it came from a single outlet. In some cases, the AI would generate a complete citation - outlet name, article title, date - for an article that had never been published.
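Two of those failures, links that do not resolve and links that do not contain the claim attributed to them, are mechanically checkable. The sketch below shows the shape of such a check under simplifying assumptions; real editorial verification requires far more than substring matching, and the function name and return values here are invented for illustration.

```python
# Simplified sketch of a citation check: does the cited URL resolve, and does
# the page contain the claim attributed to it? Crude by design; a real check
# would need HTML parsing, paraphrase matching, and archival lookups.
from urllib import request, error

def check_citation(url: str, quoted_claim: str, timeout: int = 10) -> str:
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return f"link problem: HTTP {resp.status}"
            body = resp.read().decode("utf-8", errors="replace")
    except (error.URLError, TimeoutError) as exc:
        # Covers dead links, fabricated URLs, and network failures.
        return f"link problem: {exc}"
    # Crude containment test: is the quoted claim anywhere on the cited page?
    if quoted_claim.lower() not in body.lower():
        return "claim not found on cited page"
    return "ok"
```

Even a check this simple would flag the fabricated citations the evaluators found, since an article that was never published has no URL to resolve.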
The tools also struggled with recency. News questions about ongoing events received answers that mixed current information with outdated facts, presenting superseded claims alongside current ones without distinguishing between them. A user asking about a developing story might receive a response that combined yesterday's facts with last month's, with no indication that the situation had changed.
Opinion-fact confusion was the third category. The AI tools would present opinion pieces and editorial arguments as factual reporting, or present disputed claims as settled facts. The distinction between a news organization reporting that something happened and a columnist arguing that something should happen was lost in the AI's summarization.
Consistency Across Languages
One of the study's most significant findings was that the error rates were consistent across all 14 languages tested. This wasn't a language-specific problem where AI performed well in English but poorly in smaller languages. The 45 percent error rate held regardless of whether the query was in English, French, German, or any of the other languages in the study.
This consistency suggests the errors are structural - built into how the models process and generate text about news events - rather than artifacts of training data imbalance. Models trained on more English text than Finnish text might be expected to perform better in English. That they performed roughly as poorly in every language points to a more fundamental limitation in how these systems handle factual reporting.
The Audience Problem
The Reuters Institute's Digital News Report 2025 found that 7 percent of online news consumers already use AI assistants as a primary source of news. Among users under 25, the figure is 15 percent, and both numbers are growing.
For public service broadcasters, this created a two-front problem. On one front, AI companies were using their content - often without payment or permission - to generate responses to user queries. On the other front, the AI-generated responses were distorting that content, introducing errors, stripping attribution, and presenting the result as if it were equivalent to (or a substitute for) reading the original reporting.
The users reading the AI summaries had no practical way to assess the quality of what they were getting. The summaries appeared authoritative, were formatted cleanly, and often included source citations (even if those citations were inaccurate). A user who asked ChatGPT about a news event and received a response citing the BBC had no reason to doubt the summary - unless they went to the BBC and read the actual article, which is the behavior the AI assistant was designed to replace.
The BBC and EBU research team released a "News Integrity in AI Assistants Toolkit" alongside the study. The toolkit addressed two questions: "What makes a good AI assistant response to a news question?" and "What are the problems that need to be fixed?" It was intended as a resource for technology companies, media organizations, researchers, and the general public - a framework for defining what acceptable AI news summarization looks like.
The Business Conflict
The study quantified a tension that news publishers had been articulating for years. AI companies scrape news content, use it to train models and generate responses, and then present those responses to users who might otherwise have visited the news organizations' websites. The AI companies monetize the interaction through advertising or subscription revenue. The news organizations whose content powers the responses receive nothing.
If the AI summaries were accurate and well-sourced, this would be a straightforward intellectual property and commercial dispute. That the summaries are wrong nearly half the time adds a dimension: the AI companies are not just using news content without compensation. They're distorting it. And they're distorting it under the news organizations' names, potentially damaging the audience's trust in the news brands that the AI claims to be citing.
What the AI Companies Said
At the time of the study's publication, the AI companies' standard response to accuracy concerns was that they were continuously working to improve their models and that users should verify important information from original sources. This advice - that users should check the AI's work by reading the original articles - acknowledges the problem while putting the burden on individual users to catch the errors.
For a user who turned to an AI assistant specifically to avoid reading full articles, the instruction to "verify important information" means the AI assistant didn't actually save them any time. They got an answer, they can't trust it, and they need to do the work the AI was supposed to do for them.
The study's contribution was making the scale of the problem visible. Individual instances of AI getting news stories wrong had been documented before. The BBC/EBU study showed that the problem wasn't anecdotal. It was systematic, it affected all four major AI assistants, it persisted across all languages tested, and it affected nearly half of all news-related queries. The AI tools were not occasionally getting news wrong. They were getting it wrong at a rate that would be unacceptable in any other medium that presents itself as an information source.