Study finds Google's AI Overviews wrong millions of times per hour

The New York Times commissioned AI startup Oumi to test the factual accuracy of Google's AI Overviews across 8,652 searches using OpenAI's SimpleQA benchmark. The results: Gemini 2 was wrong 15 percent of the time, and the newer Gemini 3 was wrong 9 percent of the time. Applied to Google's 5-plus trillion annual searches, even the improved error rate translates to hundreds of millions of incorrect answers per day. Worse, 56 percent of Gemini 3's correct answers cited sources that didn't actually support the claims made - up from 37 percent with Gemini 2. Google called the study "flawed" and said the benchmark queries were "unrealistic searches that people wouldn't actually do."

Incident Details

Severity: Facepalm
Company: Google
Perpetrator: Search Product
Incident Date: April 7, 2026
Blast Radius: Over 1.5 billion monthly AI Overview users served incorrect information at scale; cited sources frequently don't support the answers presented.

The Study

On April 7, 2026, The New York Times published an investigation into the accuracy of Google's AI Overviews - the AI-generated summaries that now sit above traditional search results for a substantial share of all Google queries. The Times commissioned Oumi, a Seattle-based AI startup founded by former Google and Microsoft engineers, to measure the feature's factual reliability using SimpleQA, a benchmark developed by OpenAI for evaluating AI factual accuracy.

Oumi ran two rounds of testing. The first, in October 2025, evaluated AI Overviews powered by Gemini 2. The second, in February 2026, tested the upgraded Gemini 3. Each round covered 4,326 Google searches - 8,652 total. Oumi used its own verification model, HallOumi, to check each response against verifiable facts at scale.
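The Times piece doesn't publish Oumi's pipeline, and the sketch below is not HallOumi's actual interface. It's a minimal, hypothetical Python harness showing the shape of the two metrics the study reports - accuracy, and groundedness among correct answers. All names are invented; in a real run, a verifier model would supply the sources_support flag rather than a hand-set boolean.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    question: str
    expected: str          # ground-truth answer from the benchmark
    answer: str            # what the AI Overview said
    sources_support: bool  # verifier's verdict: do the cited pages back the claim?

def score(trials: list[Trial]) -> dict:
    # Accuracy: did the answer contain the expected fact?
    correct = [t for t in trials if t.expected.lower() in t.answer.lower()]
    # Grounding: among correct answers, how many were backed by their citations?
    grounded = [t for t in correct if t.sources_support]
    return {
        "accuracy": len(correct) / len(trials),
        "ungrounded_share_of_correct": 1 - len(grounded) / len(correct),
    }

# Toy run: two correct answers, one of them unsupported by its citations.
trials = [
    Trial("When did the Bob Marley Museum open?", "1986", "It opened in 1986.", True),
    Trial("When was Yo-Yo Ma inducted?", "1998", "He was inducted in 1998.", False),
    Trial("What year did the museum open?", "1986", "It opened in 1987.", True),
]
print(score(trials))  # accuracy 0.67; half of the correct answers are ungrounded
```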

This wasn't the first investigation into AI Overviews getting things wrong. The site you're reading already has entries for the glue-on-pizza and eat-rocks incidents of May 2024, and The Guardian's January 2026 investigation into dangerous health misinformation. What the Oumi study added was something different: a quantitative measurement of the feature's baseline error rate, applied to the volume of queries Google handles every day.

The Numbers

Gemini 2 produced correct answers 85 percent of the time. Gemini 3 improved to 91 percent. Google was happy to point out the improvement.

But the study's more interesting finding was about grounding - whether the sources AI Overviews cited actually supported the answers they gave. With Gemini 2, 37 percent of correct answers were "ungrounded," meaning the linked web pages didn't fully back up the information in the AI Overview. With Gemini 3, that number climbed to 56 percent. The newer model got better at producing correct answers and simultaneously got worse at showing its work. More than half the time, even when Gemini 3 was right, a user who clicked through to verify the answer would find sources that didn't say what the AI Overview claimed they said.
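Multiplying the two published figures makes the consequence concrete: if 91 percent of answers are correct and 56 percent of those aren't backed by their citations, only about 40 percent of Gemini 3's answers are both right and verifiable from the sources shown. A quick check of that arithmetic:

```python
accuracy = 0.91    # Gemini 3: share of answers that were correct (study figure)
ungrounded = 0.56  # share of correct answers whose citations didn't support them

correct_and_grounded = accuracy * (1 - ungrounded)
print(f"{correct_and_grounded:.0%}")  # 40%: right AND backed by the cited sources
```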

An AI search feature that answers correctly but can't point you to where it got the answer is a reference librarian who gives you the right book title and the wrong shelf number. You might end up with the right information, but you can't confirm it, and you can't trust the process that got you there.

What 9 Percent Means at Scale

Google processes over 5 trillion searches per year. Estimates of how many queries trigger an AI Overview range from 25 to 48 percent, depending on the source and the measurement period. The independent blog Algorythmic worked through the arithmetic using a trigger rate below even the low end of those estimates: 14 billion daily searches, a 20 percent AI Overview trigger rate, and a 9 percent error rate yield roughly 252 million incorrect answers per day. More than 10 million every hour.
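Reproducing that calculation (the inputs are Algorythmic's assumptions, not figures Google has confirmed):

```python
daily_searches = 14_000_000_000  # ~5 trillion searches per year (assumed)
trigger_rate = 0.20              # share of queries showing an AI Overview (assumed)
error_rate = 0.09                # Gemini 3 error rate from the Oumi study

wrong_per_day = daily_searches * trigger_rate * error_rate
print(f"{wrong_per_day:,.0f} wrong answers per day")  # 252,000,000
print(f"{wrong_per_day / 24:,.0f} per hour")          # ~10,500,000
```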

Other analyses using higher trigger-rate estimates arrived at even larger figures. The exact number depends on assumptions that are hard to pin down from outside Google, but the order of magnitude is consistent across every calculation: AI Overviews are producing wrong answers at a volume that dwarfs the total daily output of most news organizations. Some outlets framed this as "misinformation at a scale possibly unprecedented in the history of human civilization." That's a headline-writer's flourish, but the underlying math is real.

A 91 percent accuracy rate sounds respectable in a research paper. Applied to the world's dominant search engine serving billions of queries per day, it means wrong answers are being delivered to users constantly, mixed in with correct ones, formatted identically, and presented with no distinguishing markers.

The Errors

The Times and Oumi cataloged specific failures. Asked when Bob Marley's home was converted into a museum, AI Overviews answered 1987; the correct year was 1986. The AI cited three web pages: two didn't mention a date at all, and the third (Wikipedia) listed two contradictory years. Gemini selected the wrong one and presented it as fact.

When asked about cellist Yo-Yo Ma's induction into the Classical Music Hall of Fame, the system linked to the organization's website - which listed Ma among 165 inductees since 1998 - and then stated there was no record of his induction. It contradicted its own cited source within a single answer.

Other errors included a misidentified river in North Carolina, incorrect details about an Air India crash, and a wrong death year for a former MLB pitcher. None came with hedging or uncertainty markers. All appeared in the same authoritative format as the correct answers around them.

The study also flagged the quality of sources AI Overviews relied on. Facebook was the second most-cited source overall; Reddit was the fourth. Inaccurate answers cited Facebook 7 percent of the time, compared with 5 percent for accurate ones. The gap was small, but it's worth noting on its own that a feature positioned as a replacement for traditional search results leans this heavily on social media posts as source material.

Google's Response

Google spokesperson Ned Adriance described the study as having "serious holes." His objections: SimpleQA "is an old benchmark that is known for being full of errors"; Google DeepMind research had found incorrect ground truths in the benchmark itself; the test queries were "unrealistic searches that people wouldn't actually do"; and using one AI model (HallOumi) to evaluate another (Gemini) was methodologically questionable.

Some of these points landed. SimpleQA does contain errors - OpenAI has acknowledged as much. The benchmark's short, fact-seeking questions don't perfectly mirror the complex queries people actually type into Google. Whether a 9 percent error rate on SimpleQA overstates or understates the real-world rate is genuinely uncertain.

But Google's response didn't address the grounding finding - the 56 percent of correct answers where the cited sources didn't support the information presented. That's a quality problem that exists independent of whatever benchmark you use to measure accuracy. Even if every AI Overview answer were factually correct, a feature that routinely points users to sources that don't say what the feature claims they say has a credibility problem that no benchmark dispute can explain away.

Scale vs. Accuracy

AI Overviews launched in the U.S. in May 2024 as a rebrand of Google's Search Generative Experience. By 2026, the feature had expanded to over 200 countries and 40-plus languages, reaching more than 1.5 billion monthly users. It sits above traditional search results on the page, which means it's the first thing most users see. Organic click-through rates for queries with AI Overviews have dropped by as much as 61 percent according to one analysis - users read the AI summary and don't scroll further.

Google's strategic bet is straightforward. AI Overviews are the company's primary response to competitors like Perplexity and ChatGPT that are pulling search-like queries away from Google. Scaling the feature back would mean ceding ground. Keeping it at its current accuracy means serving a firehose of wrong answers to users who overwhelmingly trust them - a Wharton School study found users follow AI guidance roughly 80 percent of the time, even when it's incorrect, and separate research shows only 8 percent of users bother to fact-check AI-generated answers.
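Chaining those findings together is strictly back-of-the-envelope - the three studies measured different populations and tasks, so the rates don't compose cleanly - but it sketches the friction problem:

```python
error_rate = 0.09       # Oumi study: Gemini 3 wrong-answer rate
follow_rate = 0.80      # Wharton study: users follow AI guidance ~80% of the time
fact_check_rate = 0.08  # share of users who verify AI-generated answers

# Per 100 AI Overview answers: wrong, followed anyway, and never checked
unchallenged_errors = 100 * error_rate * follow_rate * (1 - fact_check_rate)
print(f"{unchallenged_errors:.1f} per 100 answers")  # ~6.6
```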

The result is a system where wrong information encounters almost no friction between generation and belief. A 9 percent error rate might be acceptable for a tool with a beta label and a disclaimer. For a feature that has quietly replaced the default search experience for over a billion people, the question is whether "right nine times out of ten" is good enough when the tenth time is indistinguishable from the other nine.
