NewsBench says major chatbots failed election answers on facts, sourcing, or neutrality 90% of the time

Inside the audit

Forum AI launched NewsBench on May 21, 2026, as an independent benchmark for how major chatbots handle news, politics, and current events. The first release tested four widely used systems: OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, and xAI's Grok.

The scale was large enough to be worth taking seriously. Forum evaluated 3,136 prompts and 12,542 responses across politics, foreign affairs, the economy, healthcare, education, and consumer questions. The white paper describes an expert-grounded pipeline: senior practitioners helped define editorial standards, domain experts built gold labels, and calibrated AI judges applied those standards at scale. The three scoring dimensions were factuality, neutrality, and source quality.

This matters because news is not a static trivia set. Election procedures change; candidates die, drop out, or get replaced; court rulings move; agency rules get updated. A chatbot that sounds excellent on old facts can become a civic Roomba with a law degree as soon as the topic depends on yesterday's update.

NewsBench was designed around that problem. Its prompt set mixes real-world prompts, expert-generated edge cases, and synthetic prompts, with a current-events portion refreshed regularly. The white paper says the final prompt set included 1,500 real-world prompts, 750 expert-generated prompts, and 885 synthetic prompts. Forum also says roughly 500 prompts are refreshed monthly to keep the benchmark attached to moving news rather than fossilized benchmark lore.

That election number

The sharp headline is the election finding. Forum reported that, on prompts about the upcoming U.S. midterms, the four chatbots failed on accuracy, neutrality, or source selection 90% of the time.

That does not mean 90% of election answers were pure factual hallucinations. The combined failure rate includes bad facts, political lean, and poor source choices. That distinction matters. A model can answer without inventing a date and still fail if it smuggles partisan framing into the answer or cites junk sources as if they were reliable.
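To make the arithmetic concrete, here is a minimal sketch of how a combined failure rate differs from a factual-error rate. The `Grade` fields and the ten toy responses are invented for illustration. They are not NewsBench's schema or data, and Forum's actual scoring pipeline may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grade:
    """Hypothetical per-response grade; field names are illustrative only."""
    factual_ok: bool
    neutral_ok: bool
    sources_ok: bool

    @property
    def failed_any(self) -> bool:
        # A combined failure is a miss on any one of the three dimensions.
        return not (self.factual_ok and self.neutral_ok and self.sources_ok)


def rate(grades, predicate) -> float:
    """Share of grades for which the predicate is true."""
    grades = list(grades)
    if not grades:
        raise ValueError("no grades to summarize")
    return sum(1 for g in grades if predicate(g)) / len(grades)


# Made-up toy data: ten responses, NOT real benchmark results.
toy = [
    Grade(False, True, True),
    Grade(False, False, True),
    Grade(True, False, True),
    Grade(True, False, False),
    Grade(True, True, False),
    Grade(True, True, False),
    Grade(False, True, False),
    Grade(True, False, True),
    Grade(True, True, False),
    Grade(True, True, True),
]

combined_failure = rate(toy, lambda g: g.failed_any)
factual_failure = rate(toy, lambda g: not g.factual_ok)
print(f"combined: {combined_failure:.0%}, factual only: {factual_failure:.0%}")
```

In this toy set, nine of ten responses fail on something while only three contain a factual error. The combined number is a union across dimensions, so it will always be at least as large as any single-dimension rate.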

The factual-error number is still bad. Bloomberg coverage syndicated by The Star reported that nearly 36% of election answers contained at least one factual error. Grok was the worst by that measure, with errors in nearly 52% of election answers. Forum's own launch post framed the broader accuracy problem similarly: about 30% of all responses in the dataset contained at least one verifiable factual error, and about one in three voting-relevant responses ahead of the midterms contained errors.

The examples were not abstract benchmark dust. Forum said Gemini overstated 2026 Arkansas ACA premium increases by claiming increases around 65% to 67% when the approved weighted average increase was about 22%. Grok described Iranian military capability as more thoroughly erased than public reporting supported. Claude attributed campaign-strategy quotes to Representative Raul Grijalva even though Forum says NPR attributed them to Adelita Grijalva. These are the small, polished errors that make a user nod along until the damage is already in the notes.

Citations with a straight face

Source quality was a separate mess. Forum found that about 15% of all responses cited at least one state-controlled foreign media outlet. On foreign-policy prompts, that rose to 35%. The Star's Bloomberg-sourced report said ChatGPT and Grok were the worst offenders in that category, citing state-owned media in 51% and 44% of foreign-policy answers, respectively.

State media is not automatically forbidden as context. Sometimes a question is specifically about what a government-controlled outlet said. The failure is treating that outlet as an ordinary neutral source for contested public-policy questions, then wrapping the answer in the same calm citation style used for reputable journalism, public records, or peer-reviewed research.

Forum listed examples that show why this is dangerous. ChatGPT cited Global Times on the Uyghur genocide. Grok cited CGTN America on insider trading by U.S. senators. ChatGPT cited People's Daily Online on whether American power is waning and RT on why the U.S. political left criticizes Donald Trump. Calling that source diversity gives the retrieval system too much credit. It walked into the propaganda aisle and came back with a receipt.

Commercial sources were common too. Forum said more than 45% of responses cited at least one commercial source, with Grok at 74% and ChatGPT at 56%. Commercial material is not useless, but an ammo retailer's blog is a strange place to ground an answer about liberal views on gun regulation when Pew, Gallup, Johns Hopkins, Quinnipiac, and primary legal sources exist.

Why this earns its own entry

This story sits near the Demos Scottish election study already in the graveyard, but it is not the same incident wearing a different hat. Demos tested five AI services on Scottish Parliament election questions and found factual errors in 34.1% of factual responses. NewsBench is broader: four frontier chatbots, thousands of news and current-events prompts, a U.S. midterm focus, and an explicit split between factuality, neutrality, and source quality.

That split is useful. A lot of AI accuracy arguments collapse everything into "was the answer true?" NewsBench makes a more uncomfortable point. For civic information, truth is only one part of the job. A useful answer also needs to avoid partisan framing and use sources that make sense for the question.

The white paper's model table shows the tradeoffs. ChatGPT had the best factuality score, with 91.1% of responses passing factuality, but it scored lower than Claude and Gemini on source quality. Claude had the highest source-quality score, yet only 58.9% of its responses passed factuality. Grok landed at the bottom on neutrality and factuality, with 57.1% of responses passing factuality. One model can cite prettier sources while still making false claims, and another can be relatively more factual while sourcing worse material.

Users struggle to detect that failure. A chatbot answer with citations looks more trustworthy than a naked paragraph. If the citations are weak, politically loaded, or used to support a claim they do not actually prove, the polish becomes part of the risk.

Guardrails have to be boring

No one needs a chatbot to improvise election procedure. For voter eligibility, registration deadlines, polling locations, ballot rules, and candidate lists, the correct behavior is dull: route to official sources, show the retrieval date, name the jurisdiction, and refuse when the system cannot verify current facts.
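The dull behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not any vendor's implementation: `OfficialRecord`, `answer_procedural`, and the one-day `MAX_AGE` freshness window are all hypothetical names and values, and the retrieval step that would produce the record is left out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class OfficialRecord:
    """Hypothetical record fetched from an official election source."""
    jurisdiction: str
    source_name: str
    text: str
    retrieved_on: date


# Assumed freshness window; a real system would tune this per question type.
MAX_AGE = timedelta(days=1)


def answer_procedural(
    jurisdiction: Optional[str],
    record: Optional[OfficialRecord],
    today: date,
) -> str:
    """Answer only from a fresh official record for the named jurisdiction."""
    if not jurisdiction:
        # Never guess the jurisdiction; rules differ by state and county.
        return "Which jurisdiction do you mean? Election rules differ by state and county."

    refusal = (
        f"I can't verify current election rules for {jurisdiction}. "
        "Please check the official election office for that jurisdiction."
    )
    # Refuse when there is no record or it belongs to another jurisdiction.
    if record is None or record.jurisdiction != jurisdiction:
        return refusal
    # Refuse when the record is older than the freshness window.
    if today - record.retrieved_on > MAX_AGE:
        return refusal

    # Otherwise quote the record and show source, jurisdiction, and retrieval date.
    return (
        f"{record.text} "
        f"(Source: {record.source_name}, {record.jurisdiction}; "
        f"retrieved {record.retrieved_on.isoformat()}.)"
    )
```

The design choice is that refusal is the default path. The function produces an answer only when every check passes, which is the opposite of a model improvising from whatever it remembers.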

For political analysis, the fix is not forced, mushy neutrality. It is accurate sourcing, clear attribution, and a clean separation of facts, disputed claims, and advocacy. If a response cites state-controlled media, label it. If a source is a trade group, retailer, campaign, or advocacy shop, label that too. If a source does not support the claim, the claim should not appear.
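A rough sketch of that labeling rule follows. Everything in it is hypothetical: the `SourceType` categories, the `SOURCE_REGISTRY` lookup, and the `.example` domains are placeholders, and the `supported` flag assumes some upstream check has already decided whether the source backs the claim.

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    OFFICIAL = "official record"
    STATE_MEDIA = "state-controlled media"
    COMMERCIAL = "commercial source"
    ADVOCACY = "advocacy or campaign source"
    UNKNOWN = "unclassified source"


# Hypothetical registry keyed by domain; entries are placeholders, not real outlets.
SOURCE_REGISTRY = {
    "elections.example.gov": SourceType.OFFICIAL,
    "state-outlet.example": SourceType.STATE_MEDIA,
    "retailer-blog.example": SourceType.COMMERCIAL,
    "campaign-site.example": SourceType.ADVOCACY,
}


@dataclass(frozen=True)
class Claim:
    text: str
    domain: str
    supported: bool  # assumed output of an upstream claim-versus-source check


def render(claims):
    """Drop unsupported claims and label the source type on the rest."""
    lines = []
    for claim in claims:
        if not claim.supported:
            # If the source does not support the claim, the claim does not appear.
            continue
        label = SOURCE_REGISTRY.get(claim.domain, SourceType.UNKNOWN).value
        lines.append(f"{claim.text} [{claim.domain}: {label}]")
    return lines
```

Unknown domains fall through to an explicit "unclassified source" label rather than passing silently as neutral, which is the failure the citations section describes.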

That restraint makes the product feel less magical. Fine. Election information is not a place for magic. It is a place for boring procedures that keep people from being told the wrong date, the wrong rule, or the wrong source.

NewsBench does not prove every chatbot answer about politics is broken. It does show that the major systems still fail in patterns that matter: stale facts, false claims, tilted framing, and source selection that can launder weak evidence into confident answers. A user asking about an election should not need to run a newsroom verification desk to understand whether the bot's answer is usable.

The awkward part for AI companies is that the product pitch keeps drifting toward "ask us anything." NewsBench is evidence that "anything" currently includes questions the systems cannot handle reliably enough. The answers arrive formatted and confident; the civic reliability still looks like it wandered in late carrying a half-read briefing memo.

Vibe Graveyard
