Gemini paused people images after historical inaccuracies


Google paused Gemini's image generation of people on February 22, 2024, after users discovered the tool was producing historically inaccurate depictions - including racially diverse World War II German soldiers, Black female popes, and multiethnic U.S. Founding Fathers. The overcorrection stemmed from diversity tuning meant to counter training-data biases, but the model failed to distinguish when diversity adjustments were inappropriate for specific historical prompts. CEO Sundar Pichai called the outputs "completely unacceptable." Google SVP Prabhakar Raghavan later published a blog post acknowledging the model had "overcompensated" and been "over-conservative."

Incident Details

Severity: Facepalm
Company: Google
Perpetrator: AI Product
Incident Date:
Blast Radius: Feature paused; trust hit; policy and model adjustments.

The Launch

Google had rolled out image generation capabilities in Gemini earlier in February 2024. The feature let users type text prompts and receive AI-generated images, similar to tools like DALL-E and Midjourney. It was part of Google's push to make Gemini a competitive multimodal AI product after the rocky debut of Bard the year before.

For a few weeks, things seemed fine. Then users started sharing their results on social media.

The Images

The problems surfaced on X (formerly Twitter) in the third week of February 2024. Users began posting screenshots showing Gemini producing images that were historically wrong in conspicuous ways. When prompted to generate images of World War II German soldiers, Gemini returned pictures of racially diverse soldiers in Wehrmacht or SS uniforms - including Black and Asian individuals in roles that did not reflect the actual composition of Nazi Germany's military forces. When asked to depict the Founding Fathers of the United States, the model returned images of men and women from various ethnic backgrounds, which did not match the historical reality that the Founding Fathers were white men.

Other prompts produced similarly off-base results. A request for an image of a pope generated what appeared to be a Black woman wearing papal vestments. Two or three early popes may have been of African descent (the last serving until around 496 AD), but there is no verified female pope in the Vatican's official history, and Gemini was generating these images without any contextual grounding.

At the same time, users reported that Gemini was declining to generate images of white people when asked directly, or that it would awkwardly inject diversity into prompts where it made no sense. The behavior was inconsistent - sometimes overly cautious, sometimes overcorrecting in bizarre directions.

What Went Wrong

Google had been aware for years that AI image generators carry biases inherited from their training data. OpenAI had faced similar criticism with DALL-E in 2022, when the model defaulted to strongly gendered and racially stereotyped outputs. A prompt for "builder" produced exclusively male images; a prompt for "flight attendant" produced exclusively female images. The default representation reflected the skewed distribution of images and captions in the training data.

To counter this, Google's team had calibrated Gemini to produce more diverse image outputs. The system was designed to add diversity terms to prompts behind the scenes - a technique sometimes called prompt rewriting. When a user asked for "a picture of a doctor," the system might internally modify the prompt to produce a range of skin tones, genders, and ages in the output. Margaret Mitchell, a former Google AI ethics researcher, explained that AI models can be instructed to generate a larger set of images than the user actually sees, then rank them to surface more diverse results. Darker skin tones might be ranked higher to counterbalance training data that skews white.
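Google has not published the pipeline's internals, but the two techniques Mitchell describes can be sketched in a few lines of code. Everything below is hypothetical: the function names, the appended descriptors, and the attribute-based re-ranking illustrate the general approach rather than Gemini's actual implementation.

```python
import random

# Hypothetical descriptors a rewriter might silently append; illustrative only.
DIVERSITY_TERMS = ["diverse", "of varied ethnicities and genders"]


def rewrite_prompt(user_prompt: str) -> str:
    """Broaden the user's prompt behind the scenes before it reaches the model."""
    return f"{user_prompt}, {random.choice(DIVERSITY_TERMS)}"


def generate_candidates(prompt: str, n: int = 16) -> list[dict]:
    """Stand-in for the image model: returns n candidates, each tagged with
    predicted attributes such as {"image": ..., "skin_tone": ..., "gender": ...}."""
    raise NotImplementedError("placeholder for the actual generator")


def rerank_for_diversity(candidates: list[dict], k: int = 4) -> list[dict]:
    """Greedily surface k images spanning as many distinct attribute
    combinations as possible, padding with leftovers if needed."""
    selected, seen = [], set()
    for cand in candidates:
        key = (cand.get("skin_tone"), cand.get("gender"))
        if key not in seen:
            selected.append(cand)
            seen.add(key)
        if len(selected) == k:
            return selected
    for cand in candidates:  # fewer distinct combinations than k: pad with the rest
        if cand not in selected:
            selected.append(cand)
        if len(selected) == k:
            break
    return selected


def respond(user_prompt: str) -> list[dict]:
    prompt = rewrite_prompt(user_prompt)      # 1. rewrite behind the scenes
    candidates = generate_candidates(prompt)  # 2. generate more than the user sees
    return rerank_for_diversity(candidates)   # 3. rank to surface a varied subset
```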

The strategy started from a reasonable premise. Training data for image generation models is overwhelmingly biased toward light-skinned men in positions of authority, and uncorrected models reproduce those biases faithfully. Google's team understood that defaulting to historical biases would generate public backlash. The problem was in the implementation.

The diversity calibration was applied uniformly. The model did not have adequate logic to distinguish between "generate a picture of a doctor" (where diversity is appropriate and desirable) and "generate a picture of the Nazi Wehrmacht" (where injecting racial diversity produces outputs that are historically false and offensive). The system treated all prompts for people the same way, regardless of whether the prompt had a specific historical context that made the diversity adjustment inappropriate.
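Expressed as code, that uniformity is a single unconditional rule. The snippet below is a deliberately crude illustration (the keyword detector and the injected wording are invented for this example), not a reconstruction of Google's system:

```python
def mentions_people(prompt: str) -> bool:
    """Toy detector; a production system would use a classifier."""
    return any(w in prompt.lower() for w in ("person", "people", "doctor", "soldier", "pope"))


def calibrate(prompt: str) -> str:
    # The same adjustment fires for every prompt about people,
    # with no check for historical or contextual specificity.
    if mentions_people(prompt):
        return prompt + ", showing a range of people"
    return prompt


print(calibrate("a picture of a doctor"))
# -> "a picture of a doctor, showing a range of people"   (diversity appropriate)
print(calibrate("a 1943 Wehrmacht soldier"))
# -> "a 1943 Wehrmacht soldier, showing a range of people" (historically constrained; rule fires anyway)
```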

Prabhakar Raghavan, Google's Senior Vice President, later explained in a blog post that Gemini had been "calibrated to show a range of people" but had "not accounted for cases that should clearly not show a range." The model had simultaneously been set to be too cautious about some prompts, interpreting "some very anodyne prompts as sensitive" and refusing to generate images at all. These two tendencies - overcorrecting for diversity and over-refusing on sensitivity - combined to produce outputs that were, in Raghavan's words, "embarrassing and wrong."

The Social Media Reaction

The screenshots spread rapidly across X, and the reaction split along predictable political lines. Right-wing commentators and accounts seized on the images as evidence of what they characterized as "woke" Big Tech enforcing a diversity agenda. Elon Musk reposted a screenshot showing Gemini's chatbot telling a user that white people should acknowledge white privilege, calling the chatbot "racist and sexist." The controversy became a political flashpoint, with some accounts pushing unfounded conspiracy theories that Google was deliberately trying to erase white people from AI image results.

On the other side, AI researchers and diversity advocates pointed out the legitimate problem Gemini was trying to solve. Training data bias in AI image models is well-documented and produces real harm when models reproduce stereotypes. The question was whether Google's fix had swung past accuracy into generating false historical depictions. Most researchers agreed it had. Dave Willner, former head of trust and safety at OpenAI, told Platformer's Casey Newton that the approach "wasn't exactly elegant" and speculated that the missteps resulted at least partly from insufficient resources allocated to the engineers handling this nuanced work.

Google's Response

On February 22, 2024, Google announced it was pausing Gemini's ability to generate images of people entirely. The company acknowledged the outputs were inaccurate and said it would work to fix the feature before re-enabling it.

CEO Sundar Pichai went further, calling the generated images "completely unacceptable" in a memo to Google employees. Pichai said the company would conduct a structural review of what went wrong and why, and that teams would work to address the issues before the image generation feature for people was brought back.

Gemini Senior Director of Product Jack Krawczyk also posted on X, acknowledging the problems and promising corrections. The tone from Google's leadership was notably contrite compared to the company's typical product-issue responses. This wasn't a minor bug or an edge case - it was a flagship AI product generating images that managed to offend people across the political spectrum simultaneously.

Raghavan's blog post explaining what happened with Gemini's image generation provided the most detailed technical account. He described the two failure modes - overcompensation on diversity and over-caution on sensitivity - and confirmed that the feature would remain paused until the issues were resolved.

The Bias Correction Problem

The Gemini image incident highlighted a genuine engineering challenge that no AI company has cleanly solved. Image generation models trained on internet-scale datasets absorb the biases present in those datasets. If most images of CEOs in the training data show white men, the model will default to generating white male CEOs. If most images of nurses show women, the model will default to generating female nurses. These defaults reinforce stereotypes.

Correcting these biases requires intervention, but the interventions are themselves fraught. Simple approaches - like randomly diversifying all outputs - produce the exact kind of failures Gemini exhibited. More nuanced approaches require the model to understand historical and cultural context at a level current systems don't reliably achieve. A model would need to know that "generate an image of a 1940s Wehrmacht soldier" calls for historical accuracy, while "generate an image of a modern software engineer" is a case where diversity is appropriate.

This kind of contextual reasoning is precisely what language and image models are bad at. They operate on statistical patterns, not historical knowledge. When the calibration system injects diversity terms into prompts behind the scenes, it does so based on rules, not understanding. The rules either apply too broadly (as in Gemini's case) or too narrowly (as in early DALL-E, which defaulted to stereotypes).
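A context gate is easy to caricature and hard to build well. The sketch below adds the check missing from the earlier snippet using a keyword heuristic; the marker list and helper are hypothetical, and the brittleness is the point: rules like this either miss historical prompts phrased in unexpected ways (too narrow) or suppress diversity for any prompt that happens to mention a period (too broad).

```python
# Hypothetical markers of historically or demographically constrained prompts.
HISTORICAL_MARKERS = ("wehrmacht", "founding fathers", "1940s", "medieval", "pope")


def has_specific_context(prompt: str) -> bool:
    """Keyword stand-in for the contextual judgment the text describes."""
    return any(marker in prompt.lower() for marker in HISTORICAL_MARKERS)


def calibrate(prompt: str) -> str:
    if has_specific_context(prompt):
        return prompt                              # leave constrained prompts untouched
    return prompt + ", showing a range of people"  # broaden only generic prompts


calibrate("a modern software engineer")   # -> broadened
calibrate("a 1940s Wehrmacht soldier")    # -> unchanged
```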

Aftermath

Google eventually restored image generation capabilities for Gemini, but the incident left marks on the company's AI credibility. Coming just over a year after the Bard stumble at launch in February 2023 - when the chatbot gave a factually incorrect answer about the James Webb Space Telescope in a promotional ad - the Gemini image controversy reinforced a pattern of Google's AI products launching with problems that should have been caught in testing.

The incident also became a recurring reference point in debates about AI safety and content moderation. It demonstrated that even well-intentioned safety measures can produce outputs that are factually wrong and publicly embarrassing. The problem was not that diversity calibration is unnecessary - the underlying bias problem in training data is real and well-documented. The problem was that applying corrections without sufficient contextual awareness creates new failure modes that can be just as harmful as the biases being corrected.

For Google, Gemini's image generation pause joined a growing list of AI product launches that had to be walked back or paused within weeks. The company's AI division was shipping features at a pace driven by competitive pressure from OpenAI, and the quality control wasn't keeping up.
