Amazon's retail site hit by wave of AI-code outages, losing millions of orders

The Pattern

In early March 2026, Amazon's main e-commerce website experienced a series of outages that the company's own internal documents linked to "Gen-AI assisted changes" and "novel GenAI usage for which best practices and safeguards are not yet fully established."

That language - "best practices and safeguards are not yet fully established" - is a remarkably diplomatic way to describe a situation in which AI-generated code changes caused one of the world's largest online retailers to lose millions of customer orders across multiple incidents in the span of a single week.

The March 5 Incident

The most severe disruption hit on March 5, 2026. Amazon's website went down for approximately six hours, preventing customers from completing transactions, viewing account details, or accessing product pages. The outage was traced to an erroneous code deployment.

The impact numbers are staggering. According to Business Insider, the March 5 incident caused a 99% drop in orders across Amazon's North American marketplaces. That's not a typo. Ninety-nine percent. The estimated total: 6.3 million lost orders in a single day.

For a company that ships millions of packages daily, a six-hour window in which virtually no orders were being placed represents a financial hit that plausibly exceeds $100 million - not counting the downstream effects on third-party sellers, delivery logistics, and customer trust.

The March 2 Incident

Three days earlier, on March 2, a separate incident caused Amazon to display incorrect delivery times to customers. This one didn't take the site fully offline, but it resulted in approximately 1.6 million errors and an estimated 120,000 lost orders globally. Wrong delivery estimates don't sound catastrophic until you realize that customers make purchasing decisions based on delivery speed - and that "promising something you can't deliver" is the kind of customer experience failure that erodes trust incrementally.

The Internal Response

Amazon's response to the pattern of outages was, by corporate standards, unusually aggressive. The company convened an emergency internal "deep dive" meeting to investigate the root causes. Internal documents from that investigation identified the outages as stemming from AI-assisted code changes - specifically, from uses of generative AI tools that produced problems the company's existing review processes hadn't caught.

Three major changes came out of the investigation.

First, Amazon initiated a 90-day "code safety reset" covering approximately 335 "Tier-1" systems - the critical retail infrastructure that directly handles ordering, payments, and customer-facing experiences. The stated goal is to introduce "controlled friction" into the deployment process and strengthen long-term safeguards. "Controlled friction" is the polite engineering term for "slowing things down because going fast was breaking the site."

Second, the company now requires senior engineers to sign off on all AI-assisted code changes made by junior and mid-level engineers. This is a significant process change for a company that famously pushes engineering velocity and small-team autonomy. Requiring a senior review checkpoint for an entire class of code changes is an implicit admission that AI-generated code needs more scrutiny than human-written code - a conclusion that's consistent with the CodeRabbit study that found AI-generated code produces 2.74 times more security vulnerabilities.

Third, Amazon mandated more extensive documentation before deploying critical code changes, adding another layer of review to the process.
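As a rough illustration only, the second and third changes could be expressed as a pre-deploy gate like the one below. This is a minimal sketch, not Amazon's actual tooling: the Change record, the level names, the blocking_reasons function, and the choice to treat "critical" as "touches a Tier-1 system" are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical level names; the article only distinguishes junior/mid-level from senior.
SENIOR_LEVELS = {"senior", "principal"}


@dataclass
class Change:
    """Hypothetical description of a code change awaiting deployment."""
    author_level: str                 # e.g. "junior", "mid", "senior"
    ai_assisted: bool                 # the change was produced with a generative AI tool
    tier1: bool                       # assumption: "critical" means it touches a Tier-1 system
    approver_levels: list[str] = field(default_factory=list)
    has_documentation: bool = False   # pre-deploy documentation is attached


def blocking_reasons(change: Change) -> list[str]:
    """Return the reasons this change may not deploy yet (empty list means clear)."""
    reasons = []

    # Senior sign-off on AI-assisted changes by junior and mid-level engineers.
    needs_senior = change.ai_assisted and change.author_level not in SENIOR_LEVELS
    has_senior = any(level in SENIOR_LEVELS for level in change.approver_levels)
    if needs_senior and not has_senior:
        reasons.append("AI-assisted change by a junior or mid-level author needs senior sign-off")

    # More extensive documentation before deploying critical changes.
    if change.tier1 and not change.has_documentation:
        reasons.append("critical change needs documentation before deployment")

    return reasons


unreviewed = Change(author_level="junior", ai_assisted=True, tier1=True)
reviewed = Change(author_level="junior", ai_assisted=True, tier1=True,
                  approver_levels=["senior"], has_documentation=True)

print(blocking_reasons(unreviewed))
print(blocking_reasons(reviewed))
```

Returning a list of reasons rather than a single yes/no keeps the "controlled friction" visible: an engineer sees exactly which checkpoint is holding the change back.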

Amazon's Defense

Amazon has disputed the characterization that AI tools were the primary cause of the outage wave. The company's public position is that only one of the recent incidents actually involved AI tools, and that the issue in that case was "user error" - an engineering team's misconfiguration - rather than flawed AI output.

This is the same rhetorical playbook Amazon used after the December 2025 Kiro incident, when the company's own AI coding agent deleted and recreated an AWS environment, causing a 13-hour outage. In that case, too, Amazon argued the root cause was "misconfigured access controls - not AI." And in that case, too, Amazon simultaneously implemented new mandatory review processes specifically for AI-assisted changes.

There's a pattern within the pattern. The company says the AI isn't the problem, then implements controls specifically designed to catch AI-related problems.

The Productivity Paradox Hits Home

Amazon has been aggressively pushing AI coding tool adoption internally. Engineers have reported internal pressure to use AI assistants, with employee accounts describing situations where AI-generated code was deployed with insufficient review. Some internal descriptions paint a picture that borders on absurdist: "on-calls using AIs to fight each other's AIs in a proxy war of blame" - a situation where AI tools deployed by different teams generate conflicting changes, and the humans end up troubleshooting the AIs' work rather than doing their own.

The promise of AI coding tools has always been productivity. Write code faster, ship features sooner, do more with fewer engineers. Amazon's March 2026 experience is a live stress test of that promise at a company with the engineering resources and infrastructure to make AI-assisted development work if anyone can.

The results suggest that the productivity gains are real - AI coding tools do generate code faster - but the failure modes are different from those of human-written code, and existing review processes weren't designed to catch them. In effect, the 90-day code safety reset is Amazon's admission that it shipped AI-generated code into production faster than its quality controls could keep up.

The Numbers in Context

To appreciate the scale of these outages, consider that Amazon's global e-commerce operation generated approximately $575 billion in revenue in 2025. A six-hour outage with a 99% drop in orders is not just an engineering incident; it's a financial event. A back-of-napkin calculation based on Amazon's average daily revenue suggests the March 5 outage alone may have cost the company somewhere north of $100 million in lost sales, though an unknown portion of those orders were likely recovered when the site came back online.
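The back-of-napkin calculation can be sketched as follows. The revenue figure, outage length, and order drop come from the numbers above; the assumption that revenue arrives evenly across the year, and the affected_share and recovered_share values in the second scenario, are purely illustrative guesses and not reported figures.

```python
# Inputs quoted from the article.
ANNUAL_REVENUE_USD = 575e9   # approximate 2025 e-commerce revenue
OUTAGE_HOURS = 6             # approximate length of the March 5 outage
ORDER_DROP = 0.99            # reported drop in orders during the outage


def outage_loss_estimate(annual_revenue, outage_hours, order_drop,
                         affected_share=1.0, recovered_share=0.0):
    """Estimate lost sales, assuming revenue is spread evenly over the year.

    affected_share: fraction of global revenue exposed to the outage (assumption).
    recovered_share: fraction of lost orders placed later anyway (assumption).
    """
    hourly_revenue = annual_revenue / (365 * 24)
    gross_loss = hourly_revenue * outage_hours * order_drop * affected_share
    return gross_loss * (1 - recovered_share)


# Upper bound: every dollar of global revenue affected, nothing recovered.
upper_bound = outage_loss_estimate(ANNUAL_REVENUE_USD, OUTAGE_HOURS, ORDER_DROP)

# Illustrative scenario: 60% of revenue affected, half of the lost orders recovered.
scenario = outage_loss_estimate(ANNUAL_REVENUE_USD, OUTAGE_HOURS, ORDER_DROP,
                                affected_share=0.6, recovered_share=0.5)

print(f"upper bound: ${upper_bound / 1e6:,.0f} million")   # roughly $390 million
print(f"scenario:    ${scenario / 1e6:,.0f} million")      # roughly $117 million
```

Even with fairly conservative guesses for the two unknowns, the estimate stays above $100 million, which is why that figure works as a floor rather than a ceiling.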

The third-party seller ecosystem adds another dimension. Millions of small businesses depend on Amazon's marketplace for their revenue. A six-hour outage for Amazon is a six-hour outage for every seller on the platform - and unlike Amazon, most of those sellers don't have the cash reserves to absorb an unexpected day of zero sales.

The Bigger Question

Amazon's March 2026 outage wave is significant not just for what it says about Amazon, but for what it says about the state of AI-assisted development across the industry. If Amazon - with its engineering depth, its own AI tools, its scale-tested infrastructure - can't manage the transition to AI-assisted development without suffering a series of outages that cost millions of orders, what does that imply for every other company adopting the same tools with fewer resources and less rigorous processes?

The 90-day code safety reset will likely stabilize Amazon's deployment pipeline. The more interesting question is what comes after. Amazon isn't going to stop using AI coding tools. The financial incentive to automate code generation is too large, and the competitive pressure to ship features faster is too intense. The question is whether the guardrails it is building during this reset - senior review requirements, controlled friction, stricter documentation - will be sufficient when the reset period ends.

History suggests they'll be sufficient until the next time they aren't. The December 2025 Kiro incident led to mandatory peer review. The March 2026 outages led to a 90-day code safety reset. Each incident produces new controls, and each period of stability produces new pressure to move faster. It's a cycle that's likely to repeat, because the tension between AI-assisted productivity and production reliability isn't a problem you solve once. It's a tension you manage continuously, and Amazon is learning how to manage it in public, at a cost measured in millions of lost customer orders.

Vibe Graveyard