METR study finds experienced developers were 19% slower with AI tools

A productivity study with teeth

METR's July 2025 developer productivity study was irritating in the most useful way: it measured actual task completion time instead of asking people whether the tool felt helpful.

The setup was narrow and unusually concrete. Sixteen experienced open-source developers worked on 246 real issues in mature repositories they already knew well. On average, the developers had about five years of prior experience with the projects they were modifying. Each issue was randomly assigned to one of two conditions: AI tools allowed, or generative AI assistance disallowed. When AI was allowed, developers mostly used Cursor Pro with Claude 3.5 Sonnet or Claude 3.7 Sonnet, which were frontier tools at the time.

Before the work began, the developers predicted AI would reduce completion time by 24%. After the study, they still believed AI had reduced their completion time by 20%.

The measured result disagreed. When AI tools were allowed, tasks took 19% longer.

That is the kind of finding that annoys everyone. AI skeptics are tempted to inflate it into "AI coding tools are useless." AI boosters are tempted to explain it away as a weird sample, unfamiliar tool usage, or an artifact of old models. METR did the more responsible thing and spent a lot of the paper explaining what the result does not prove.

It does not prove AI slows most developers or beginners, and it does not prove AI slows greenfield app work, toy projects, boilerplate generation, or domains outside software engineering. It specifically measured experienced developers working on real issues in large, familiar open-source projects during the February to June 2025 frontier-model window.

That is still a meaningful slice of reality, because plenty of professional software work looks much closer to "modify this mature codebase without breaking it" than to "generate a new demo app from a blank folder."

A perception gap

The most useful part of the study may be the mismatch between measured time and perceived time. The developers did not merely expect AI to help before the study. After living through the slower condition, they still thought it had helped.

That is a product-management hazard disguised as developer sentiment. If an organization measures AI impact mainly by surveys, it can get a positive signal while delivery slows down. Developers may feel less stuck, less bored, or less mentally taxed. Those benefits are real enough to matter. But they are not the same thing as finishing work faster.
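To make the gap concrete, here is a small arithmetic sketch in Python. The 60-minute baseline is a made-up number for illustration, and the function and variable names are hypothetical. The percentages are the study's headline figures: the 24% forecast, the 20% after-the-fact estimate, and the 19% measured increase.

```python
# Illustrative arithmetic only. The 60-minute baseline is an assumption;
# the percentages are the headline figures reported for the study.

BASELINE_MINUTES = 60.0  # hypothetical time for a task without AI


def minutes_with_change(baseline: float, pct_change: float) -> float:
    """Return task time after a percentage change in completion time.

    A negative pct_change means the task got shorter, a positive one longer.
    """
    return baseline * (1.0 + pct_change / 100.0)


forecast = minutes_with_change(BASELINE_MINUTES, -24.0)   # predicted before the study
perceived = minutes_with_change(BASELINE_MINUTES, -20.0)  # believed after the study
measured = minutes_with_change(BASELINE_MINUTES, +19.0)   # what the clock showed

# Gap between what developers believed and what was measured.
perception_gap = measured - perceived

print(f"forecast:  {forecast:.1f} min")
print(f"perceived: {perceived:.1f} min")
print(f"measured:  {measured:.1f} min")
print(f"gap:       {perception_gap:.1f} min")
```

Under that assumed baseline, a task that would take an hour unaided feels like it took about 48 minutes and actually took about 71. A survey would record the first number; a delivery schedule has to live with the second.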

METR found several plausible reasons for the slowdown. Developers spent time prompting, waiting for outputs, reviewing suggestions, and cleaning up code that was close enough to demand attention but wrong enough to require correction. TechRadar's coverage noted that only a minority of AI-generated suggestions were accepted, and developers spent nontrivial time correcting outputs. That is the productivity trap: bad code is easy to reject, but almost-right code makes a senior developer stop, read, simulate intent, and decide whether the model got the hidden requirements right.

In mature repositories, hidden requirements are everywhere. The right change may depend on a naming convention, a historical bug, a private assumption in the test suite, or a release process that nobody wrote down because the team learned it in production at 2:00 a.m. An AI assistant can be useful in that environment, but it does not automatically inherit the maintainer's scars.

What the study does not prove

The METR study should be treated carefully. Sixteen developers is a small sample. The participants worked on open-source projects, not a representative spread of corporate engineering work. The tools and models were early-2025 versions, and tool familiarity varied. Future models may do better. Teams that redesign their workflow around AI may get different results from developers using AI as an optional assistant inside existing habits.

The value of the study is not that it hands managers a universal conversion factor for AI coding; it is that it broke the lazy equation between "developers feel faster" and "delivery is faster." In this setting, the subjective readout pointed one way and the clock pointed the other.

That matters because a lot of AI adoption programs are built on sentiment, demo speed, usage dashboards, and lines of code. METR used a harsher instrument: how long did it take to finish the issue?
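A minimal sketch of that kind of instrument, in Python, assuming a team logs elapsed hours per issue and assigns each issue to a condition at random before work starts. This is not METR's analysis code or statistical model. The issue IDs, hours, and function names are invented for illustration, and a ratio of geometric means is just one simple way to summarize the comparison.

```python
import random
from statistics import geometric_mean

CONDITIONS = ("ai_allowed", "ai_disallowed")


def assign_conditions(issue_ids, seed=0):
    """Randomly assign each issue to one of the two conditions."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return {issue: rng.choice(CONDITIONS) for issue in issue_ids}


def slowdown_ratio(hours_by_issue, condition_by_issue):
    """Ratio of typical AI-allowed time to typical AI-disallowed time.

    A value above 1.0 means AI-allowed issues took longer. The geometric
    mean is used so a few very long tasks do not dominate the comparison.
    """
    allowed = [h for i, h in hours_by_issue.items()
               if condition_by_issue[i] == "ai_allowed"]
    disallowed = [h for i, h in hours_by_issue.items()
                  if condition_by_issue[i] == "ai_disallowed"]
    if not allowed or not disallowed:
        raise ValueError("need at least one issue in each condition")
    return geometric_mean(allowed) / geometric_mean(disallowed)


# Made-up example data. Conditions are written out by hand here so the
# example is deterministic; in practice they would come from assign_conditions.
conditions = {"A": "ai_allowed", "B": "ai_disallowed",
              "C": "ai_allowed", "D": "ai_disallowed"}
hours = {"A": 3.0, "B": 2.0, "C": 1.5, "D": 1.0}

ratio = slowdown_ratio(hours, conditions)
print(f"AI-allowed issues took {ratio:.2f}x as long")
```

The design choice that matters is the order of operations: the condition is fixed by a coin flip before anyone knows how the issue will go, and the only thing compared afterward is the clock. With a handful of issues the ratio is mostly noise, so a sketch like this is a starting point for measurement, not a verdict.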

Why it belongs here

Vibe Graveyard does not need another opinion piece saying AI coding is good or bad. METR's study belongs because it documents a specific systemic failure pattern: teams can adopt AI coding tools, feel faster, and still move slower on real maintenance work.

The harm here is not dramatic enough for "Catastrophic." Nobody lost a database. Nobody shipped a chatbot that invented a refund policy. The blast radius is subtler: engineering leaders may mistake perceived speed for actual throughput and then restructure planning, staffing, and review expectations around a productivity gain that is not there.

That is how a local tool benefit turns into an organizational mess. Developers feel assisted. Managers see adoption. Roadmaps get more aggressive. Review queues fill up. The work still has to be understood, tested, integrated, and maintained by humans who now have an extra pile of plausible suggestions to inspect.

METR's result may age as models improve. It probably should age; tools this young should get better. But the measurement lesson will not age out. If the claim is "AI makes us faster," then the proof has to be elapsed time on real work, not a survey asking whether the tab completion felt nice.
