Study finds 69 vulnerabilities across apps built by five leading AI coding tools

Israeli security startup Tenzai tested five of the most popular AI coding tools - Claude Code, OpenAI Codex, Cursor, Replit, and Devin - by having each build three identical test applications. The resulting 15 applications contained 69 total vulnerabilities, including several rated critical. While most tools handled basic SQL injection, they consistently failed against less obvious attack patterns, including "reverse transaction" exploits that allowed users to set negative refund quantities to receive money, and flaws that exposed customer information through predictable API endpoints, broken authorization logic, and insecure default configurations.

Incident Details

Perpetrator: AI coding assistant
Severity: Facepalm
Blast Radius: Industry-wide implications for applications built with popular AI coding tools; 69 vulnerabilities found across 15 test applications, including critical authorization and business logic flaws

The Test

In December 2025, Israeli security startup Tenzai set out to answer a question that the vibe coding movement would rather not hear: if you hand the same specification to the most popular AI coding tools and ask each one to build a real application, how secure is the output?

The methodology was straightforward. Tenzai selected five of the most widely used AI coding tools on the market: Anthropic's Claude Code, OpenAI's Codex, Cursor, Replit, and Devin. Each tool was given identical specifications to build three test applications. Fifteen apps total, five tools, same requirements, no human code review after generation. Just raw AI output, deployed as-is.

The results, published in January 2026, were not encouraging. Across the 15 applications, Tenzai's researchers found 69 vulnerabilities, with several rated critical. That's an average of 4.6 vulnerabilities per application, which is roughly the count you'd expect from a junior developer working without a security-aware mentor - except these tools are marketed as the future of professional software development.

What Broke

The most interesting finding wasn't what the AI coding tools got wrong. It was the gap between what they got right and what they missed, because that gap reveals a specific pattern in how these tools reason about security.

Most of the tools handled basic SQL injection reasonably well. This is the textbook introductory security vulnerability - the one that appears in every web security 101 course, the one that has thousands of examples in training data. If your AI coding tool can't prevent basic SQL injection, you have a training data problem so severe that the tool probably shouldn't exist.

But move one step beyond the textbook examples, and the defenses fell apart.

Reverse transaction exploits were the most colorful finding. Multiple tools generated e-commerce applications where a user could set a negative quantity on a refund, effectively reversing the flow of money. Set the quantity to -1 and a $50-per-item refund becomes a refund of -$50, which an unguarded payment flow processes as an extra $50 credited to the customer on top of any legitimate refund. It's the kind of business logic flaw a human developer might catch during code review, because catching it requires understanding what the application is supposed to do, not just whether the code compiles cleanly.
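The bug fits in a few lines. A minimal sketch in Python (Tenzai didn't publish the generated code or its language, so `process_refund` and its guard here are illustrative, not the actual test application):

```python
def process_refund_vulnerable(unit_price: float, quantity: int) -> float:
    # The AI-generated pattern: syntactically clean, no sanity check.
    # A quantity of -1 flips the sign of the refund, reversing the
    # direction of the money flow.
    return unit_price * quantity


def process_refund_fixed(unit_price: float, quantity: int,
                         purchased_quantity: int) -> float:
    # Business-logic guard: a refund quantity must be a positive
    # integer no larger than what was originally purchased.
    if quantity <= 0 or quantity > purchased_quantity:
        raise ValueError("refund quantity out of range")
    return unit_price * quantity
```

Nothing about the vulnerable version is wrong at the syntax level, which is exactly why a compiler, a linter, and an LLM all wave it through.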

Broken authorization showed up across multiple tools. Applications were generated where authentication (proving who you are) was treated as a substitute for authorization (proving what you're allowed to do). A logged-in user could access other users' data, modify records they shouldn't have access to, or invoke administrative functions by simply calling the right API endpoint. The tools understood that applications need login screens. They did not consistently understand that logging in as User A shouldn't let you read User B's data.
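The distinction can be sketched in a few lines. This hypothetical Python fragment (the record store and function names are invented for illustration) shows the difference between knowing who the caller is and checking what they may touch:

```python
# Toy data store: each record carries an owner.
RECORDS = {
    101: {"owner": "alice", "data": "alice's invoice"},
    102: {"owner": "bob", "data": "bob's invoice"},
}


def get_record_vulnerable(current_user: str, record_id: int) -> dict:
    # The AI-generated pattern: the user is authenticated (we know
    # who they are), so any record they ask for is returned.
    return RECORDS[record_id]


def get_record_fixed(current_user: str, record_id: int) -> dict:
    record = RECORDS[record_id]
    # Authorization: ownership is checked on every access, not just
    # identity at login.
    if record["owner"] != current_user:
        raise PermissionError("not your record")
    return record
```

The fix is one `if` statement, but it's an `if` statement the model has to know the application needs, because nothing in the request syntax demands it.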

Predictable API endpoints allowed attackers to enumerate through resources by incrementing IDs in URLs - the classic Insecure Direct Object Reference (IDOR) pattern. Change /api/users/42 to /api/users/43 and you're looking at someone else's profile. This is another vulnerability that appears extensively in security training materials, but apparently not extensively enough in the codebases these tools were trained on.
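The enumeration attack is trivially mechanical, which is what makes sequential IDs so dangerous. A hedged sketch (the ID scheme and names are illustrative; note that random IDs only remove the enumeration shortcut, and the ownership check from the previous section is still required):

```python
import uuid

# Sequential integer IDs: every other user's resource is one
# increment away.
users_by_int = {42: "alice", 43: "bob"}


def enumerate_ids(start: int, count: int) -> list:
    # The IDOR attack is just a loop over adjacent integers.
    return [i for i in range(start, start + count) if i in users_by_int]


# Harder target: keys drawn from a 128-bit random space cannot be
# guessed by incrementing.
users_by_uuid = {uuid.uuid4().hex: name for name in ("alice", "bob")}
```

Against the integer scheme, `enumerate_ids(1, 1000)` walks the whole user table; against the UUID scheme, the same loop finds nothing.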

Insecure default configurations rounded out the findings. Applications were deployed with debug modes enabled, error messages that leaked internal system details, and permissive CORS policies that allowed any website to make API requests to the application.
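The CORS finding in particular comes down to one response header. A hypothetical sketch of the contrast (the config keys and helper are invented for illustration; the study didn't publish the frameworks involved):

```python
# The insecure default pattern: debug output for everyone, and
# Access-Control-Allow-Origin: * lets any website's JavaScript
# call the API.
INSECURE_DEFAULTS = {
    "DEBUG": True,
    "CORS_ALLOW_ORIGIN": "*",
}


def cors_headers(request_origin: str, allowed_origins: set) -> dict:
    # Hardened alternative: echo the origin back only if it is on an
    # explicit allow-list; otherwise send no CORS header at all.
    if request_origin in allowed_origins:
        return {"Access-Control-Allow-Origin": request_origin}
    return {}
```

A permissive wildcard is the path of least resistance during development, which is presumably why it dominates the training data, and why it keeps showing up in generated output.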

Why the Training Data Gap Matters

The pattern in Tenzai's findings is consistent with a structural limitation of AI coding tools that the CodeRabbit study also documented: these tools are trained primarily on public code repositories, and public code repositories contain an enormous volume of insecure code alongside secure code.

For well-known vulnerability classes like SQL injection, the training data contains enough examples of both the vulnerable pattern and the secure pattern that the AI has learned to generate the secure version most of the time. The fix for SQL injection - parameterized queries - is so universally recommended that it dominates the training signal.
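The dominant secure pattern is easy to see side by side. A self-contained Python example using the standard library's `sqlite3` (the table and queries are illustrative, not from the study):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")


def lookup_vulnerable(name: str) -> list:
    # String interpolation: the input becomes part of the SQL itself.
    # Passing "' OR '1'='1" turns the WHERE clause into a tautology
    # and dumps every row.
    return conn.execute(
        f"SELECT secret FROM users WHERE name = '{name}'").fetchall()


def lookup_parameterized(name: str) -> list:
    # Placeholder binding: the driver treats the input strictly as
    # data, never as SQL, so the injection string matches nothing.
    return conn.execute(
        "SELECT secret FROM users WHERE name = ?", (name,)).fetchall()
```

Because the parameterized form is what virtually every tutorial, framework doc, and linter recommends, it's also what the models have learned to emit by default.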

But for business logic vulnerabilities, there's much less training signal. A "reverse transaction" exploit isn't a pattern that appears in vulnerability databases or security tutorials. It's a domain-specific flaw that requires understanding the intent of the application, not just the syntax of the code. The AI sees "process refund for quantity X" and generates clean, syntactically correct code that processes refunds for any value of X, including negative values. Nothing in the training data told it that negative quantities are nonsensical in a refund context.

Authorization logic has a similar problem. The training data contains countless examples of authentication implementations - login forms, password hashing, session management. It contains far fewer examples of fine-grained authorization checks that enforce "user A can only access user A's resources." Authentication is a pattern. Authorization is a policy, and policies are specific to each application.

The Five-Tool Comparison

Tenzai declined to publish a full ranked comparison of the five tools, which is diplomatically convenient and scientifically understandable - the sample size of three applications per tool isn't enough for statistically robust tool-by-tool rankings. But the aggregate data tells a consistent story: no tool was free of critical vulnerabilities, and the types of vulnerabilities found were similar across tools.

This is actually the more alarming finding. If one tool had dramatically outperformed the others, the conclusion would be "use the better tool." Instead, the conclusion is that the vulnerability patterns are a property of the approach - LLM-based code generation from training data - rather than a property of any specific implementation. Switching from one AI coding tool to another doesn't solve the problem. It just shuffles which specific vulnerabilities you get.

Industry Context

Tenzai's study landed in a market that was simultaneously accelerating its adoption of AI coding tools and discovering the consequences of that adoption. By January 2026, the vibe coding movement had produced its own micro-genre of disaster stories. The Moltbook incident exposed 1.5 million authentication tokens from a vibe-coded social network. The Lovable edtech exposure leaked data from 18,000 users of a showcased application. Base44's auth bypass gave attackers access to every app built on the platform.

What Tenzai added to this growing body of evidence was controlled comparison. The earlier incidents were case studies - individual applications that happened to be insecure. Tenzai's methodology showed that the insecurity isn't a matter of individual developer negligence. It's a predictable output of the tools themselves, reproduced consistently across the leading platforms.

Tenzai, for its part, has a commercial interest in these findings. The company, founded in 2023, raised $75 million in seed funding in November 2025 to build an AI-powered continuous penetration testing platform specifically targeting vulnerabilities introduced by AI-generated code. Their business model relies on the premise that AI coding tools produce insecure code that needs automated security testing. Their study confirms that premise.

This doesn't invalidate the findings - the methodology was straightforward and the vulnerabilities are real - but it's worth noting that the entity publishing "AI coding tools produce insecure code" also sells the product designed to scan AI-coded applications for security flaws. The incentives align a little too neatly to ignore.

What the Study Means

Sixty-nine vulnerabilities across 15 applications is a data point, not a verdict. The sample is too small for definitive conclusions about any individual tool, and test applications may not fully represent real-world development where developers interact with and modify AI-generated code rather than deploying it untouched.

But the study adds to a growing evidence base that AI coding tools, as currently implemented, systematically produce categories of vulnerabilities that human developers are less likely to introduce - not because humans are better programmers, but because certain classes of security flaws require contextual understanding that pattern-matching from training data doesn't provide.

The practical implication is straightforward: AI-generated code needs security review, and that review needs to specifically target the vulnerability patterns that AI tools are known to produce. Standard code review practices - the kind designed to catch the kinds of bugs humans write - may not catch the kinds of bugs AI writes, because the failure modes are different.

For the vibe coding movement, which has positioned "just let the AI write it" as a feature rather than a risk, sixty-nine vulnerabilities across fifteen applications is an inconvenient number. Not because it proves AI coding tools are unusable, but because it quantifies the cost of the "vibe" in vibe coding: roughly 4.6 security vulnerabilities per application, including some that would let users steal money.