Every AI model fails security test across 31 coding scenarios

Armis Labs tested 18 leading generative AI models across 31 security-critical code generation scenarios and found a 100% failure rate - not one model could consistently produce secure code. In 18 of those 31 challenges, every single model generated code containing Common Weakness Enumeration vulnerabilities. The best performer, Gemini 3.1 Pro, still produced OWASP Top 10 flaws in nearly 39% of scenarios. Older proprietary models fared worse, and the report found no correlation between price and security. The "Trusted Vibing Benchmark" dropped the same week enterprises were mandating AI-assisted development at scale, which is either very good timing or very bad timing depending on your relationship to a production deployment.

Incident Details

Severity: Facepalm
Company: Industry-wide (18 AI models tested by Armis Labs)
Perpetrator: Developer
Incident Date: March 23, 2026
Blast Radius: Industry-wide; every major AI code generation model tested produces security vulnerabilities at scale, with implications for any organization using AI-assisted development in production

The Benchmark

On March 23, 2026, cybersecurity firm Armis released the "Trusted Vibing Benchmark" - a name that manages to be both extremely 2026 and uncomfortably accurate. Armis Labs tested 18 of the most widely used generative AI models against 31 code generation scenarios designed to probe security-critical functionality. The scenarios covered the kinds of features that actually matter in production software: authentication systems, file uploads, memory buffer handling, input validation, and access control.

The methodology focused on "atomic" features - individual functions or components rather than full applications - so each model was being evaluated on its ability to write one secure function at a time. Not a whole app. Not an architecture. Just: "write this one thing, and don't introduce a vulnerability." The bar was deliberately low.

Not one of the 18 models cleared it consistently.

The Numbers

The headline finding: 100% of the 18 models tested failed to consistently generate secure code across the scenarios. Not "most models struggled." Not "some models had issues in edge cases." All of them. Every one. Across all 31 scenarios, not a single model could reliably produce code free of Common Weakness Enumeration (CWE) vulnerabilities.

Breaking it down further: in 18 of the 31 specific challenges, every single model produced code containing CWE vulnerabilities. That's 58% of scenarios where the failure was unanimous. The "universal blind spots" clustered around the areas that matter most for security: memory buffer overflows, file upload handling, and authentication logic.

These aren't exotic vulnerability classes. Buffer overflows have been documented since the 1970s. SQL injection prevention has been taught in introductory security courses for two decades. Broken authentication has appeared on the OWASP Top 10 list since it debuted in 2003. AI models that can write poetry, pass bar exams, and generate photorealistic images apparently still can't reliably remember to validate user input before passing it to a database query.

The Leaderboard

Not all models failed equally, which might be comforting if you squint hard enough. Gemini 3.1 Pro came out on top, posting the lowest combined rate of OWASP Top 10 and Armis Early Warning CWE vulnerabilities at 38.71%. It was also the only model that completely avoided generating what Armis called "compounding security failures" - scenarios where multiple vulnerabilities stacked together to create catastrophic exposure.

A 38.71% vulnerability rate in the best-performing model means that roughly two out of every five security-critical code generation tasks produced flawed output. For the best model. The models trailing behind Gemini 3.1 Pro posted significantly higher vulnerability counts.

Older proprietary models, including versions of Claude Sonnet 4.5, Claude Haiku 4.5, and Gemini 2.5 Pro, were singled out as presenting "severe security risks" - higher vulnerability counts, more frequent critical flaws, and a greater tendency to produce the kind of compounding failures that turn a single bug into a full compromise. The report specifically noted that these models lacked "baseline security guardrails" in their code generation output.

One finding that should make procurement teams uncomfortable: price had nothing to do with it. The report found no correlation between model cost and security performance. Some low-cost open-source models outperformed expensive proprietary alternatives on security metrics. Paying more for your AI coding assistant doesn't mean it writes more secure code - it just means you paid more for code that still has OWASP Top 10 vulnerabilities in it.

The Vulnerability Classes

The types of vulnerabilities the models produced are the same ones that have been documented in AI-generated code by multiple previous studies, but the Armis benchmark provides granularity that earlier research lacked.

Cross-Site Scripting (CWE-79) and SQL Injection (CWE-89) appeared consistently across models. These are the two most basic categories of web application vulnerability - the ones that every security training course covers in the first week, the ones that every web framework has built-in protections for. AI models generate code that bypasses those protections because the models don't understand why the protections exist; they just predict tokens that are statistically likely to follow the prompt.
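The report doesn't reproduce the flawed output, but the CWE-89 pattern it describes is easy to illustrate. A minimal sketch (function names are mine, using Python's built-in `sqlite3` and an in-memory database) contrasting the string-built query models keep emitting with the parameterized form every driver supports:

```python
import sqlite3

# In-memory database with one row, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_vulnerable(name: str):
    # CWE-89: user input concatenated straight into the SQL string.
    # A "name" like ' OR '1'='1 matches every row in the table.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver treats the input as data, not SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
print(find_user_vulnerable(payload))  # leaks every row
print(find_user_safe(payload))        # returns nothing
```

The safe version is not more work than the vulnerable one, which is what makes the benchmark result so damning: the secure token sequence was always equally available to predict.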

Authentication and session management flaws showed up with alarming frequency. Models would generate login functions that stored passwords in plaintext, session handling that didn't expire tokens, or access control checks that could be bypassed by manipulating URL parameters. These are the vulnerabilities that lead to "Company X exposes 200 million user records" headlines.
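The fixes for the two flaws described above are standard library material. A hedged sketch (the helper names and the TTL are mine, not from the report) of salted password hashing with PBKDF2 and session tokens that actually expire:

```python
import hashlib
import hmac
import os
import secrets
import time

def hash_password(password: str) -> tuple[bytes, bytes]:
    # Never store the plaintext: derive a salted PBKDF2-HMAC-SHA256 hash.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, digest)  # constant-time compare

# Session tokens with an expiry, the piece the benchmark found missing.
SESSION_TTL = 3600  # seconds; illustrative value
sessions: dict[str, float] = {}

def create_session() -> str:
    token = secrets.token_urlsafe(32)
    sessions[token] = time.time() + SESSION_TTL
    return token

def session_valid(token: str) -> bool:
    expiry = sessions.get(token)
    return expiry is not None and time.time() < expiry
```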

The memory buffer overflow findings are particularly concerning for code generated in languages like C and C++, where buffer management is manual. A buffer overflow in a web application written in Python is usually caught by the runtime. A buffer overflow in a C application can be a remote code execution vulnerability. AI models generating systems-level code with buffer handling errors are producing exactly the kind of vulnerability that attackers have been exploiting since the days of the Morris worm in 1988.
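The Python-versus-C contrast above can be shown directly. In this sketch (the function is mine, for illustration), Python's runtime converts the out-of-bounds write into an exception, where the equivalent C assignment would silently corrupt adjacent memory:

```python
# Python: an out-of-bounds write is caught by the runtime as an exception,
# turning a would-be memory-corruption bug into a crash you can handle.
def write_at(buf: bytearray, index: int, value: int) -> bool:
    try:
        buf[index] = value
        return True
    except IndexError:
        # In C, buf[index] = value here would silently scribble over
        # adjacent memory: the classic CWE-121/122 stack or heap overflow.
        return False

buffer = bytearray(8)
assert write_at(buffer, 7, 0xFF) is True   # last valid byte
assert write_at(buffer, 8, 0xFF) is False  # one past the end: rejected
```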

The Perception Gap

Armis paired the benchmark with data from their 2026 Cyberwarfare Report, which surveyed global IT decision-makers. The numbers tell a story about collective denial. According to the survey, 77% of IT decision-makers trust the integrity of third-party code used in critical applications. Meanwhile, 16% admitted they don't actually know whether that code has been thoroughly checked for vulnerabilities.

Let those two numbers sit next to each other for a moment. More than three-quarters of decision-makers trust code they're deploying in critical systems. Nearly one in five don't know if anyone has actually checked it for security flaws. That was the state of affairs before AI code generation started producing vulnerabilities at the rate Armis documented.

The report calls this the "dangerous perception gap" - the distance between how secure organizations believe their AI-generated code is and how secure it actually is. When every model in the market produces CWE vulnerabilities in a majority of security-critical scenarios, and most organizations aren't checking the output, the gap between perceived and actual security is not a gap. It's a canyon.

What This Means for Vibe Coding

The timing of the Trusted Vibing Benchmark is significant. By March 2026, AI-assisted development had moved from "interesting experiment" to "company-wide mandate" at multiple major technology firms. Amazon required 80% of its engineers to use its AI coding tools. Microsoft reported that 30% of code in some repositories was AI-generated. Vibe coding platforms aimed at non-developers were generating production applications wholesale.

Armis's data suggests that this adoption has outpaced the development of security safeguards. Nadir Izrael, Armis's CTO and co-founder, stated in the report: "The era of vibe coding is here, but speed should not come at the cost of security." The benchmark was designed to provide concrete data that organizations could use to evaluate which models they should trust with which kinds of code generation tasks, and under what conditions.

The answer the benchmark gives is: none of them, unconditionally, for security-critical code. Every model needs human review for security-sensitive output. This isn't news to security professionals, who have been saying this since AI coding assistants first appeared. But the Armis benchmark provides the quantitative evidence - 18 models, 31 scenarios, 100% failure rate - that transforms the warning from "we think this might be a problem" to "we tested it and it is definitely a problem, here are the CWE numbers."

The Accumulation Problem

The Trusted Vibing Benchmark arrives alongside a growing pile of studies documenting AI code security failures. CodeRabbit's December 2025 study found AI-generated code has 2.74 times more security vulnerabilities than human-written code. Veracode's 2025 report found 45% of AI-generated code contains security flaws. Tenz.AI's research documented specific vulnerability patterns across coding assistants.

Each study individually is concerning. Taken together, they form a consistent picture: AI code generation models do not produce secure code at rates acceptable for production use without human security review. The Armis benchmark extends this finding by testing the newest generation of models and finding that the fundamental problem has not been solved. Models have gotten better at writing syntactically correct, functionally complete code. They have not gotten correspondingly better at writing secure code.

The Armis report recommends that organizations implement what it calls "AI-native application security controls" - security scanning and review processes designed specifically for the patterns of vulnerability that AI-generated code produces, rather than retrofitting traditional code review processes that were designed for human-authored code. Whether organizations will actually implement those controls before the next breach traced to AI-generated code is another question. The benchmark data suggests they should hurry.
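The report doesn't specify what "AI-native application security controls" look like in practice. As a deliberately toy sketch of the idea (the rules, labels, and function are hypothetical; a real control would use a proper CWE-aware SAST tool, not regexes), a pattern check tuned to two flaws the studies above keep finding in AI output:

```python
import re

# Hypothetical, deliberately tiny rule set targeting patterns commonly
# flagged in AI-generated code. Illustrative only.
RULES = {
    "CWE-89 (possible SQL built by string formatting)":
        re.compile(r"""(execute|executemany)\s*\(\s*f?["'].*\{""", re.I),
    "CWE-798 (possible hard-coded credential)":
        re.compile(r"""(password|secret|api_key)\s*=\s*["'][^"']+["']""", re.I),
}

def scan(source: str) -> list[str]:
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in RULES.items():
            if pattern.search(line):
                findings.append(f"line {lineno}: {label}")
    return findings

snippet = (
    'cur.execute(f"SELECT * FROM users WHERE id = {user_id}")\n'
    'password = "hunter2"\n'
)
for finding in scan(snippet):
    print(finding)
```

The point of the sketch is the shape of the control, not the rules: a gate that runs on every AI-generated diff, keyed to the vulnerability classes the benchmarks show models actually produce.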

Discussion