Veracode tested AI-generated code from 100+ models and 45% of it failed security checks
Veracode's 2025 GenAI Code Security Report examined code output from more than 100 large language models across 80+ coding tasks and found that 45% of AI-generated code samples contained security vulnerabilities, including OWASP Top 10 flaws. Cross-Site Scripting had an 86% failure rate and Log Injection hit 88%. Java was the worst performer at over 70%. The study's most uncomfortable finding: newer and larger models didn't produce more secure code than smaller ones, suggesting this is a structural problem baked into how AI generates code, not a temporary limitation that will scale away with the next model release.
Veracode makes its living selling application security testing. Their business model depends on code having bugs, so their perspective on AI-generated code might seem self-serving. But their 2025 GenAI Code Security Report backed up its claims with a scale and methodology that's hard to dismiss: 80+ coding tasks, over 100 LLMs, and specific vulnerability classifications that point to structural problems rather than cherry-picked examples.
The numbers
Veracode gave more than 100 large language models a battery of 80+ coding tasks - the kind of things a developer might ask an AI to generate in a real working session. Then they ran the output through security testing. The headline result: 45% of the code samples introduced security vulnerabilities.
These weren't obscure, theoretical weaknesses that would require a state-sponsored attacker to exploit. They were OWASP Top 10 vulnerabilities - the well-documented, widely understood security flaws that every web application security course warns about in week one. Cross-Site Scripting (CWE-80, for the CWE enthusiasts) showed up in 86% of relevant code samples. Log Injection (CWE-117) hit 88%. SQL injection, hardcoded credentials, and insecure cryptographic practices all made regular appearances.
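To make the XSS case concrete, here's a minimal Python sketch (the report covered several languages; Python is just used for illustration here, and the `render_comment` helper is hypothetical). The vulnerable version interpolates raw user input into HTML; the fix is a one-line escape.

```python
import html


def render_comment_unsafe(user_input: str) -> str:
    # The classic CWE-80 pattern: raw input flows straight into markup,
    # so a <script> payload executes in the victim's browser.
    return f"<p>{user_input}</p>"


def render_comment(user_input: str) -> str:
    # Escaping first turns markup characters into inert entities.
    return f"<p>{html.escape(user_input)}</p>"


payload = "<script>alert(1)</script>"
safe = render_comment(payload)
```

Both versions compile, run, and render a comment - which is exactly why functional testing alone never catches the difference.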
When the AI was presented with a choice between a secure and insecure way to accomplish the same task, it chose the insecure option 45% of the time. Not because it was being adversarial or because the prompt was poorly worded, but because the insecure pattern was more common in its training data and therefore more likely to be generated.
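The secure/insecure fork is easiest to see with SQL. The sketch below (Python's built-in `sqlite3`, an in-memory table invented for the demo) shows both options side by side: string concatenation lets a crafted input rewrite the query, while a parameterized query treats the same input as plain data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

user_input = "nobody' OR '1'='1"

# Insecure pattern: the quote in user_input closes the string literal,
# and the injected OR clause matches every row in the table.
leaked = conn.execute(
    "SELECT secret FROM users WHERE name = '" + user_input + "'"
).fetchall()

# Secure pattern: the ? placeholder binds the whole input as one value,
# so the lookup for a nonexistent user correctly returns nothing.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (user_input,)
).fetchall()
```

Both statements are valid SQL and both calls succeed; only one of them leaks the table.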
Language-by-language breakdown
Java came out worst, with a security failure rate exceeding 70%. For a language that dominates enterprise software, bank backends, and healthcare systems, that number is particularly uncomfortable. The other major languages fared somewhat better but not well: Python, C#, and JavaScript all showed failure rates between 38% and 45%.
The language-specific patterns make sense when you consider what AI code generation actually does. The models learned to code by training on enormous repositories of existing code, and existing code is full of security vulnerabilities. Java's long history and massive codebases mean there's proportionally more insecure Java code in the training data for models to learn from. The AI doesn't understand that a particular pattern is insecure - it just knows the pattern is common.
The scaling problem that isn't a scaling problem
The study's most significant finding had nothing to do with any specific vulnerability. It was this: newer and larger models did not produce significantly more secure code than smaller, older ones.
In most areas of AI capability, scaling helps. Bigger models with more training data produce better results on benchmarks, write more coherent text, solve harder math problems. The assumption across the industry has been that code security would follow the same trajectory - that future models would write more secure code as they got larger and more capable.
Veracode's data suggests otherwise. The models aren't getting better at security because they're not getting better at understanding security. They're getting better at generating code that compiles and runs, which is a different skill entirely. A model can produce perfectly functional code that logs user input without sanitization, stores passwords in plaintext, or constructs SQL queries through string concatenation. The code works. It's also vulnerable.
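The "logs user input without sanitization" case is the CWE-117 flaw from the 88% figure above. A hedged sketch of the fix, using Python's standard `logging` module (the logger setup and attack string are invented for the demo): stripping CR/LF before logging prevents an attacker from forging extra log lines.

```python
import io
import logging

# A logger writing to an in-memory buffer so the demo is self-contained.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(message)s"))
logger = logging.getLogger("auth-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def log_failed_login(username: str) -> None:
    # CWE-117 fix: remove newlines so one event stays one log line.
    # Without this, the payload below would forge a fake second entry.
    clean = username.replace("\r", "").replace("\n", "")
    logger.info("failed login for %s", clean)


log_failed_login("alice\nadmin logged in")
```

The unsanitized version of this function is the one the training data teaches: it works, every test passes, and the vulnerability only matters when someone audits the logs.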
This finding undermines the common defense of AI code generation, which goes roughly: "Sure, the current models make security mistakes, but the next generation will be better." If security improvement doesn't correlate with model scaling - if a model with 100 billion parameters writes code that's roughly as vulnerable as a model with 10 billion parameters - then waiting for better models is not a security strategy.
The vibe coding factor
Veracode's report explicitly called out a practice it labeled "vibe coding": a workflow in which developers describe what they want in natural language, let the AI generate the code, and integrate the results without deeply reviewing them or specifying security requirements.

When a developer writes code manually, security decisions are explicit. You choose to use parameterized queries or string concatenation. You choose to hash passwords or store them in plaintext. You choose to validate input or trust it. When AI generates the code, those decisions are made by the model's statistical patterns, and whatever the training data did most often is what the model will reproduce.
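The hash-or-plaintext choice above is one of those explicit decisions. A minimal sketch of the secure side using only Python's standard library (the function names and iteration count are illustrative, not from the report; production code would typically reach for a dedicated library like bcrypt or argon2):

```python
import hashlib
import hmac
import os


def hash_password(password: str) -> tuple[bytes, bytes]:
    # Store a salted, slow key derivation - never the password itself.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(candidate, digest)


salt, digest = hash_password("hunter2")
```

The plaintext alternative is shorter, works identically from the application's point of view, and is well represented in public repositories - which is the whole problem.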
A developer who reviews AI-generated code with security in mind can catch these issues. A developer who accepts the output at face value - vibing with it - inherits whatever security posture the model's training data had, which Veracode's data suggests is not a posture anyone should be comfortable with.
The report found that fewer than half of developers review AI-generated code before committing it. For those developers, the 45% vulnerability rate isn't a warning about something that might happen if they're not careful. It's a description of what's already in their codebase.
What the training data teaches
The root cause is well-understood, even if it's hard to fix. AI models learn to write code by training on public code repositories. Public code repositories are full of security vulnerabilities. The OWASP Top 10 exists because these vulnerabilities are common in real-world code. The AI learns the common patterns. The common patterns include the vulnerable ones.
This creates a feedback loop: AI generates vulnerable code, developers deploy it, it enters public repositories, future models train on it, and the pattern repeats. Each generation of AI learns from code written by the previous generation of AI, and neither generation has a mechanism for distinguishing "this pattern is common because it's correct" from "this pattern is common because nobody bothered to fix it."
What it means for organizations
For organizations using AI code generation tools - which, by 2025, includes most software development teams to some degree - the Veracode report quantifies a risk that many had acknowledged in the abstract but hadn't measured. Roughly half the code your AI assistant produces has security flaws. Your developers might not be reviewing it before committing. And switching to a newer, bigger model won't fix the problem.
The recommended mitigations are predictable but worth stating: integrate security scanning into the CI/CD pipeline so vulnerable code is caught before it reaches production, require manual review of AI-generated code (especially security-sensitive components), and don't rely on the AI to make security decisions by default.
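To give a flavor of what a pipeline gate does, here is a deliberately toy static check - nothing like a real SAST product such as Veracode's, just a sketch of the idea. It walks a Python file's AST and flags `.execute()` calls whose SQL is built dynamically (f-string, concatenation, or %-formatting) rather than passed as a constant with placeholders. The function name is invented for this example.

```python
import ast


def flag_unparameterized_execute(source: str) -> list[int]:
    """Return line numbers where .execute() receives a dynamically
    built first argument instead of a constant SQL string."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
            # JoinedStr covers f-strings; BinOp covers "+" and "%".
            and isinstance(node.args[0], (ast.JoinedStr, ast.BinOp))
        ):
            hits.append(node.lineno)
    return hits


bad = 'cur.execute(f"SELECT * FROM t WHERE id = {uid}")'
good = 'cur.execute("SELECT * FROM t WHERE id = ?", (uid,))'
```

A check this crude misses plenty and false-positives on dynamic table names, which is exactly why the real mitigation is a proper scanner in CI, not a regex or a thirty-line script.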
None of this is new advice. Security teams have been saying these things since AI code generation became mainstream. What the Veracode report adds is the scale of evidence: not anecdotes, not theoretical arguments, but measured failure rates across 100+ models on standardized tasks. The AI writes vulnerable code 45% of the time. Whether that number changes anyone's behavior is a different question.