45% of AI-Generated Code Fails Security Checks: What the 2025–2026 Data Says

Across more than 100 large language models tested in production coding tasks, 45% of AI-generated code samples introduce OWASP Top 10 vulnerabilities, and that number has not improved in two years despite successive model releases claiming otherwise. Veracode's Spring 2026 GenAI Code Security Update is the most systematic dataset on this question to date, and its conclusion is uncomfortable for teams that have normalized AI code generation without adjusting their security posture.

This post assembles the primary evidence — vendor telemetry, peer-reviewed research, and secrets exposure data — to give engineering leaders a clear picture of where the risk actually sits and what controls meaningfully reduce it.

The Core Finding: Syntax Improved, Security Did Not

The divergence between functional correctness and security correctness is the most important pattern in the 2025–2026 data. Veracode's longitudinal testing shows syntax pass rates climbing from roughly 50% in 2023 to over 95% by early 2026. Security pass rates, by contrast, have remained flat at approximately 55% across the same period — an essentially unchanged failure rate regardless of model generation, parameter count, or provider.

That gap matters operationally. Code that compiles cleanly and passes functional tests gives reviewers a false confidence signal. The vulnerabilities are not syntax errors that surface during build — they are semantic flaws in access control, output encoding, and cryptographic implementation that require deliberate security analysis to find.

The data breaks down further by language and vulnerability class:

Source: Veracode Spring 2026 GenAI Code Security Update

The XSS (CWE-80) and log injection (CWE-117) numbers deserve attention: AI-generated samples fail these checks 85% and 87% of the time, respectively. These are not exotic edge cases. They are standard web application vulnerability classes that any junior developer is expected to know, and the models are generating output encoding errors at a rate that would fail a security training quiz.
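To make the two weakness classes concrete, here is a minimal Python sketch of the neutralization step that vulnerable samples typically omit. The function names (`render_greeting`, `safe_log_value`, `record_login_failure`) are illustrative and do not come from the Veracode report. The sketch assumes user input lands in HTML body text and in line-oriented text logs; other output contexts need their own encoding rules.

```python
import html
import logging

logger = logging.getLogger("app")


def render_greeting(username: str) -> str:
    """Return an HTML fragment with user input escaped (addresses CWE-80)."""
    # html.escape converts &, < and >; with quote=True it also converts
    # both quote characters.
    return f"<p>Hello, {html.escape(username, quote=True)}</p>"


def safe_log_value(value: str) -> str:
    """Neutralize CR/LF so input cannot forge extra log lines (CWE-117)."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


def record_login_failure(username: str) -> None:
    """Log a failed login with the user-controlled field neutralized."""
    logger.warning("login failed for user=%s", safe_log_value(username))
```

Both fixes are one-line calls, which is part of why their absence is easy to miss in review: the vulnerable and the safe version look almost identical and behave the same on benign input.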

Formal Verification Finds What SAST Misses

Standard static analysis tools substantially undercount the problem. A 2026 peer-reviewed study, "Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code" (arXiv, April 2026), applied formal verification methods using Z3 satisfiability witnesses to 3,500 code samples across seven leading models. The headline finding: 55.8% of artifacts contained at least one formally proven vulnerability — and industry SAST tools missed 97.8% of those verified findings.

No model tested achieved better than a D grade on security correctness. The best performer, Gemini 2.5 Flash, still had a 48.4% vulnerability rate. GPT-4o was the worst at 62.4%. These are not fringe models; they are the production-grade tools engineering teams use today.

A secondary finding from the same study is particularly relevant for code review workflows: models identified their own vulnerable outputs 78.7% of the time when asked to review them, yet 55.8% of the artifacts they generated by default contained a vulnerability. The capability for security reasoning exists in the model; it is not invoked during generation unless explicitly required. This is the core argument for mandatory security-focused review gates in the pull request pipeline, rather than relying on generation-time prompting alone.

Iterative Refinement Makes the Problem Worse

A 2025 peer-reviewed paper accepted at IEEE-ISTAS, "Security Degradation in Iterative AI Code Generation", tested 400 code samples across 40 rounds of AI-driven "improvements" using four distinct prompting strategies: efficiency-focused, feature-focused, security-focused, and ambiguous. The result challenges a common assumption: after just five iterations, critical vulnerabilities increased by 37.6%.

Even security-focused prompts did not reliably improve outcomes: only 27% of security-focused iterations produced net security improvements, and those gains were concentrated in the first three rounds. Beyond iteration three, security-focused prompting offered no consistent advantage over other strategies.

This has direct implications for agentic workflows where an AI coding agent iterates on its own output through multiple passes before opening a pull request. Each refinement pass is a fresh opportunity to introduce a vulnerability — and the PR that lands in review may have gone through ten or more such iterations with no human checkpoint in between.
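One way to bound that exposure is to cap the number of self-revision passes an agent may run before a human looks at the result. The sketch below is a hypothetical illustration, not the API of any particular agent framework: `refine` and `is_done` stand in for whatever the agent uses to revise code and decide it is finished, and the default cap of three is an assumption based on the finding that security gains were concentrated in the first three rounds.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed cap, taken from the "first three rounds" finding above.
MAX_UNREVIEWED_PASSES = 3


@dataclass
class RefinementResult:
    code: str
    passes_run: int
    needs_human_review: bool


def refine_with_checkpoint(
    code: str,
    refine: Callable[[str], str],    # hypothetical: one agent revision pass
    is_done: Callable[[str], bool],  # hypothetical: agent's stop condition
    max_passes: int = MAX_UNREVIEWED_PASSES,
) -> RefinementResult:
    """Run revision passes until done, or stop at the cap and flag for review."""
    passes = 0
    while not is_done(code):
        if passes >= max_passes:
            # Cap reached without convergence: hand off to a human
            # instead of letting the agent keep iterating.
            return RefinementResult(code, passes, needs_human_review=True)
        code = refine(code)
        passes += 1
    return RefinementResult(code, passes, needs_human_review=False)
```

A caller would check `needs_human_review` and pause the workflow for sign-off rather than opening the pull request automatically.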

Secrets Exposure: The Credential Layer of the Problem

Vulnerability rates in code logic are one dimension of the risk. Hardcoded secrets embedded in AI-generated commits are another, and the 2026 data here is severe.

According to the GitGuardian State of Secrets Sprawl 2026, 28.65 million new hardcoded secrets were added to public GitHub commits in 2025 — a 34% year-over-year increase and the largest single-year jump on record. AI-assisted development is a primary driver: commits co-authored with AI coding tools showed a 3.2% secret-leak rate versus a 1.5% baseline across all public GitHub commits, roughly double the baseline exposure.

The same report found 24,008 unique secrets in MCP configuration files on public GitHub, with 2,117 confirmed still valid. AI service credentials — LLM API keys, orchestration tokens, vector database credentials — surged 81% year-over-year, with LLM infrastructure secrets leaking five times faster than credentials for core model providers.

The remediation gap is as significant as the exposure rate: 64% of valid secrets exposed in 2022 remain unrevoked as of 2026. Secrets detection must be treated as a blocking gate, not a notification.
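As a rough illustration of what a blocking gate means in practice, the following Python sketch scans the added lines of a unified diff and returns a non-zero exit code when something secret-like appears. The two patterns are deliberately simple examples and are an assumption of this sketch, not GitGuardian's detectors; a production scanner uses far larger detector sets plus validity checks.

```python
import re
import sys

# Illustrative patterns only (assumption): an AWS-style access key ID and a
# generic "name = 'long value'" assignment.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"(?i)(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*"
        r"['\"][^'\"\s]{16,}['\"]"
    ),
}


def find_secrets(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff_line_number, pattern_name) for each matching added line."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Inspect only lines the change adds; skip the "+++" file header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings


def gate_exit_code(diff_text: str) -> int:
    """Return 1 (block the merge) if any finding exists, else 0."""
    findings = find_secrets(diff_text)
    for number, name in findings:
        print(f"diff line {number}: possible {name}", file=sys.stderr)
    return 1 if findings else 0
```

A CI step could pass the pull request diff to `gate_exit_code` and exit with the returned value, so the pipeline fails instead of merely posting a notification.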

The Verification Gap: Developers Know but Do Not Act

The risk is not invisible to developers. Sonar's State of Code 2025 survey, covering more than 1,100 developers globally, found that 96% do not fully trust AI-generated output, yet only 48% consistently verify it before merging. AI now accounts for 42% of all committed code, a share developers expect to reach 65% by 2027. The math is straightforward: if roughly half of developers (52%) do not consistently verify AI-generated code and 45% of that code contains a known vulnerability class, the expected defect rate at merge is not a rounding error.
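A back-of-the-envelope version of that arithmetic is sketched below. It makes two simplifying assumptions that the sources do not establish: that the share of developers who verify is a fair proxy for the share of AI-generated code that gets verified, and that verification and vulnerability are independent. Treat the output as an order-of-magnitude estimate, not a measured rate.

```python
def unverified_vulnerable_share(verify_rate: float, vuln_rate: float) -> float:
    """Share of AI-generated code that is both unverified and vulnerable,
    assuming the two events are independent (a simplifying assumption)."""
    return (1.0 - verify_rate) * vuln_rate


# Figures quoted above: 48% consistently verify, 45% of samples are vulnerable.
share_of_ai_code = unverified_vulnerable_share(verify_rate=0.48, vuln_rate=0.45)
print(f"{share_of_ai_code:.1%}")  # prints "23.4%"

# Scaling by AI's reported 42% share of all committed code.
share_of_all_code = 0.42 * share_of_ai_code
print(f"{share_of_all_code:.1%}")  # prints "9.8%"
```

Under these assumptions, roughly one in four AI-generated changes would reach merge both unverified and vulnerable, which corresponds to about one in ten of all committed changes.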

Sonar's data also shows that 88% of developers report negative impacts from AI coding, specifically around code that looks correct but fails under real conditions. The functional quality concern and the security quality concern are the same phenomenon: AI optimizes for pattern plausibility, not correctness.

What to Do Now

The data supports a specific set of interventions, ordered by impact. These are not aspirational controls — they are the gaps the evidence points to directly.

1. Add CWE-80 and CWE-117 to your mandatory SAST ruleset and treat failures as blocking. XSS and log injection are the two highest-failure vulnerability classes in AI-generated code (85% and 87% failure rates respectively). Standard SAST configurations frequently downgrade or suppress these as informational. Elevate them to build-breaking status for any PR with AI-generated changes.
2. Implement secrets scanning as a pre-merge gate, not a post-merge notification. The GitGuardian data confirms that AI-assisted commits leak secrets at double the baseline rate, and that 64% of exposed secrets remain unrevoked years later. Blocking merges on detected secrets stops the sprawl at the cheapest point in the pipeline.
3. Require a dedicated security review pass on AI-generated pull requests, distinct from functional review. The formal verification study confirms models can identify their own vulnerabilities 78.7% of the time when explicitly asked. Use that capability: require the PR author or a review bot to run a security-specific review prompt before human approval. This is not redundant with SAST; it catches semantic and logic-layer issues that static tools miss.
4. Apply heightened scrutiny to Java codebases. At a 29% security pass rate, roughly half the rate of Python, Java is a substantial outlier in Veracode's dataset. Teams running Java microservices or backend APIs with AI assistance should treat every AI-generated PR as high-risk by default.
5. Set iteration limits on agentic coding workflows. The iterative degradation study shows critical vulnerabilities increasing 37.6% after five iterations. For any agentic workflow that allows multiple self-revision passes before generating a PR, enforce a human checkpoint after no more than three iterations. Configure the workflow to require sign-off before further refinement, not just before merge.
6. Track AI-generated code as a distinct metric in your defect reporting. You cannot manage what you cannot measure. Tag AI-assisted commits in your pipeline and compare defect rates against human-authored code. The industry baseline gives you a benchmark; your internal data tells you whether you are above or below it and whether your controls are working.
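The tracking item at the end of the list above can start small. The sketch below classifies commits by looking for a `Co-authored-by:` trailer that names an AI tool, then compares defect rates between the two groups. The marker list and the `caused_defect` flag are hypothetical: which trailers your tools write, and how a commit gets linked to a later defect, depend on your own pipeline.

```python
from dataclasses import dataclass

# Hypothetical markers (assumption); adjust to the trailers your tools write.
AI_TRAILER_MARKERS = ("copilot", "claude", "cursor")


@dataclass(frozen=True)
class Commit:
    sha: str
    message: str
    caused_defect: bool  # assumed to come from your own issue-tracker linkage


def is_ai_assisted(message: str) -> bool:
    """True if any Co-authored-by trailer names a listed AI tool."""
    for line in message.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "co-authored-by":
            if any(marker in value.lower() for marker in AI_TRAILER_MARKERS):
                return True
    return False


def defect_rates(commits: list[Commit]) -> dict[str, float]:
    """Defect rate per group; groups with no commits are omitted."""
    totals = {"ai": 0, "human": 0}
    defects = {"ai": 0, "human": 0}
    for commit in commits:
        group = "ai" if is_ai_assisted(commit.message) else "human"
        totals[group] += 1
        defects[group] += commit.caused_defect
    return {g: defects[g] / totals[g] for g in totals if totals[g]}
```

Trailer-based detection only sees commits where the tool or the developer added attribution, so it undercounts AI involvement; it is a starting point for the metric rather than a complete measurement.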

The Model Size Fallacy

One finding from Veracode's testing deserves specific attention for engineering leaders evaluating tool selection: model parameter count shows minimal correlation with security performance. Whether the underlying model has 20 billion or 400 billion parameters, security pass rates cluster around the same 55% mark. Upgrading to a larger or more expensive model is not a security control. The only exception in the current data is reasoning-optimized models, which achieve 70–72% pass rates — still an unacceptable vulnerability rate by any reasonable standard, but meaningfully better than the baseline.

The implication for procurement decisions: model capability announcements and benchmark scores are not relevant evidence for your security posture. The evidence that matters is security pass rates on tasks that match your actual codebase — and your process controls at the pull request gate.

Engineering teams that govern AI-generated code at the pull request level — with automated risk scoring, blocking security gates, and explicit review requirements — are operating on the correct assumption: that the model will produce vulnerable code in roughly half of cases, and that the pipeline's job is to catch it before it ships. Teams that are not yet operating on that assumption are accumulating security debt with every commit. The data on what that debt costs is not encouraging. The controls to address it are well understood.

re-entry.ai provides pull request risk scoring and governance controls purpose-built for teams using AI coding agents. If you want to understand where your current pipeline has gaps, the tooling is at https://re-entry.ai.
