45% of AI-generated code samples failed security testing, and most of those failures arrived in pull requests that looked clean to reviewers, according to Veracode's 2025 GenAI Code Security Report. The diff was tidy. The logic appeared sound. The problem was invisible to anyone scanning for obvious errors.
What Code Review Was Designed to Catch, and What It Isn't
Code review rests on two assumptions: that reviewers understand the full context of a change, and that the problems worth catching are visible in the diff. Both break under AI-generated code.
When a developer writes a function, they carry tacit knowledge of surrounding system constraints, security requirements, and architectural intent. An AI coding agent has none of that; it synthesizes from training data and prompt context. The output can be syntactically correct, pass linters, and satisfy style checks while still introducing:
Logic errors that only surface under specific runtime conditions
Hardcoded credentials or API key patterns that look like placeholder strings in the diff
Hallucinated package names that resolve to real malicious packages in public registries
OWASP Top 10 flaws embedded inside otherwise conventional boilerplate code
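Of the risks listed above, hallucinated package names are among the easiest to check mechanically. The following is a minimal Python sketch, assuming the team keeps an allowlist of vetted packages. The function names and the `approved` set are hypothetical and do not come from any tool named in this article.

```python
import ast


def top_level_imports(source: str) -> set:
    """Collect the top-level module names imported by a Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import (from . import x); skip those.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def unvetted_imports(source: str, approved: set) -> list:
    """Return imported modules that are not on the (hypothetical) allowlist."""
    return sorted(top_level_imports(source) - approved)


if __name__ == "__main__":
    changed_file = (
        "import os\n"
        "import requests\n"
        "from fastjsonx.core import loads\n"  # made-up name for illustration
    )
    print(unvetted_imports(changed_file, approved={"os", "requests"}))
```

Import names do not always match registry package names (for example, `yaml` is installed as `PyYAML`), so a real check would also need to inspect the dependency manifest. The point of the sketch is only that this signal can be computed without relying on a reviewer recognizing an unfamiliar name.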
Veracode's analysis found that AI coding tools failed to defend against cross-site scripting (CWE-80) in 86% of relevant code samples tested. Java, the enterprise standard for backend services, hit a 72% security failure rate across AI-generated code. These are not edge cases; they are systematic patterns tied to how large language models generate code without awareness of the systems they are modifying.
The Structural Gap Between Review and Risk
Code review is synchronous and attention-bounded. A reviewer applies heuristics: they look for things that look wrong. AI-generated code often looks right: well-structured, convention-following, with clear variable names and inline comments. The risk lives in behavior, not syntax.
Research published in 2025 covering AI-assisted pull requests across enterprise codebases found that AI-generated PRs averaged 10.83 findings per review versus 6.45 for human-authored code, roughly 1.7 times as many issues per change. Logic and correctness errors appeared more than twice as often. Yet those pull requests frequently cleared standard review checklists without a flag.
A risk scoring layer approaches the same pull request through a fundamentally different lens: it evaluates signals about what the change does, not how it looks in the diff.
Why Scoring Before Review Changes the ROI Calculation
The most immediate value of risk scoring is triage: deciding which pull requests need deep scrutiny and which can flow through without taxing senior reviewer bandwidth. Not every PR carries the same risk profile. A one-line documentation fix is not the same as an AI-generated authentication handler touching session management and database access.
Secrets exposure illustrates the gap precisely. GitGuardian's State of Secrets Sprawl 2026 found that AI-assisted commits carry a 3.2% secret-leak rate, more than double the 1.5% baseline across all public GitHub commits. Leaked AI-service credentials increased 81% year-over-year. These leaks rarely look like obvious secrets in a diff. They appear as long strings, configuration values, or environment variable references. A human reviewer scanning a pull request under time pressure will not reliably catch them.
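To make that concrete, here is a minimal Python sketch of one common heuristic: flagging long, high-entropy tokens on the added lines of a unified diff. It is an illustration under stated assumptions, not how GitGuardian or any other scanner works. The 20-character minimum, the 4.0-bit threshold, and the function names are all hypothetical choices.

```python
import math
import re
from collections import Counter

# Candidate tokens: runs of 20+ characters drawn from a base64-like alphabet.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def suspicious_tokens(diff_text: str, threshold: float = 4.0) -> list:
    """Return high-entropy tokens found on added lines of a unified diff."""
    hits = []
    for line in diff_text.splitlines():
        # Only added lines matter; "+++ b/path" is a file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for token in TOKEN_RE.findall(line[1:]):
            if shannon_entropy(token) >= threshold:
                hits.append(token)
    return hits
```

A check like this produces false positives (hashes, encoded test fixtures) and misses short or structured secrets, so it works better as one input to a score than as a gate on its own.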
The AgenticFlict dataset, a large-scale analysis of merge conflicts in AI coding agent pull requests across GitHub, makes a structural point: AI agents produce pull requests with measurable differences from human-authored code at scale, including distinct conflict patterns. Governing that output requires signals designed for how AI agents write code, not how humans do.
Risk scoring works at the signal layer before the human reviewer opens the diff. It converts a qualitative judgment call (is this PR worth careful review?) into a structured triage decision with a documented rationale. High-risk PRs get flagged and routed appropriately. Low-risk PRs clear without unnecessary friction.
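A minimal Python sketch of that conversion follows. Everything in it is an assumption made for illustration: the signal names, the weights, the threshold of 50, and the route labels are hypothetical and do not describe any particular product's scoring model.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequestSignals:
    """Hypothetical signals gathered before a human opens the diff."""
    ai_authored: bool
    touches_auth_or_session: bool
    new_dependencies: int
    suspected_secrets: int
    lines_changed: int


@dataclass
class TriageDecision:
    score: int
    route: str
    rationale: list = field(default_factory=list)


def triage(pr: PullRequestSignals, deep_review_at: int = 50) -> TriageDecision:
    """Turn signals into a score, a route, and a written rationale."""
    # Illustrative weights only; a real team would calibrate these.
    checks = [
        (pr.suspected_secrets > 0, 40, "possible secret in added lines"),
        (pr.touches_auth_or_session, 30, "touches authentication or session code"),
        (pr.ai_authored, 15, "authored by an AI coding agent"),
        (pr.new_dependencies > 0, 10, "introduces new dependencies"),
        (pr.lines_changed > 200, 10, "large change (over 200 lines)"),
    ]
    score = 0
    rationale = []
    for triggered, points, reason in checks:
        if triggered:
            score += points
            rationale.append(f"+{points}: {reason}")
    route = "deep-review" if score >= deep_review_at else "fast-track"
    return TriageDecision(score=score, route=route, rationale=rationale)
```

Under these made-up weights, a one-line documentation fix scores 0 and is fast-tracked, while an AI-generated authentication handler that adds a dependency across 240 changed lines scores 65 and is routed to deep review. The rationale list is what makes the decision auditable: each point in the score traces back to a named signal.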
Engineering teams that respond to AI coding risk by adding review hours are solving the wrong problem. The review process itself needs a signal layer, one that scores incoming PRs before they join the review queue, surfaces the risks that look clean in the diff, and gives reviewers the information they need to focus attention where it counts. More eyes on more code is not a governance strategy. Scored risk is.
re-entry.ai scores pull request risk for teams running AI coding agents, so reviewers see what the diff doesn't show. Learn more at re-entry.ai.