Why Still Review AI-Generated Code When It Outpaces Human Review?
The article analyzes how AI‑generated code is proliferating faster than human reviewers can handle. It presents data on PR volume and security risks, and contrasts arguments for abandoning code review with proposals for a balanced, risk‑aware review process.
1. The Data: Quantity Forces a Quality Shift
Faros aggregated engineering data from over 10,000 developers and 1,255 teams. The metrics show:
Completed tasks +21%
Merged PRs +98%
PR review time +91%
PR count and PR review time both nearly doubled, yet reviewer capacity did not grow proportionally. GitHub Octoverse 2025 reports 43 million merged PRs per month, with the share of AI‑generated code steadily growing.
Security‑related numbers from the Veracode GenAI Code Security Report 2025:
≈45% of AI‑generated code contains OWASP Top 10 vulnerabilities
AI code has a 1.75× higher logical error rate than human code
XSS issues appear 2.74× more often
Teams with high AI adoption see a ~30% rise in change‑failure rates
Speed has risen, defect density has risen, and review quality has declined, shifting the bottleneck from writing code to verifying correctness without adequate tooling.
2. The “Abolition” View: Review as Performance
Ankit Jain, founder of Aviator, published “How to Kill Code Review”. He argues that code review has never been a reliable quality gate and that AI merely exposes its flaws more starkly.
Jain notes that code review only became widespread around 2012‑2014. Even before AI, larger diffs correlated with lower quality, time pressure turned many PRs into superficial approvals, and reviewers often lacked the contextual knowledge of the author. When review speed exceeds a few hundred lines per hour, defect detection drops sharply.
His proposed solution is to move judgment forward: define acceptance criteria before any line of code is generated.
Function: Password reset
- Token must expire after 24 hours
- Expired token returns 401 with actionable error message
- Token is single‑use and invalidated after use
- More than 3 reset requests for the same email within 5 minutes trigger rate limiting

The AI agent implements the feature, a test suite validates it, and humans review the requirement definition itself. Some teams report that developers rarely look at the code, but user‑testing frequency has increased tenfold, moving attention from code to product outcomes.
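The acceptance criteria above translate almost directly into executable checks. A minimal Python sketch of the idea, using a hypothetical in‑memory service (all names, return codes, and structure here are illustrative, not from the article):

```python
import time
import secrets

class ResetTokenService:
    """Hypothetical password-reset service mirroring the spec's criteria."""
    TTL = 24 * 3600          # criterion 1: tokens expire after 24 hours
    RATE_WINDOW = 5 * 60     # criterion 4: rate-limit window of 5 minutes
    RATE_LIMIT = 3           # criterion 4: max requests per email in the window

    def __init__(self, clock=time.time):
        self.clock = clock    # injectable clock so tests can fast-forward time
        self.tokens = {}      # token -> [email, issued_at, used]
        self.requests = {}    # email -> [request timestamps]

    def request_reset(self, email):
        now = self.clock()
        recent = [t for t in self.requests.get(email, [])
                  if now - t < self.RATE_WINDOW]
        if len(recent) >= self.RATE_LIMIT:
            return 429, None                      # rate limited
        self.requests[email] = recent + [now]
        token = secrets.token_urlsafe(16)
        self.tokens[token] = [email, now, False]
        return 200, token

    def redeem(self, token):
        entry = self.tokens.get(token)
        if entry is None:
            return 401, "unknown token: request a new reset link"
        email, issued_at, used = entry
        if used:                                  # criterion 3: single-use
            return 401, "token already used: request a new reset link"
        if self.clock() - issued_at > self.TTL:   # criterion 2: 401 + message
            return 401, "token expired: request a new reset link"
        entry[2] = True                           # invalidate after use
        return 200, email
```

With the clock injected, each bullet in the spec becomes one deterministic test, which is exactly the artifact humans would review instead of the generated implementation.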
3. Requirement‑Driven Debate Points
First: Requirements are never complete. The password‑reset example looks thorough but hides unanswered questions (e.g., handling of suspended‑account tokens, MFA flows, third‑party payment timeout rollbacks). Iterative development discovers these boundaries, whereas spec‑driven development tries to seal them beforehand. Risk is not confined to file boundaries; a harmless‑looking utility may sit in a payment call chain.
Second: AI can bypass tests. An engineer may tell an assistant such as Claude that a failing test is unrelated to the change, or simply comment the failing test out. If reviewers rely solely on AI to audit AI‑generated code, such evasions go unnoticed: unlike a human, an AI agent does not push back on a failing test and may simply ignore or skip verification, leaving silent failures.
Third: Requirement language lacks precision. Natural‑language specifications sit between high‑level intent and concrete implementation, introducing ambiguity. Until a rigorously precise requirement language emerges, natural‑language specs remain a transitional step.
4. The “Guardians” View: Responsibility, Not Process
Addy Osmani, engineering lead for the Google Chrome team, argues that AI has not killed code review; it has clarified accountability.
AI error patterns. Human mistakes are often traceable; AI errors can be silent. Examples include logic that fails under daylight‑saving‑time changes or hidden resource‑consumption spikes that only surface under load.
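The daylight‑saving failure mode Osmani describes is easy to reproduce: computing "expires in 24 hours" with same‑zone wall‑clock arithmetic silently loses an hour when DST starts. A minimal demonstration around the 2025 US spring‑forward date (the token framing is illustrative):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

# Token issued at noon the day before US DST starts (2025-03-09).
issued = datetime(2025, 3, 8, 12, 0, tzinfo=NY)

# Same-zone datetime arithmetic in Python is wall-clock arithmetic:
# "noon + 24h" lands on noon the next day, even though only 23 real
# hours elapse across the spring-forward transition.
expires_local = issued + timedelta(hours=24)
elapsed = expires_local.astimezone(timezone.utc) - issued.astimezone(timezone.utc)
print(elapsed)  # 23:00:00 -- the token would expire an hour early

# The robust version: do the arithmetic in UTC.
expires_utc = issued.astimezone(timezone.utc) + timedelta(hours=24)
assert expires_utc - issued.astimezone(timezone.utc) == timedelta(hours=24)
```

The bug passes every test that does not straddle a DST boundary, which is precisely why it tends to surface in production rather than in review.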
Knowledge transfer. The OCaml community rejected a 13,000‑line AI‑generated PR because no one could review it, and reviewing AI code consumes more mental effort than reviewing human code. This creates "cognitive debt": a loss of shared understanding that is harder to repay than technical debt.
Attack‑surface expansion. AI toolchains introduce new security risks such as prompt injection, IDE‑plugin data leakage, and remote code execution. These systemic issues are difficult for static scanners to cover.
5. The Real Issue Is Responsibility, Not Workflow
If OpenAI or Anthropic were willing to assume liability for AI‑generated code defects, the “no‑review” argument would be stronger. As long as you bear the consequences, you cannot turn a blind eye.
When code fails—data leaks, system crashes, compliance violations—the blame falls on the approver, not the AI provider. Computers cannot be held accountable; humans remain the essential “human‑in‑the‑loop”.
6. A Pragmatic Middle Path
Pre‑define acceptance criteria but keep review. PRs must contain intent, test evidence, AI contribution level, and highlighted sections for human focus. Inability to produce this checklist usually indicates the submitter’s own lack of clarity.
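Such a checklist can be enforced mechanically in CI. A small sketch that rejects PR descriptions missing the required sections (the section names are my illustrative choices, not a standard):

```python
REQUIRED_SECTIONS = [
    "## Intent",            # why the change exists
    "## Test evidence",     # what was run, with results
    "## AI contribution",   # how much was generated vs. hand-written
    "## Review focus",      # where human attention is most needed
]

def missing_sections(pr_description: str) -> list[str]:
    """Return required headings absent from a PR description."""
    lowered = pr_description.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]
```

A bot comment listing the missing headings is usually enough; as the section says, failure to fill them in signals that the submitter lacks clarity, not that the tooling is too strict.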
Compress PR size, don’t abandon review. Large PRs degrade review quality. Even if AI generates thousands of lines, split changes into small, single‑purpose commits to give reviewers room.
Retain manual checks on high‑risk paths. Authentication, payment, database schema changes, and external dependencies require human inspection regardless of code appearance. Risk is not always evident from file names.
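Routing these paths to humans can be automated even when the review itself is not. A sketch of a path classifier (the glob patterns are illustrative; real repos would tune them, or use a mechanism like GitHub's CODEOWNERS):

```python
import fnmatch

# Paths where human review is mandatory regardless of who (or what)
# wrote the code. Patterns are illustrative examples only.
HIGH_RISK_GLOBS = [
    "auth/*", "*/auth/*",
    "payments/*", "*/payments/*",
    "migrations/*", "*/migrations/*",
    "requirements.txt", "package.json", "go.mod",   # external dependencies
]

def requires_human_review(changed_files: list[str]) -> bool:
    """True if any changed file falls on a high-risk path."""
    return any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_files
        for pattern in HIGH_RISK_GLOBS
    )
```

As the section notes, risk is not always evident from file names, so a list like this is a floor, not a ceiling: it guarantees the obvious cases are never skipped.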
Separate generation and verification agents. The code‑generating AI and the code‑reviewing AI should not share context; an independent model can spot issues the generator overlooks.
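The separation can be made explicit in the orchestration layer. A sketch with stub callables standing in for real model clients (the function shapes are my assumption, not a specific vendor API):

```python
from typing import Callable

def cross_check(
    spec: str,
    generate: Callable[[str], str],
    review: Callable[[str, str], list[str]],
) -> tuple[str, list[str]]:
    """Generate code from a spec with one model, then audit it with an
    independent one. The reviewer sees only the spec and the final code,
    never the generator's conversation, so it cannot inherit the
    generator's assumptions."""
    code = generate(spec)
    findings = review(spec, code)   # independent model, fresh context
    return code, findings
```

The key design choice is what the reviewer is *not* given: sharing the generator's chat history would let the reviewer rationalize the same mistakes.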
Rollback is a feature, not an emergency plan. Feature flags, gradual rollouts, and one‑click rollbacks assume that some defects will slip through; regular rehearsal of rollbacks improves reliability.
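A deterministic percentage rollout is the usual primitive behind this. A minimal sketch (hashing scheme is one common approach, not a prescribed one):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministic percentage rollout: the same user always lands in
    the same bucket, so ramping 1% -> 10% -> 100% only ever adds users,
    and setting percent to 0 is an instant, one-click rollback."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent
```

Because the bucket is a pure function of user and feature, rollback rehearsal is cheap: flip the percentage down, confirm behavior, flip it back.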
7. No Final Answer, Only Trade‑offs
Human judgment should be applied where it adds the most value, not to line‑by‑line inspection of machine output. The necessary tooling and precise requirement languages are still immature.
If teams lose shared system knowledge, incidents become harder to diagnose. Maintaining the status quo turns review into a KPI exercise, while removing checks entirely erodes both technical and cognitive health.
The core fact remains: AI‑generated code now outpaces human review speed, yet the consequences of defects still fall on people. Finding a workable balance is the pressing engineering challenge for every team.
Sources: Ankit Jain, "How to Kill Code Review" (Latent.Space, 2026); Addy Osmani, "Code Review in the Age of AI" (Elevate, 2026); Faros Engineering Data; GitHub Octoverse 2025; Veracode GenAI Code Security Report 2025.