
Why and How to Conduct Code Reviews: From Traditional Practices to AI Agents (Part 1)

The article explains why code review matters, how it should be performed, and how the rise of AI‑generated code reshapes review practices, introducing a five‑level review taxonomy and a methodology that combines atomic pull requests, layered reading, comment grading, and measurable SLAs.

AI Engineer Programming

Why Code Review

Code review is one of the most widely accepted quality practices in modern software engineering, yet its importance is often reduced to merely "bug hunting" or a perfunctory KPI.

In a 1976 empirical study at IBM, Michael Fagan reported that code inspections (formal code reviews) found 38 defects per thousand lines of code, compared with only 8 found by unit testing; the inspections accounted for 82% of the known defects in the final product (Fagan, "Design and Code Inspections to Reduce Errors in Program Development").

The timing of defect detection strongly affects repair cost: the later a defect is found, the more it costs to fix, and the difference between the development and production stages can reach an order of magnitude. As the last manual gate before testing and deployment, code review can block defects while they are still cheap to fix.

Code review also serves as a highly effective knowledge‑sharing mechanism because it occurs immediately after code is written, when the author’s intent is freshest.

AI‑Generated Code Increases Review Urgency

AI code generation has become a de facto standard tool, but it accelerates output without making the code more trustworthy. AI adoption correlates with higher defect rates, larger PRs, and longer review times. Review therefore becomes both more important and more difficult, because AI‑generated changes demand deeper scrutiny.

Review Hierarchy (L1–L5)

Based on the BitsAI‑CR three‑layer classification (review dimensions, issue types, detailed criteria) and practical experience, the article defines five review levels:

L1: Style & Formatting

Naming conventions, indentation, comment quality, file organization. Highly formalizable and suitable for automated linters (ESLint, Pylint, RuboCop) and formatters (Prettier, Black). Human reviewers should not waste effort on these.

L2: Correctness & Robustness

Logical errors, boundary handling, exception coverage, null‑pointer and resource‑leak issues. This is the core responsibility of traditional code review and the layer where LLMs can contribute most.

L3: Security

SQL injection, XSS, CSRF, unsafe deserialization, hard‑coded keys, missing permission checks. Overlaps with L2 but requires specialized security knowledge. Traditionally covered by SAST tools (SonarQube, CodeQL, Semgrep); LLMs can surpass rule‑based engines by understanding business context.

L4: Design & Architecture

Module coupling, interface contracts, SOLID violations, unnecessary abstractions, and alignment with system evolution. Requires a cross‑file, global perspective—an area where static analysis falls short and AI agents aim to excel. Most technical debt originates from issues at this level.

L5: Business Semantics

Whether the code solves the right problem, matches real requirements, and avoids undocumented side effects. This level depends heavily on domain knowledge and remains largely irreplaceable by AI.

The five levels are not independent: when a higher‑level issue (e.g., an L4 design flaw) is discovered late, fixing it forces changes to code that reviewers have already checked at the lower levels (L1–L3), which increases overall effort.
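The hierarchy can be written down as a small data structure. The following is a minimal sketch, not part of BitsAI‑CR: the names `ReviewLevel`, `PRIMARY_CHECKER`, `Finding`, and `order_for_discussion` are hypothetical, and the checker assignments simply restate the level descriptions above. Sorting findings from the highest level down reflects the point that late L4 findings cause lower‑level rework.

```python
from enum import IntEnum
from typing import NamedTuple


class ReviewLevel(IntEnum):
    STYLE = 1         # L1: style and formatting
    CORRECTNESS = 2   # L2: correctness and robustness
    SECURITY = 3      # L3: security
    DESIGN = 4        # L4: design and architecture
    BUSINESS = 5      # L5: business semantics


# Illustrative assignment that restates the level descriptions; adjust per team.
PRIMARY_CHECKER = {
    ReviewLevel.STYLE: "linters and formatters",
    ReviewLevel.CORRECTNESS: "human reviewer, assisted by an LLM",
    ReviewLevel.SECURITY: "SAST tools plus security-aware review",
    ReviewLevel.DESIGN: "human reviewer, possibly assisted by an AI agent",
    ReviewLevel.BUSINESS: "human reviewer with domain knowledge",
}


class Finding(NamedTuple):
    level: ReviewLevel
    note: str


def order_for_discussion(findings):
    """Return findings with the highest level first.

    Raising design-level issues before style-level ones avoids polishing
    code that a later architectural change would rewrite anyway.
    """
    return sorted(findings, key=lambda f: f.level, reverse=True)
```

Because `sorted` is stable, findings at the same level keep the order in which they were recorded.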

Review Methodology

Atomic Changes as Prerequisite

Google's Engineering Practices guide recommends small, focused changelists. Small PRs (roughly 200–400 lines of net change at most) yield more complete reviews, lower merge risk, and easier regression isolation. Mixing refactoring, new features, and bug fixes in a single PR makes faults hard to isolate.
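A pre‑review check can flag PRs that break these two rules. This is a minimal sketch under stated assumptions: `FileChange`, its `kind` label, and `check_atomicity` are hypothetical names, the 400‑line default is the upper end of the range above, and size is approximated as added plus deleted lines rather than a true net figure.

```python
from dataclasses import dataclass
from typing import List

MAX_CHANGED_LINES = 400  # assumed default: upper end of the 200-400 range


@dataclass
class FileChange:
    path: str
    added: int
    deleted: int
    kind: str  # hypothetical label: "feature", "bugfix", or "refactor"


def check_atomicity(changes: List[FileChange],
                    max_lines: int = MAX_CHANGED_LINES) -> List[str]:
    """Return warnings for an oversized PR or one that mixes change types."""
    warnings = []
    total = sum(c.added + c.deleted for c in changes)
    if total > max_lines:
        warnings.append(
            f"PR changes {total} lines (limit {max_lines}); consider splitting it"
        )
    kinds = sorted({c.kind for c in changes})
    if len(kinds) > 1:
        warnings.append("PR mixes change types: " + ", ".join(kinds))
    return warnings
```

An empty list means the PR passes both checks; how the `kind` label is obtained (PR template, commit prefix, manual tag) is left open here.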

Layered Reading: Architecture First, Details Later

Review steps:

Read the PR description to understand "what" and "why".

Inspect the file‑tree changes to form an overall architectural impression.

Identify critical paths (core business logic, public APIs, database operations).

Deep‑dive review of critical paths (L2–L4).

Delegate remaining files to tooling for L1–L2 checks.
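Steps 3 to 5 amount to splitting the changed files into a deep‑review set and a tooling set. The sketch below assumes critical paths can be described by glob patterns; the patterns and the name `split_by_priority` are hypothetical and would differ per repository.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for core logic, public APIs, and database operations.
CRITICAL_PATTERNS = ["src/core/*", "api/*", "*/migrations/*", "*.sql"]


def split_by_priority(paths, patterns=CRITICAL_PATTERNS):
    """Split changed paths into (critical, routine) lists, preserving order.

    Critical files get a human deep-dive (L2-L4); routine files are left
    to automated L1-L2 checks.
    """
    critical, routine = [], []
    for path in paths:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            critical.append(path)
        else:
            routine.append(path)
    return critical, routine
```

Note that `*` in `fnmatchcase` patterns also matches `/`, so `src/core/*` covers nested directories.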

Comment Grading

Different issue levels receive different weights:

blocker: must be fixed before merge; concerns correctness, security, or architectural consistency.

suggestion: optional change; requires justification if applied.

nit: minor improvement; can be addressed in the current or a future PR.

This grading clarifies which comments need immediate response and prevents style preferences from masquerading as technical problems.
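The grades translate naturally into a merge gate. The following is a minimal sketch; `Grade`, `Comment`, and `can_merge` are hypothetical names, and the rule that only unresolved blockers stop a merge is one reading of the definitions above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Grade(Enum):
    BLOCKER = "blocker"
    SUGGESTION = "suggestion"
    NIT = "nit"


@dataclass
class Comment:
    grade: Grade
    text: str
    resolved: bool = False


def can_merge(comments: List[Comment]) -> bool:
    """A PR may merge once no unresolved blocker remains.

    Suggestions and nits never block, whether resolved or not.
    """
    return not any(
        c.grade is Grade.BLOCKER and not c.resolved for c in comments
    )
```

For example, a PR with one open nit and one resolved blocker can merge, while a single open blocker holds it back.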

Quantifiable SLA

Review latency is an implicit efficiency killer. Common SLA targets include first response within 8 hours for weekday PRs, 24 hours for cross‑timezone PRs, and 2 hours for hot‑fixes. The goal is to turn vague "review delay" into a trackable engineering metric.
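The targets above can be checked mechanically. This sketch makes simplifying assumptions: the category keys and the function name `first_response_breached` are hypothetical, and elapsed time is plain wall‑clock time with no adjustment for weekends or working hours.

```python
from datetime import datetime, timedelta
from typing import Optional

# First-response targets in hours, taken from the paragraph above.
SLA_HOURS = {"weekday": 8, "cross_timezone": 24, "hotfix": 2}


def first_response_breached(opened_at: datetime,
                            first_response_at: Optional[datetime],
                            kind: str,
                            now: datetime) -> bool:
    """Return True if the first review response missed its SLA.

    If nobody has responded yet, the PR is judged against `now`.
    """
    deadline = opened_at + timedelta(hours=SLA_HOURS[kind])
    responded = first_response_at if first_response_at is not None else now
    return responded > deadline
```

Counting breaches per week, or charting time‑to‑first‑response per team, is what turns "review delay" into a tracked metric.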

Traditional Review vs. AI‑Enhanced Review

Limitations of Traditional Review

Context‑switching imposes cognitive cost; large PRs cause superficial reviews; review depth varies across reviewers and time, making quality inconsistent.

LLM‑Assisted Stage

Early AI tools overlay a "smart comment bot" onto the GitHub UI, feeding diffs to an LLM and pasting line‑level comments back. Problems include a limited context window, added noise in discussion threads, and no tool‑calling capability. LLMs handle L1–L2 well but struggle with L3–L4.

AI Agent Stage

AI agents differ by possessing action capabilities beyond generation. A full‑featured code‑review agent includes:

Perception: read diffs, access full repository tree, retrieve PR history, search related issues.

Reasoning: trace cross‑file call chains, understand module dependencies, identify architectural patterns.

Tool Use: run linters, execute test suites, invoke SAST tools, perform code‑base searches.

Action: generate structured comments, submit GitHub reviews, trigger CI pipelines.

Reflection: verify its own output to reduce false positives.

This capability set enables agents to address L3–L4 problems, moving from single‑step comment generation to multi‑step decision making.
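To make the capability list concrete, here is a deliberately simplified single‑pass skeleton. Everything in it is an assumption for illustration: `run_review`, `Finding`, and the callables passed in are hypothetical, reflection is reduced to a confidence threshold, and a real agent would loop over these steps rather than run them once. The action step (posting comments) is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    path: str
    message: str
    confidence: float  # 0.0-1.0, assigned by the (hypothetical) reasoning step


def run_review(diff: str,
               perceive: Callable[[str], dict],
               tools: Dict[str, Callable[[dict], str]],
               reason: Callable[[dict, Dict[str, str]], List[Finding]],
               min_confidence: float = 0.6) -> List[Finding]:
    """Run one perception -> tool use -> reasoning -> reflection pass."""
    # Perception: gather context beyond the raw diff (repo tree, PR history).
    context = perceive(diff)
    # Tool use: collect evidence from linters, tests, SAST, code search.
    evidence = {name: tool(context) for name, tool in tools.items()}
    # Reasoning: turn context plus evidence into candidate findings.
    findings = reason(context, evidence)
    # Reflection (simplified): drop low-confidence findings to cut noise.
    return [f for f in findings if f.confidence >= min_confidence]
```

Passing the steps in as callables keeps the skeleton independent of any particular LLM, linter, or code‑hosting API.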

Conclusion

Code review’s value extends beyond bug detection to creating a systematic quality feedback loop. The L1–L5 hierarchy guides the allocation of human effort and automation: low‑level mechanical checks belong to tools, while high‑level design and business‑semantic judgments require human insight. AI reshapes this division: LLM‑assisted tools push automation up to L2, and emerging AI agents aim to cover L3–L4, at the cost of added architectural complexity and deeper integration needs.

References:

Fagan, M. (1976). Design and Code Inspections to Reduce Errors in Program Development.

Software Engineering at Google.

BitsAI‑CR framework: https://arxiv.org/abs/2501.15134

Google Engineering Practices: https://github.com/google/eng-practices

Abseil SWE Book, Chapter 9: https://abseil.io/resources/swe-book/html/ch09.html
