Why and How to Conduct Code Reviews: From Traditional Practices to AI Agents (Part 1)
The article explains why code review matters, how it should be performed, and how the rise of AI‑generated code reshapes review practices, introducing a five‑level review taxonomy and a methodology that combines atomic pull requests, layered reading, comment grading, and measurable SLAs.
Why Code Review
Code review is one of the most widely accepted quality practices in modern software engineering, yet its importance is often reduced to merely "bug hunting" or a perfunctory KPI.
In his 1976 empirical study at IBM, "Design and Code Inspections to Reduce Errors in Program Development", Michael Fagan reported that inspections found 38 defects per thousand lines of code, compared with only 8 found by unit testing, so inspections accounted for 82% of the known defects in the final product.
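As a quick arithmetic check, the 82% share is what results if the 38 inspection-found and 8 unit-test-found defects per thousand lines are treated as the whole known total. That reading is an assumption of this sketch, and the function name is hypothetical.

```python
def inspection_share(found_by_inspection: float, found_by_unit_test: float) -> float:
    """Fraction of known defects caught by inspection.

    Assumes the two counts together make up all known defects,
    which is a simplification made for this illustration.
    """
    total = found_by_inspection + found_by_unit_test
    if total == 0:
        raise ValueError("no defects recorded")
    return found_by_inspection / total


share = inspection_share(38, 8)
print(f"{share:.1%}")  # 38 / 46, roughly the 82% quoted in the text
```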
The timing of defect detection strongly affects repair cost: the later a defect is found, the more it costs to fix, and the gap between the development and production stages can reach an order of magnitude. As the last manual gate before testing and deployment, code review can block defects while they are still cheap to fix.
Code review also serves as a highly effective knowledge‑sharing mechanism because it occurs immediately after code is written, when the author’s intent is freshest.
AI‑Generated Code Increases Review Urgency
AI code generation has become a de-facto standard tool, accelerating output without improving code trustworthiness. AI adoption correlates with higher defect rates, larger PRs, and longer review times. Consequently, review becomes both more important and more difficult, because AI-generated changes demand deeper scrutiny.
Review Hierarchy (L1–L5)
Based on the BitsAI‑CR three‑layer classification (review dimensions, issue types, detailed criteria) and practical experience, the article defines five review levels:
L1: Style & Formatting
Naming conventions, indentation, comment quality, file organization. Highly formalizable and suitable for automated linters (ESLint, Pylint, RuboCop) and formatters (Prettier, Black). Human reviewers should not waste effort on these.
L2: Correctness & Robustness
Logical errors, boundary handling, exception coverage, null‑pointer and resource‑leak issues. This is the core responsibility of traditional code review and the layer where LLMs can contribute most.
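To make this category concrete, here is a small illustrative function written the way an L2 review would want it: it handles a missing argument, empty input, malformed lines, and file cleanup. The function name and the one-number-per-line file format are hypothetical.

```python
from typing import List, Optional


def mean_from_file(path: Optional[str]) -> Optional[float]:
    """Average of a file holding one number per line.

    Returns None when there is nothing to average.
    """
    if path is None:  # missing-value check instead of crashing in open()
        return None

    values: List[float] = []
    # The with-block closes the file even if parsing raises (no resource leak).
    with open(path, encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            text = raw.strip()
            if not text:  # tolerate blank lines
                continue
            try:
                values.append(float(text))
            except ValueError as exc:
                # Fail with context rather than silently skipping bad data.
                raise ValueError(f"{path}:{line_no}: not a number: {text!r}") from exc

    if not values:  # boundary case: avoid dividing by zero on empty input
        return None
    return sum(values) / len(values)
```

Each guard corresponds to one of the issue types listed above; a version without them would still pass a happy-path test, which is why this layer relies on review.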
L3: Security
SQL injection, XSS, CSRF, unsafe deserialization, hard‑coded keys, missing permission checks. Overlaps with L2 but requires specialized security knowledge. Traditionally covered by SAST tools (SonarQube, CodeQL, Semgrep); LLMs can surpass rule‑based engines by understanding business context.
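As an illustration of the first item, the sketch below contrasts a string-built query with a parameterized one using Python's standard sqlite3 module. The table, data, and function names are made up for the example.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str):
    # Parameterized: the driver passes the value separately from the SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # 2: the injected OR matches every row
print(len(find_user_safe(conn, payload)))    # 0: the payload is treated as a plain name
```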
L4: Design & Architecture
Module coupling, interface contracts, SOLID violations, unnecessary abstractions, and alignment with system evolution. Requires a cross‑file, global perspective—an area where static analysis falls short and AI agents aim to excel. Most technical debt originates from issues at this level.
L5: Business Semantics
Whether the code solves the right problem, matches real requirements, and avoids undocumented side effects. This level depends heavily on domain knowledge and remains largely irreplaceable by AI.
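The five levels are not independent: when a higher-level issue (for example, an L4 design problem) is discovered late, fixing it usually means changing code that has already been reviewed at L1–L3, so that code must be reviewed again and overall effort increases.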
Review Methodology
Atomic Changes as Prerequisite
Google Engineering Practices recommends small, focused changelists. Small PRs (roughly 200–400 lines of net change at most) yield higher review completeness, lower merge risk, and easier regression isolation. Mixing refactoring, new features, and bug fixes in a single PR makes fault isolation difficult.
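A size limit of this kind could be checked automatically. The sketch below sums added and deleted lines from text in the format printed by git diff --numstat and compares the total with a 400-line ceiling, the upper end of the range mentioned above. Counting added plus deleted lines is only one possible reading of "net change", and the function names are hypothetical.

```python
def changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` style text.

    Each line is expected as "<added>\t<deleted>\t<path>"; binary files,
    which git reports as "-\t-\t<path>", are skipped.
    """
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # ignore blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary file, no line counts available
        total += int(added) + int(deleted)
    return total


def is_reviewable_size(numstat: str, limit: int = 400) -> bool:
    """True when the change stays within the chosen line limit."""
    return changed_lines(numstat) <= limit


sample = "120\t30\tsrc/app.py\n-\t-\tassets/logo.png\n10\t0\tREADME.md\n"
print(changed_lines(sample))       # 160
print(is_reviewable_size(sample))  # True
```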
Layered Reading: Architecture First, Details Later
Review steps:
Read the PR description to understand "what" and "why".
Inspect the file‑tree changes to form an overall architectural impression.
Identify critical paths (core business logic, public APIs, database operations).
Deep‑dive review of critical paths (L2–L4).
Delegate remaining files to tooling for L1–L2 checks.
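The split between critical paths and remaining files in the steps above can be sketched as a simple triage over the changed file list. The path patterns are hypothetical placeholders; a real project would define its own.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple

# Hypothetical patterns for core business logic, public APIs, and database changes.
CRITICAL_PATTERNS = ("src/billing/*", "api/*", "*/migrations/*")


def triage(changed_files: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split changed files into (critical, routine), preserving input order.

    Critical files get a human deep-dive; routine files go to tooling.
    """
    critical: List[str] = []
    routine: List[str] = []
    for path in changed_files:
        if any(fnmatchcase(path, pattern) for pattern in CRITICAL_PATTERNS):
            critical.append(path)
        else:
            routine.append(path)
    return critical, routine


files = [
    "api/v1/users.py",
    "docs/readme.md",
    "db/migrations/0004.sql",
    "src/billing/invoice.py",
    "tests/test_utils.py",
]
critical, routine = triage(files)
print(critical)  # the API, migration, and billing files
print(routine)   # the docs and test files
```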
Comment Grading
Different issue levels receive different weights:
blocker: must be fixed before merge; concerns correctness, security, or architectural consistency.
suggestion: optional change; requires justification if applied.
nit: minor improvement; can be addressed in the current or a future PR.
This grading clarifies which comments need immediate response and prevents style preferences from masquerading as technical problems.
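One way to make the grades machine-readable is a small data model with a merge gate that only unresolved blockers can close. This is a minimal sketch; the class and function names are hypothetical and not tied to any review platform's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Grade(Enum):
    BLOCKER = "blocker"        # must be fixed before merge
    SUGGESTION = "suggestion"  # optional change
    NIT = "nit"                # minor improvement, may be deferred


@dataclass
class Comment:
    grade: Grade
    text: str
    resolved: bool = False


def can_merge(comments: Iterable[Comment]) -> bool:
    """A PR is mergeable when no blocker comment remains unresolved."""
    return not any(c.grade is Grade.BLOCKER and not c.resolved for c in comments)


comments = [
    Comment(Grade.NIT, "rename tmp to buffer"),
    Comment(Grade.BLOCKER, "missing permission check on delete"),
]
print(can_merge(comments))  # False: one blocker is still open
comments[1].resolved = True
print(can_merge(comments))  # True: the open nit does not block the merge
```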
Quantifiable SLA
Review latency is an implicit efficiency killer. Common SLA targets include first response within 8 hours for weekday PRs, 24 hours for cross‑timezone PRs, and 2 hours for hot‑fixes. The goal is to turn vague "review delay" into a trackable engineering metric.
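To track such targets, each PR needs a deadline and a breach check. The sketch below uses the three targets quoted above and measures plain wall-clock hours; it ignores weekends and working hours, which a real policy would likely account for. The category keys and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# First-response targets in hours, taken from the figures in the text.
SLA_HOURS = {"standard": 8, "cross_timezone": 24, "hotfix": 2}


def first_response_deadline(opened_at: datetime, kind: str) -> datetime:
    """Latest acceptable time for the first review response."""
    return opened_at + timedelta(hours=SLA_HOURS[kind])


def is_breached(
    opened_at: datetime,
    kind: str,
    now: datetime,
    first_response_at: Optional[datetime] = None,
) -> bool:
    """True if the first response came, or is still pending, after the deadline."""
    deadline = first_response_deadline(opened_at, kind)
    effective = first_response_at if first_response_at is not None else now
    return effective > deadline


opened = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
now = opened + timedelta(hours=3)
print(is_breached(opened, "hotfix", now))    # True: 3 hours without a response
print(is_breached(opened, "standard", now))  # False: still inside the 8-hour window
```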
Traditional Review vs. AI‑Enhanced Review
Limitations of Traditional Review
Context‑switching imposes cognitive cost; large PRs cause superficial reviews; review depth varies across reviewers and time, making quality inconsistent.
LLM‑Assisted Stage
Early AI tools overlay a "smart comment bot" onto the GitHub UI: they feed diffs to an LLM and paste line-level comments back. Problems include a limited context window, added noise in discussion threads, and no tool-calling capability. LLMs handle L1–L2 well but struggle with L3–L4.
AI Agent Stage
AI agents differ in that they can take actions, not only generate text. A full-featured code-review agent includes:
Perception: read diffs, access full repository tree, retrieve PR history, search related issues.
Reasoning: trace cross‑file call chains, understand module dependencies, identify architectural patterns.
Tool Use: run linters, execute test suites, invoke SAST tools, perform code‑base searches.
Action: generate structured comments, submit GitHub reviews, trigger CI pipelines.
Reflection: verify its own output to reduce false positives.
This capability set enables agents to address L3–L4 problems, moving from single‑step comment generation to multi‑step decision making.
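The multi-step flow can be sketched as a small pipeline in which each capability is one stage. This is a structural illustration only: the propose and verify callables stand in for model calls, the tools are stubs, and none of the names correspond to a real agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Finding:
    path: str
    message: str
    confirmed: bool = False


@dataclass
class ReviewContext:
    diff: str
    tool_outputs: Dict[str, str] = field(default_factory=dict)
    findings: List[Finding] = field(default_factory=list)


def run_review(
    diff: str,
    tools: Dict[str, Callable[[str], str]],
    propose: Callable[[ReviewContext], List[Finding]],
    verify: Callable[[Finding, ReviewContext], bool],
) -> List[Finding]:
    ctx = ReviewContext(diff=diff)            # perception: gather the input
    for name, tool in tools.items():          # tool use: run each checker
        ctx.tool_outputs[name] = tool(diff)
    ctx.findings = propose(ctx)               # reasoning: draft candidate findings
    for finding in ctx.findings:              # reflection: re-check each finding
        finding.confirmed = verify(finding, ctx)
    # action: only confirmed findings would be posted as review comments
    return [f for f in ctx.findings if f.confirmed]


# Stub stages for demonstration only.
def fake_linter(diff: str) -> str:
    return "unused import os" if "import os" in diff else ""


def propose(ctx: ReviewContext) -> List[Finding]:
    findings = [Finding("app.py", "possible SQL injection")]
    if ctx.tool_outputs.get("lint"):
        findings.append(Finding("app.py", ctx.tool_outputs["lint"]))
    return findings


def verify(finding: Finding, ctx: ReviewContext) -> bool:
    # Keep a finding only if some tool output backs it up.
    return finding.message in ctx.tool_outputs.values()


reported = run_review("import os\n", {"lint": fake_linter}, propose, verify)
print([f.message for f in reported])  # only the lint-backed finding survives
```

The reflection stage is what separates this from the single-step comment bot: the unsupported "possible SQL injection" guess is dropped before anything is posted.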
Conclusion
Code review’s value extends beyond bug detection to creating a systematic quality feedback loop. The L1–L5 hierarchy guides the allocation of human effort and automation: low‑level mechanical checks belong to tools, while high‑level design and business‑semantic judgments require human insight. AI reshapes this division—LLM‑assisted tools push automation to L2, and emerging AI agents aim to conquer L3–L4, albeit with added architectural complexity and deeper integration needs.
References:
Fagan, M. (1976). Design and Code Inspections to Reduce Errors in Program Development.
Software Engineering at Google.
BitsAI‑CR framework: https://arxiv.org/abs/2501.15134
Google Engineering Practices: https://github.com/google/eng-practices
Abseil SWE Book, Chapter 9: https://abseil.io/resources/swe-book/html/ch09.html