Managing AI‑Generated Code with an Agent‑Based Evaluation Framework: Lessons from Refactoring 310 K Lines

With over 90% of the codebase produced by AI, the authors show how a unified "people‑align → human‑machine‑align" approach, driven by evaluation agents, turns technical debt into incremental work attached to business features, enabling continuous refactoring, AI‑friendly standards, and a sustainable engineering environment.

ITPUB

Background

The Agent evaluation system supports multiple core business scenarios, handling data production, workflow orchestration, quality control, and multi‑person collaboration. Its complexity shows up in three dimensions: requirements that are vague and exploratory; rapid growth from 5 K to 310 K lines of code, with roughly 16 new requirements per month; and a combinatorial ("Cartesian‑product") matrix of multimodal data evaluation, business‑task views, and dozens of quality‑check mechanisms.

Why Refactor?

Three pain points forced a large‑scale rewrite:

Business models outgrew the legacy architecture, causing "siloed" feature development.

Severe technical debt produced "spaghetti" code, so that any change could ripple through the entire system.

Rapid team expansion and 90% AI‑generated code amplified inconsistency, risking uncontrolled code decay without strict constraints.

The team therefore needed not only an engineering refactor but also AI‑compatible development standards, both to eliminate existing debt and to prevent new debt.

Refactoring Timeline and Execution Path

Phase 1 — Define Problems, Use AI to Surface Technical Debt (Feb 2026)

Human reviewers identified high‑risk areas, then handed exhaustive scanning to AI. This hybrid "expert experience + AI assistance" quickly uncovered P0/P1 debt such as business‑model flaws, database query bottlenecks, state‑management issues, and index problems. The key insight: AI excels at seeing the whole system, but humans must decide which problems are most important.

With AI assistance, engineers pinpointed ten deeply hidden performance issues that would have been nearly impossible to find through manual code review.

The result reshaped the notion of experience: instead of "being able to see everything" (which AI now provides), the valuable human skill becomes "judging what matters".
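The division of labour in Phase 1 can be illustrated with a small triage routine. This is only a sketch under assumptions: the `Finding` record, the numeric priority encoding, and the path prefixes are hypothetical names chosen for illustration, not the team's actual tooling. The AI supplies the exhaustive list of findings; the human‑defined high‑risk zones decide which of them get attention first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One item reported by an exhaustive AI scan (hypothetical shape)."""
    path: str      # file the scan flagged
    issue: str     # short description of the problem
    priority: int  # 0 = P0, 1 = P1, ... (lower is more urgent)


def triage(findings, high_risk_prefixes):
    """Keep findings inside human-defined high-risk zones, most urgent first."""
    prefixes = tuple(high_risk_prefixes)
    in_scope = [f for f in findings if f.path.startswith(prefixes)]
    return sorted(in_scope, key=lambda f: (f.priority, f.path))


if __name__ == "__main__":
    scan = [
        Finding("order/dao/query.py", "query without index", 0),
        Finding("docs/readme.md", "outdated paragraph", 2),
        Finding("order/state.py", "shared mutable state", 1),
    ]
    for f in triage(scan, ["order/"]):
        print(f"P{f.priority} {f.path}: {f.issue}")
```

The filter encodes the human judgment ("what matters"), while the input list stands in for what the AI scan can see.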

Phase 2 — Research and Establish AI‑Friendly Development Standards (Feb 2026 – Mar 2026)

With debt identified, the team asked how to make code quality consistent when 90% of it is AI‑generated. They introduced a two‑step alignment model:

Standard alignment (people‑align): a single strong role synchronises product, operations, algorithm, QA, and other stakeholders on evaluation criteria.

Human‑machine alignment: after standards are fixed, AI models are tuned and metrics are set so that the AI‑human agreement rate reaches a basic threshold (e.g., 90%).
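As a rough illustration of the second step, the agreement rate can be computed as the share of items on which the AI verdict matches the human verdict. The function names and the label values below are assumptions made for this sketch; only the 90% example threshold comes from the text.

```python
def agreement_rate(human_labels, ai_labels):
    """Fraction of items where the AI verdict equals the human verdict."""
    if len(human_labels) != len(ai_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("at least one labelled item is required")
    matches = sum(h == a for h, a in zip(human_labels, ai_labels))
    return matches / len(human_labels)


def is_aligned(human_labels, ai_labels, threshold=0.90):
    """True when the agreement rate reaches the chosen threshold."""
    return agreement_rate(human_labels, ai_labels) >= threshold


if __name__ == "__main__":
    human = ["pass"] * 9 + ["fail"]
    ai = ["pass"] * 10
    print(agreement_rate(human, ai), is_aligned(human, ai))
```

Measuring against human labels only makes sense once the people‑align step has fixed what the labels mean, which is why the order of the two steps matters.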

Standards were codified as always‑loaded AI Rules and Skills, embedded in the pre‑PR stage to enforce constraints before code submission.

Phase 3 — Build SOP and Incremental Refactor (Mar – Apr 2026)

Refactoring focused on a four‑layer architecture (Starter / Application / Infrastructure / Common) and on eliminating the deep coupling of PO objects across the call chain. The work proceeded through three actions:

Action 1: 100% AI‑assisted engineering layer decoupling – migrate from monolithic packages to domain‑oriented modules, while AI handles bulk code movement.

Action 2: Zero‑schedule refactor – pay down technical debt as side tasks attached to regular business stories, avoiding dedicated refactor sprints.

Action 3: Refactor quality assurance – introduce AI‑driven pre‑PR checks, high‑level model‑to‑model reviews, and cross‑vendor model audits to keep code‑review throughput from becoming a bottleneck.

These actions turned a traditionally resource‑heavy rewrite into a parallel, team‑wide effort guided by SOPs that AI could execute.
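One way to picture a pre‑PR check for the layered architecture is a script that rejects imports pointing "upward" between layers. The sketch below rests on several assumptions that the source does not confirm: that a layer may depend only on itself and on layers listed after it in Starter / Application / Infrastructure / Common, that layer names appear as package segments, and that the code uses Python‑style import lines. The module names are hypothetical.

```python
import re

# ASSUMPTION: a layer may depend only on itself and on layers listed after it.
LAYERS = ["starter", "application", "infrastructure", "common"]

# Captures the dotted module path of a Python-style import statement.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE)


def layer_of(module):
    """Index of the first layer name found among the package segments, else None."""
    for part in module.split("."):
        if part in LAYERS:
            return LAYERS.index(part)
    return None


def check_file(module_name, source):
    """Return one message per import that reaches into a higher layer."""
    own = layer_of(module_name)
    if own is None:
        return []  # file is outside the layered packages
    violations = []
    for imported in IMPORT_RE.findall(source):
        target = layer_of(imported)
        if target is not None and target < own:
            violations.append(f"{module_name} imports {imported}")
    return violations


if __name__ == "__main__":
    src = "import app.starter.boot\nfrom app.common.util import x\n"
    for message in check_file("app.infrastructure.repo", src):
        print("layering violation:", message)
```

A mechanical rule like this is cheap to run before every PR, which leaves reviewers free for the parts a script cannot judge.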

Key Takeaways

Managing AI coding follows the same "people‑align → human‑machine‑align" logic used in Agent evaluation.

AI shifts the value of experience from "seeing the whole system" to "judging what is important"; engineers discovered ten hidden performance risks within hours.

Technical debt can be worked off like business demand: break it into side tasks attached to regular features, enabling continuous, zero‑schedule refactoring.

When AI writes most code, engineers should focus on designing and maintaining an environment that reliably guides AI output.
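The zero‑schedule idea from the takeaways can be sketched as a planning step that attaches each debt item to a story already touching the same module. The dictionary shapes, identifiers, and the "first matching story wins" rule are all assumptions for illustration; the source does not describe how the team actually matched debt to stories.

```python
from collections import defaultdict


def attach_debt(stories, debt_items):
    """Assign each debt item to the first story touching its module.

    Returns (plan, unassigned): plan maps story id -> list of debt ids,
    unassigned lists debt ids that no current story touches.
    """
    plan = defaultdict(list)
    unassigned = []
    for item in debt_items:
        owner = next(
            (s["id"] for s in stories if item["module"] in s["modules"]),
            None,
        )
        if owner is None:
            unassigned.append(item["id"])
        else:
            plan[owner].append(item["id"])
    return dict(plan), unassigned


if __name__ == "__main__":
    stories = [
        {"id": "S1", "modules": {"order"}},
        {"id": "S2", "modules": {"review", "export"}},
    ]
    debt = [
        {"id": "D1", "module": "order"},
        {"id": "D2", "module": "export"},
        {"id": "D3", "module": "billing"},
    ]
    print(attach_debt(stories, debt))
```

Items left unassigned simply wait until a later story touches their module, so no dedicated refactor sprint is scheduled.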

Action Guide for Other Teams

Step 1: Inventory technical debt. Let core developers define high‑risk zones and let AI perform exhaustive scans.

Step 2: Create AI‑friendly standards and embed them as always‑loaded AI Rules and Skills after achieving team consensus.

Step 3: Have a senior engineer prototype a full migration SOP; then distribute the SOP for the whole team to follow with AI assistance.

Step 4: Implement a pre‑PR mechanism where AI automatically checks code against the standards, allowing human reviewers to focus on business semantics.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.
