Can Agents Go Beyond Reporting? They Now Rewrite Code and Submit Their Own PRs
The article explains how AI agents can run overnight tests, automatically detect faulty modules, modify production code, and open pull requests, creating a closed-loop evaluation system that shifts testing from post‑hoc error spotting to proactive code iteration, provided three key prerequisites are met.
AI agents have reached the point where they can modify their own code. An agent runs tests overnight, parses execution logs line by line, pinpoints the problematic module, edits production‑grade code, verifies the fix, and opens a pull request. The next morning the team receives a high‑priority bug list and a ready‑to‑merge PR without any human intervention.
From Evaluation to Self‑Iterating Development
The role of evaluation changes from a post‑mortem examiner to a steering wheel that drives code self‑iteration. When a test exposes a bad case, the agent dives into the repository, fixes the issue, submits a PR, and the new version is fed back into the evaluation cycle, forming a fully automated loop.
Automatic Iteration Loop
Evaluation exposes a bad case → Agent enters codebase → Analyzes prompt, architecture, and orchestration code → Directly modifies harness/prompt/code → Runs the fix and opens a pull request → New version returns to evaluation → Repeat.
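The loop above can be sketched as a small driver function. This is a minimal illustration under stated assumptions, not the author's implementation: `BadCase`, `LoopResult`, `run_iteration_loop`, and the four callables (`evaluate`, `propose_fix`, `verify_fix`, `open_pull_request`) are hypothetical names introduced here, and the `max_rounds` cap is an added safeguard that the article does not mention.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class BadCase:
    """A failing evaluation case (hypothetical structure)."""
    case_id: str
    trace: str  # execution log captured during evaluation


@dataclass
class LoopResult:
    rounds: int = 0
    opened_prs: List[str] = field(default_factory=list)
    remaining: List[BadCase] = field(default_factory=list)


def run_iteration_loop(
    evaluate: Callable[[], List[BadCase]],
    propose_fix: Callable[[BadCase], Optional[str]],
    verify_fix: Callable[[str], bool],
    open_pull_request: Callable[[str], str],
    max_rounds: int = 3,  # assumption: cap the loop so it always terminates
) -> LoopResult:
    """Evaluate, patch, verify, open a PR, then re-evaluate."""
    result = LoopResult()
    result.remaining = evaluate()
    while result.remaining and result.rounds < max_rounds:
        result.rounds += 1
        for case in result.remaining:
            # The agent analyzes the trace and edits harness/prompt/code.
            patch = propose_fix(case)
            # Only verified fixes become pull requests.
            if patch is not None and verify_fix(patch):
                result.opened_prs.append(open_pull_request(patch))
        # The new version goes back into evaluation.
        result.remaining = evaluate()
    return result
```

Passing the four stages in as callables keeps the sketch independent of any particular agent framework or Git host; in a real system each one would wrap the evaluation harness, the debugging agent, the test runner, and the repository API respectively.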
Why It Works for Agents
Improving an agent usually means changing the harness, prompt, or orchestration code rather than retraining model weights, and those are precisely the artifacts agents excel at modifying.
Three Prerequisites for a Working Closed Loop
Automatic Scoring (Agent‑as‑Judge): Replace a fixed scoring prompt with an agent that analyzes each trace, compares it against top‑tier models, and aggregates the verdicts via multi‑model voting (e.g., GPT‑5, Claude). Scoring accuracy ranges from 90% to 95%, and the remaining variance can be compensated for downstream.
High‑Quality Evaluation Set: The set must stay slightly ahead of the training data so that it exposes what the model cannot yet handle. The difficulty distribution is deliberately skewed: 40% easy‑to‑medium, 20‑30% hard, and 10‑20% extreme cases (the “donkey‑chasing” principle). Real data should dominate, with synthetic data limited to 20‑30%.
Self‑Built Evaluation Platform Integrated with the Codebase: The platform must be tightly coupled with the repository and harness. When a trace lands, it triggers scoring, and the scorer immediately routes the identified problem to a debugging agent that patches the repository. Third‑party services such as LangSmith or Langfuse cannot provide this depth of integration, so the end‑to‑end coupling requires a self‑developed platform.
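The first and third prerequisites meet at one point: a trace arrives, several judges vote on it, and a failing verdict is handed to the debugging agent. The sketch below illustrates that hand-off under stated assumptions. The names `Verdict`, `vote_on_trace`, and `on_trace_landed` are hypothetical, each judge is a plain callable standing in for a model-backed judge agent (no real model API is called), and treating a tie as a failure is a policy chosen here for illustration, not one described in the article.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict

# A judge takes a trace and returns "pass" or "fail" (hypothetical interface).
Judge = Callable[[str], str]


@dataclass
class Verdict:
    label: str             # majority label: "pass" or "fail"
    agreement: float       # share of judges behind the most common vote
    votes: Dict[str, str]  # judge name -> that judge's vote


def vote_on_trace(trace: str, judges: Dict[str, Judge]) -> Verdict:
    """Aggregate independent judge verdicts by majority vote."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = {name: judge(trace) for name, judge in judges.items()}
    for name, vote in votes.items():
        if vote not in ("pass", "fail"):
            raise ValueError(f"judge {name!r} returned {vote!r}")
    ranked = Counter(votes.values()).most_common()
    label, top_count = ranked[0]
    # Assumption: on a tie, err on the side of "fail" so the case is inspected.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        label = "fail"
    return Verdict(label, top_count / len(votes), votes)


def on_trace_landed(
    trace: str,
    judges: Dict[str, Judge],
    debug_agent: Callable[[str, Verdict], None],
) -> Verdict:
    """Score a new trace and route failures to the debugging agent."""
    verdict = vote_on_trace(trace, judges)
    if verdict.label == "fail":
        debug_agent(trace, verdict)
    return verdict
```

Keeping the per-judge votes and the agreement share in the verdict gives downstream steps something to work with when the judges disagree, which is one plausible way to handle the scoring variance mentioned above.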
Beyond Labor Savings
When the three conditions are satisfied, the system does more than automate testing: it turns evaluation from a reactive error‑spotting step into a proactive development driver. The AI begins to set its own direction, rewrite code, and verify changes autonomously, which lowers the entry barrier for teams that are not algorithm specialists.
Further Learning
The author, Zhang He, formerly an early member of Xiaomi AI Lab, will give a more in‑depth public talk on the hardest engineering details: how the debugging agent connects to the codebase, how multi‑model voting is implemented for Agent‑as‑Judge, and how a custom evaluation platform integrates with the harness. The talk will cover both the engineering and the model‑training perspectives.
