Can AI Agents Fully Automate Software Development? A Deep Dive into AutoResearch Adaptation

This article details how Andrej Karpathy's AutoResearch methodology was adapted to software development, introducing multi‑agent cross‑review, a five‑dimensional quantitative scoring system, and feedback‑driven iteration to build a fully automatic pipeline that resolves a medium‑complexity GitHub Issue in about ten minutes with a 9.0/10 code‑quality score.


The project adapts Andrej Karpathy’s AutoResearch framework—originally a minimalist Python tool for AI research—to fully automate software development tasks on GitHub.

Core Idea

Instead of modifying train.py and checking validation loss, the adapted loop processes a GitHub Issue, generates implementation code, runs unit tests, and computes a weighted score across five quality dimensions. When the total score reaches a configurable threshold (default ≥ 9.0/10), the changes are committed, a pull request is created, and the branch is merged automatically.
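
That control flow can be sketched as follows. This is a minimal illustration, not the actual run.sh logic: the helper functions (implement, runTests, review) are hypothetical stand‑ins for the agent and test invocations the script performs.

package main

import "fmt"

// Hypothetical stand-ins for the agent calls and test runs that run.sh
// orchestrates; they exist only to show the control flow.
func implement(issue int, feedback string) error { return nil }
func runTests() error                            { return nil }
func review(issue int) (float64, string)         { return 9.0, "all requirements met" }

func resolveIssue(issue, maxIters int, threshold float64) error {
	feedback := ""
	for i := 1; i <= maxIters; i++ {
		if err := implement(issue, feedback); err != nil {
			return err
		}
		if err := runTests(); err != nil {
			feedback = "test failed: " + err.Error() // fed into the next prompt
			continue
		}
		score, comments := review(issue)
		fmt.Printf("iteration %d: score %.1f\n", i, score)
		if score >= threshold {
			return nil // commit, push, open a PR, and merge
		}
		feedback = comments // review comments drive the next round
	}
	return fmt.Errorf("issue #%d not resolved within %d iterations", issue, maxIters)
}

func main() {
	if err := resolveIssue(21, 10, 9.0); err != nil {
		fmt.Println(err)
	}
}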

Key Technical Enhancements

Multi‑Agent Cross‑Review: Two LLM agents (e.g., Codex and Claude) alternate as implementer and reviewer, exposing each other’s blind spots and improving code quality (a sketch of the rotation follows this list).

Five‑Dimensional Weighted Scoring: Scores are calculated from correctness (35 %), test coverage (25 %), code quality (20 %), security (10 %), and performance (10 %). The weighted sum yields a total out of 10.

Feedback‑Driven Iteration: Review comments are fed back into the next iteration’s prompt, allowing the agent to address concrete issues rather than retry blindly.
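
To illustrate the cross‑review rotation, the sketch below shows how two agents could swap the implementer and reviewer roles each round. The agent names are placeholders and the alternation scheme is an assumption, not a documented detail of run.sh.

package main

import "fmt"

func main() {
	// Hypothetical rotation: the implementer of one round becomes the
	// reviewer of the next, so each model audits the other's output.
	agents := [2]string{"codex", "claude"}
	for round := 1; round <= 3; round++ {
		implementer := agents[(round-1)%2]
		reviewer := agents[round%2]
		fmt.Printf("round %d: %s implements, %s reviews\n", round, implementer, reviewer)
	}
}

This matches the example run later in the article, where Codex implements in iterations 1 and 3 and Claude takes over in iteration 2.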

System Architecture

The repository https://github.com/smallnest/autoresearch contains the following core components:

autoresearch/
├── program.md          # Constitution: rules, permissions, quality standards
├── issue-selector.md   # Issue prioritisation and exclusion rules
├── run.sh              # Orchestration script
├── agents/
│   ├── codex.md        # Codex role: implementation instructions
│   ├── claude.md       # Claude role: review instructions and scoring
│   └── gemini.md       # Optional third agent
├── workflows/          # Per‑issue logs and artifacts
└── results.tsv         # Summary of all completed issues

Four‑Phase Execution Cycle

Phase 1 – Environment Setup: Verify dependencies (GitHub CLI gh, acpx, Go), fetch the Issue, and create a feature branch.

Phase 2 – Core Iteration: Agents take turns implementing and reviewing, run tests, compute the weighted score, and either continue to the next round or stop.

Phase 3 – Automatic Commit & PR: If the score meets or exceeds the threshold, the script commits, pushes, creates a PR, and merges it.

Phase 4 – Archival: Write iteration logs to results.tsv and a per‑issue log.md for traceability.

Scoring Details

Each dimension is scored from 1 (fatal) to 10 (perfect), and the weighted sum yields the total out of 10. Severity maps to per‑dimension scores as follows:

No issues → 10

Suggestions → 9

General problems → 7

Severe problems → 4

Fatal problems → 1
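
A minimal sketch of the weighted sum using the weights above; the struct and method names are illustrative, not taken from the repository.

package main

import "fmt"

// Scores holds the five per-dimension scores, each on the 1-10 scale above.
type Scores struct {
	Correctness, TestCoverage, CodeQuality, Security, Performance float64
}

// Total applies the documented weights: 35/25/20/10/10 percent.
func (s Scores) Total() float64 {
	return 0.35*s.Correctness +
		0.25*s.TestCoverage +
		0.20*s.CodeQuality +
		0.10*s.Security +
		0.10*s.Performance
}

func main() {
	s := Scores{Correctness: 9, TestCoverage: 9, CodeQuality: 9, Security: 10, Performance: 9}
	fmt.Printf("total: %.2f / 10 (passes the default 9.0 threshold)\n", s.Total())
}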

Example Run

Running run.sh 21 on Issue #21 (a medium‑complexity feature) produced:

Iteration 1 (Codex): score 1.0 – implementation incomplete
Iteration 2 (Claude): score 5.0 – timeout control added, still missing parts
Iteration 3 (Codex): score 9.0 – all requirements met, PR auto‑merged

The entire process took ~10 minutes, required three iterations, and achieved a final score of 9.0/10.

Configuration & Extensibility

Users can customise agents, scoring weights, and issue‑selection policies by adding a .autoresearch/ directory with their own agents/ markdown files and optional workflow overrides.
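
For example, an override directory might look like this; the layout below is a hypothetical mirror of the repository’s own structure, not a documented contract:

.autoresearch/
├── agents/
│   ├── codex.md        # custom implementer instructions
│   └── claude.md       # custom review rubric and score weights
└── workflows/          # optional workflow overrides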

Issue Selection Policy

Issues are excluded if they carry the labels wontfix, duplicate, invalid, blocked, needs discussion, on hold, or external; if their titles contain [WIP] or [DRAFT]; if their bodies contain DO NOT IMPLEMENT; or if they are already linked to a pull request.

Priority is computed as:

score = base_weight(15) + label_weight + type_weight + time_factor

Label weight: critical = 100, high = 50, medium = 20, low = 10

Type weight: bug = 30, feature = 20, refactor = 10, test = 5, docs = 3

Time factor: new +10, stale +15, recent update +5
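
A sketch of that computation with the weights above; the Go function is illustrative, since the actual policy is defined in issue-selector.md rather than in code.

package main

import "fmt"

const baseWeight = 15.0

var labelWeight = map[string]float64{"critical": 100, "high": 50, "medium": 20, "low": 10}
var typeWeight = map[string]float64{"bug": 30, "feature": 20, "refactor": 10, "test": 5, "docs": 3}

// priority implements: score = base_weight(15) + label_weight + type_weight + time_factor.
func priority(label, kind string, timeFactor float64) float64 {
	return baseWeight + labelWeight[label] + typeWeight[kind] + timeFactor
}

func main() {
	// A new high-priority bug: 15 + 50 + 30 + 10 = 105.
	fmt.Println(priority("high", "bug", 10))
}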

Program.md Rules

program.md defines what agents may do (modify internal/, create test files, run tests, commit) and what they must not do (modify go.mod, delete existing files, push directly, close the Issue, or edit the autoresearch/ rules themselves).

Code & Test Standards (Go)

Follow Effective Go and Go Code Review Comments.

Run gofmt, goimports, golangci-lint.

Package names lowercase, file names snake_case, exported identifiers UpperCamelCase.

All new functionality must have unit tests with ≥ 70 % coverage, using table‑driven style and avoiding time.Sleep, external dependencies, global state, or hard‑coded ports.
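
A minimal table‑driven test in that style, placed in a _test.go file; the clamp function is an invented example, not code from the repository.

package example

import "testing"

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

func TestClamp(t *testing.T) {
	tests := []struct {
		name      string
		v, lo, hi int
		want      int
	}{
		{"below range", -5, 0, 10, 0},
		{"in range", 5, 0, 10, 5},
		{"above range", 15, 0, 10, 10},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := clamp(tt.v, tt.lo, tt.hi); got != tt.want {
				t.Errorf("clamp(%d, %d, %d) = %d, want %d", tt.v, tt.lo, tt.hi, got, tt.want)
			}
		})
	}
}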

Robustness Features

Exponential back‑off with jitter for API failures (max 60 s, up to 10 retries); a code sketch follows this list.

Terminate after three consecutive agent failures.

Test failures are fed back as "test failed" prompts for the next iteration.
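
A sketch of the back‑off described in the first item above, assuming a 60 s delay cap and 10 attempts; the real run.sh implementation may differ in its exact delays.

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

const (
	maxRetries = 10
	maxDelay   = 60 * time.Second
)

// withBackoff retries fn with exponential back-off plus random jitter,
// capping each delay at 60 s and giving up after 10 attempts.
func withBackoff(fn func() error) error {
	delay := time.Second
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err := fn(); err == nil {
			return nil
		}
		sleep := delay + time.Duration(rand.Int63n(int64(delay/2))) // add jitter
		if sleep > maxDelay {
			sleep = maxDelay
		}
		if attempt < maxRetries {
			fmt.Printf("attempt %d failed, retrying in %v\n", attempt, sleep)
			time.Sleep(sleep)
		}
		delay *= 2
	}
	return errors.New("giving up after 10 failed attempts")
}

func main() {
	calls := 0
	err := withBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("transient API failure")
		}
		return nil
	})
	fmt.Println("result:", err)
}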

Running the Tool

Prerequisites: GitHub CLI (gh), acpx (agent control), and the Go toolchain.

# Verify environment
gh auth status
which acpx
go version

# Execute a single Issue (e.g., #21)
/path/to/autoresearch/run.sh 21
# Optionally limit maximum iterations
/path/to/autoresearch/run.sh 21 10

The script performs all steps automatically: environment check, Issue fetch, branch creation, alternating Codex/Claude implementation & review, scoring, and PR creation/merge when the threshold is met.

Real‑World Cases

Issue #21 (add timeout and retry logic to a job executor) completed in 3 iterations with a final score of 9.0/10. The log shows progressive improvements from an initial score of 1.0 to a passing score of 9.0, after which the PR was auto‑merged.

Issue #15 (define source‑of‑truth event protocol) reached a passing score after 2 iterations. Issue #6 (add web UI for sessions) required 5 iterations and achieved a score exceeding the threshold.

Full asciinema replay of Issue #21: https://asciinema.org/a/896260

Best Practices

Start with small, well‑scoped Issues to validate the pipeline.

Keep program.md up to date to reflect evolving quality standards.

Monitor score trends in per‑issue log.md for regressions.

Leverage the multi‑agent cross‑review to catch blind spots.

Use the built‑in exponential back‑off for transient API failures.
