Can AI Agents Fully Automate Medium‑Complexity GitHub Issues in 10 Minutes?
This article analyzes how the AutoResearch method pioneered by Andrej Karpathy was adapted to software development, detailing three key enhancements—multi‑agent cross‑review, a five‑dimensional scoring system, and feedback‑driven iteration—that enable a fully autonomous pipeline capable of completing a medium‑complexity issue in about ten minutes with a 9.0/10 code quality score.
Karpathy’s autoresearch (released March 2026) is a 600‑line Python tool that automates a tiny LLM training loop: an AI agent repeatedly edits train.py, runs a 5‑minute experiment on a single GPU, and commits only when validation loss improves; otherwise it reverts. The loop rests on three principles: (1) quantify the objective (validation loss), (2) iterate autonomously, (3) keep only improvements.
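A minimal sketch of that keep‑only‑improvements loop, with the five‑minute training run replaced by a placeholder loss function (an illustration, not Karpathy's actual code):

```go
package main

import "fmt"

// runExperiment stands in for "agent edits train.py and trains for ~5 minutes";
// here it just returns a made-up validation loss for candidate i.
func runExperiment(i int) float64 {
	losses := []float64{2.35, 2.28, 2.31, 2.22, 2.25}
	return losses[i%len(losses)]
}

func main() {
	best := 2.40 // validation loss of the last committed version
	for i := 0; i < 5; i++ {
		loss := runExperiment(i)
		if loss < best {
			best = loss
			fmt.Printf("iter %d: loss %.2f improved, commit\n", i, loss) // keep only improvements
		} else {
			fmt.Printf("iter %d: loss %.2f no improvement, revert\n", i, loss)
		}
	}
}
```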
Adapting the loop to software development
The traditional "human writes code → run tests → fix bugs" workflow collapses when dozens of GitHub Issues accumulate. By replacing the training loop with "implement Issue → run tests → score ≥ 9.0 → merge", the system can resolve a medium‑complexity issue in roughly ten minutes with zero human intervention.
Key enhancements over the original AutoResearch
Multi‑agent cross‑review: Codex and Claude alternate as implementer and reviewer (odd rounds: Codex implements, Claude reviews; even rounds: Claude implements, Codex reviews). This mitigates each model’s blind spots and yields higher code quality than a single‑agent setup.
Five‑dimensional weighted scoring: the single metric (validation loss) is replaced by a weighted sum of five dimensions—Correctness (35 %), Tests (25 %), Code Quality (20 %), Security (10 %), Performance (10 %). A total score of 9.0/10 or higher triggers automatic PR creation.
Feedback‑driven iteration: when a round fails, the exact feedback (e.g., "test failed") is injected into the next agent’s prompt, enabling targeted fixes instead of blind retries. The sketch below combines all three mechanisms.
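In outline, one pass of the adapted loop looks roughly like this. The Agent type and its implement/review methods are illustrative stand‑ins for the real CLI calls to Codex and Claude, not the repository's actual code:

```go
package main

import "fmt"

// Agent is a stand-in for one CLI-driven model (Codex or Claude).
type Agent struct{ Name string }

// implement and review are hypothetical stubs; the real system shells out to
// the agents with prompts built from program.md and the Issue text.
func (a Agent) implement(issue int, feedback string) { /* edit code on the feature branch */ }
func (a Agent) review(issue int) (score float64, notes string) {
	return 9.1, "looks good" // placeholder review result
}

func main() {
	codex, claude := Agent{"Codex"}, Agent{"Claude"}
	feedback := ""
	for round := 1; round <= 10; round++ {
		impl, rev := codex, claude
		if round%2 == 0 { // even rounds swap roles: cross-review
			impl, rev = claude, codex
		}
		impl.implement(21, feedback) // prior feedback is injected into the prompt
		score, notes := rev.review(21)
		fmt.Printf("round %d: %s implements, %s reviews, score %.1f\n",
			round, impl.Name, rev.Name, score)
		if score >= 9.0 { // quality gate: commit, open the PR, merge
			break
		}
		feedback = notes // feedback-driven iteration: targeted fixes next round
	}
}
```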
System architecture
The repository https://github.com/smallnest/autoresearch contains the core files:
autoresearch/
├── program.md # "constitution" defining rules, permissions, quality standards
├── issue-selector.md # Issue prioritization and exclusion strategy
├── run.sh # Orchestration script
├── agents/
│ ├── codex.md # Codex role: implementation instructions, lint checklist
│ ├── claude.md # Claude role: review criteria, scoring template
│ └── gemini.md # Optional third agent
├── workflows/ # Auto‑generated per‑issue logs
└── results.tsv # Aggregated results
The workflow proceeds through four phases:
Phase 1 – Environment preparation: verify dependencies (GitHub CLI gh, acpx, Go; a dependency‑check sketch follows this list), fetch Issue data, create a feature branch.
Phase 2 – Core iteration loop: odd rounds run Codex → Claude, even rounds run Claude → Codex. Each round performs implementation, testing, and scoring, then either proceeds to Phase 3 (if score ≥ 9.0) or feeds the review back for another round.
Phase 3 – Automatic commit & PR: commit, push, create the PR via gh pr create, then merge.
Phase 4 – Archiving: write iteration logs to results.tsv and per‑issue log.md for traceability.
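The Phase 1 dependency check can be as simple as probing PATH for the required binaries. A sketch, not the actual run.sh logic:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Phase 1: verify the tools the pipeline shells out to are installed.
	for _, tool := range []string{"gh", "acpx", "go"} {
		if _, err := exec.LookPath(tool); err != nil {
			fmt.Fprintf(os.Stderr, "missing dependency: %s\n", tool)
			os.Exit(1)
		}
	}
	fmt.Println("environment OK")
}
```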
Scoring methodology
Raw findings are mapped to a 0‑10 scale (no issue = 10, suggestion = 9, minor = 7, severe = 4, fatal = 1). The weighted formula is:
Score = 0.35·Correctness + 0.25·Tests + 0.20·CodeQuality + 0.10·Security + 0.10·Performance
Only when the total reaches 9.0 does the system auto‑merge the PR.
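As a sanity check of the arithmetic, here is the weighted sum in code, using hypothetical dimension scores already mapped to the 0‑10 scale above:

```go
package main

import "fmt"

func main() {
	// Hypothetical review: each dimension mapped to 0-10
	// (no issue = 10, suggestion = 9, minor = 7, severe = 4, fatal = 1).
	correctness, tests, quality, security, performance := 9.0, 9.0, 9.0, 10.0, 10.0

	score := 0.35*correctness + 0.25*tests + 0.20*quality +
		0.10*security + 0.10*performance
	fmt.Printf("total score: %.2f\n", score) // 9.20, above the 9.0 auto-merge threshold
}
```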
Issue selection & prioritization
Issues are excluded if they carry any of the labels wontfix, duplicate, invalid, blocked, needs discussion, on hold, or external; if the title contains [WIP] or [DRAFT]; if the body contains DO NOT IMPLEMENT; or if a PR is already linked.
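A rough sketch of that exclusion filter; the issue struct and its fields are simplified stand‑ins for the Issue data fetched via the GitHub CLI, not the project's actual types:

```go
package main

import (
	"fmt"
	"strings"
)

// issue is a simplified stand-in for the Issue data fetched via the GitHub CLI.
type issue struct {
	Title, Body string
	Labels      []string
	HasLinkedPR bool
}

var excludedLabels = map[string]bool{
	"wontfix": true, "duplicate": true, "invalid": true, "blocked": true,
	"needs discussion": true, "on hold": true, "external": true,
}

func excluded(is issue) bool {
	for _, l := range is.Labels {
		if excludedLabels[strings.ToLower(l)] {
			return true
		}
	}
	title := strings.ToUpper(is.Title)
	if strings.Contains(title, "[WIP]") || strings.Contains(title, "[DRAFT]") {
		return true
	}
	return strings.Contains(is.Body, "DO NOT IMPLEMENT") || is.HasLinkedPR
}

func main() {
	fmt.Println(excluded(issue{Title: "[WIP] add retries"})) // true: draft marker
	fmt.Println(excluded(issue{Title: "add job timeout"}))   // false: eligible
}
```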
Priority is computed as:
Priority = Base(15) + TagWeight + TypeWeight + TimeFactor
TagWeight: critical 100 > high 50 > medium 20 > low 10
TypeWeight: bug 30 > feature 20 > refactor 10 > test 5 > docs 3
TimeFactor: new +10 / old +15 / recent update +5
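The priority formula in code. The weights follow the lists above; how an issue is classified as new, old, or recently updated is left as an assumption here:

```go
package main

import "fmt"

var tagWeight = map[string]int{"critical": 100, "high": 50, "medium": 20, "low": 10}
var typeWeight = map[string]int{"bug": 30, "feature": 20, "refactor": 10, "test": 5, "docs": 3}

// Priority = Base(15) + TagWeight + TypeWeight + TimeFactor
func priority(tag, kind string, timeFactor int) int {
	return 15 + tagWeight[tag] + typeWeight[kind] + timeFactor
}

func main() {
	// A hypothetical old, high-severity bug (TimeFactor +15 for an old issue).
	fmt.Println(priority("high", "bug", 15)) // 15 + 50 + 30 + 15 = 110
}
```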
Running the system
Prerequisites:
GitHub CLI (gh) with authentication configured.
acpx – the command‑line bridge that drives Claude Code and Codex.
Go toolchain (required because the target project is written in Go).
Typical invocation:
# Process Issue #21
/path/to/autoresearch/run.sh 21
# Limit the loop to a maximum of 10 iterations
/path/to/autoresearch/run.sh 21 10
The script performs the full pipeline: environment checks → Issue fetch → branch creation → alternating Codex/Claude implementation & review → scoring → automatic PR & merge when the threshold is met.
Real‑world case studies
Issue #21 – add job timeout & retries (medium complexity)
Iter 1 (Codex): score 1.0 – only read existing code, no functionality.
Iter 2 (Claude): score 5.0 – added timeout control but still incomplete.
Iter 3 (Codex): score 9.0 – passed, auto‑commit & PR.
Changes: added a Timeout field to job.go, new tests in job_test.go, and REST API enhancements.
Total time ≈ 10 min, 3 iterations.
Issue #15 – define source‑of‑truth event protocol (feature)
Iter 1 (Codex): score 5.0 – design feedback.
Iter 1 (Claude): score 7.0 – implementation detail feedback.
Iter 2 (Codex): score 9.1 – threshold reached, auto‑PR merged.
Total iterations: 2 (odd round: Codex implements, Claude reviews; even round: Claude implements, Codex reviews).
Issue #6 – add web UI for sessions (high complexity)
Iterations: 5 rounds.
Final score: 15/10 (both agents gave maximum scores).
Result: automatic PR and merge.
Best practices
Start with small, low‑complexity issues to validate the pipeline.
Keep program.md up‑to‑date; adjust rules or weightings as project needs evolve.
Monitor score trends in per‑issue log.md to ensure steady improvement.
Leverage the cross‑review between Codex and Claude to catch blind spots.
Use exponential back‑off for flaky API calls (max 60 s, up to 10 retries).
Configure a failure counter; abort after three consecutive agent failures to avoid endless loops. A sketch of both safeguards follows.
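A sketch of those two safeguards together: exponential back‑off capped at 60 s with up to 10 retries per call, plus a hard abort after three consecutive agent failures. The callAgent stub is hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callAgent is a stand-in for one flaky agent/API invocation.
func callAgent(attempt int) error {
	if attempt == 0 {
		return errors.New("rate limited") // simulate a transient failure
	}
	return nil
}

// withBackoff retries up to 10 times, doubling the delay and capping it at 60s.
func withBackoff(call func(int) error) error {
	delay := time.Second
	for attempt := 0; attempt < 10; attempt++ {
		if err := call(attempt); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2
		if delay > 60*time.Second {
			delay = 60 * time.Second
		}
	}
	return errors.New("gave up after 10 attempts")
}

func main() {
	consecutiveFailures := 0
	for round := 1; round <= 5; round++ {
		if err := withBackoff(callAgent); err != nil {
			consecutiveFailures++
			if consecutiveFailures >= 3 { // abort instead of looping forever
				fmt.Println("three consecutive agent failures, aborting")
				return
			}
			continue
		}
		consecutiveFailures = 0
		fmt.Printf("round %d succeeded\n", round)
	}
}
```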
The system demonstrates that a well‑engineered multi‑agent loop with quantitative, multi‑dimensional scoring can replace most manual coding steps, delivering high‑quality code at scale with near‑zero human intervention.
BirdNest Tech Talk
Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.