Can AI Agents Build a Million‑Line Codebase in One‑Fifth the Time?

This article details how a three‑engineer team used OpenAI's Codex agents to generate an entire production‑ready software stack (over a million lines of code, roughly 1,500 pull requests, and a full CI/CD pipeline) in about one‑tenth the effort of manual coding, and describes the architectural, operational, and organizational adjustments that agent‑first development required.

Qborfy AI

Over five months the team built and shipped an internal beta of a software product without a single line of human‑written code. Codex generated everything: application logic, tests, CI configuration, documentation, observability, and internal developer tools, for an estimated one‑tenth the effort of traditional development.

Starting from an Empty Repository

The first commit appeared in late August 2025. Using a small set of existing templates, Codex CLI (powered by GPT‑5) generated the repository layout, CI settings, formatting rules, package manager configuration, and the application framework. Even the initial AGENTS.md file, which describes how agents should operate in the repo, was produced by Codex.

Within five months the repository grew to about one million lines of code, covering product code, tests, CI pipelines, internal tools, and documentation. Approximately 1,500 pull requests were opened and merged, driven by a three‑person engineering team, yielding an average throughput of 3.5 PRs per engineer per day. When the team expanded to seven engineers, throughput continued to increase.

Engineers’ Role Shift

Because no code was written manually, engineers focused on system design, architecture, and leverage points. Early progress lagged not due to Codex’s capabilities but because the environment lacked clear specifications, tooling, and abstractions. The team’s primary task became helping agents accomplish useful work by breaking high‑level goals into smaller modules (design, code, review, test) and prompting agents to build each piece.

When agents stalled, engineers asked, “What capability is still missing, and how can we make it clear and enforceable for the agent?” Interaction was almost entirely through prompts: describing a task, letting the agent open a PR, having the agent self‑review, requesting additional agent reviews, and iterating until the PR satisfied all reviewers—a process the authors call the Ralph Wiggum loop [3].
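The loop described above can be sketched in code. This is a hypothetical illustration, not OpenAI's actual tooling: the `Agent` and `Reviewer` shapes, and the convergence logic, are all invented to make the control flow concrete.

```typescript
// Hypothetical sketch of the "Ralph Wiggum loop": re-prompt the agent,
// gather reviews, and iterate until every reviewer approves.

type Review = { approved: boolean; comments: string[] };
type Agent = {
  // Produces (or revises) a PR diff for the task, given accumulated feedback.
  revise: (task: string, feedback: string[]) => string;
};
type Reviewer = (diff: string) => Review;

function ralphWiggumLoop(
  agent: Agent,
  reviewers: Reviewer[],
  task: string,
  maxRounds = 10,
): { diff: string; rounds: number } {
  let feedback: string[] = [];
  let diff = "";
  for (let round = 1; round <= maxRounds; round++) {
    diff = agent.revise(task, feedback);           // agent opens/updates the PR
    const reviews = reviewers.map((r) => r(diff)); // each reviewer (agent or human) weighs in
    if (reviews.every((r) => r.approved)) {
      return { diff, rounds: round };              // all reviewers satisfied
    }
    feedback = reviews.flatMap((r) => r.comments); // fold comments into the next prompt
  }
  throw new Error(`no consensus after ${maxRounds} rounds`);
}
```

The key design choice the article implies is that the loop terminates on reviewer consensus rather than on a fixed checklist, so adding another reviewing agent tightens the bar without changing the loop itself.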

Making the Application Agent‑Readable

As agent throughput grew, human QA became the bottleneck. The team exposed UI, logs, and metrics in a form directly consumable by Codex. By integrating the Chrome DevTools protocol, agents could capture DOM snapshots, screenshots, and navigation, allowing them to reproduce bugs, verify fixes, and reason about UI behavior.
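To make this concrete, here is a minimal sketch of the messages an agent might send over the Chrome DevTools Protocol. The method names (`Page.navigate`, `DOM.getDocument`, `Page.captureScreenshot`) are real CDP methods, but the helper, the message ordering, and the URL are assumptions for illustration; actually sending these over the DevTools WebSocket is out of scope.

```typescript
// CDP commands are JSON-RPC-style messages with an id, method, and params.
type CdpMessage = { id: number; method: string; params?: object };

let nextId = 0;
function cdp(method: string, params?: object): CdpMessage {
  return { id: ++nextId, method, ...(params ? { params } : {}) };
}

// A bug-reproduction run might issue, in order:
const session: CdpMessage[] = [
  cdp("Page.navigate", { url: "http://localhost:3000/checkout" }), // hypothetical page
  cdp("DOM.getDocument"),                            // DOM snapshot for the agent to inspect
  cdp("Page.captureScreenshot", { format: "png" }),  // visual evidence to attach to the PR
];
```

Because the protocol is plain structured messages, an agent can compose, replay, and reason about browser sessions the same way it reasons about any other text.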

The observability stack (logs, metrics, traces) was presented to agents via local LogQL and PromQL queries, enabling prompts such as “ensure service starts within 800 ms” or “no user journey span exceeds two seconds.” Individual Codex runs on a single task often lasted over six hours, typically spanning a human’s sleep period.
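Prompts like those bottom out in queries the agent can run against the metrics store itself. A hedged sketch of what such PromQL might look like (the metric and label names are hypothetical, not from the article):

```promql
# p99 startup latency must stay under 800 ms (hypothetical metric name)
histogram_quantile(0.99,
  sum(rate(app_startup_duration_seconds_bucket[15m])) by (le)) < 0.8

# no user-journey span may exceed two seconds (hypothetical metric name)
max_over_time(user_journey_span_duration_seconds[15m]) <= 2
```

Expressed this way, a performance requirement becomes something an agent can check mechanically after every change rather than a statement in a design doc.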

Turning the Repository into a Knowledge System

Instead of a monolithic AGENTS.md, the team treated it as a table of contents pointing to a structured docs/ directory. The hierarchy includes design docs, execution plans, generated schema, product specs, and references. This progressive disclosure lets agents start from a small, stable entry point and navigate deeper as needed.
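A hypothetical sketch of what such a table-of-contents AGENTS.md might look like (the file names and layout are illustrative, not taken from the article):

```markdown
# AGENTS.md — start here

- Architecture invariants: docs/design/architecture.md
- Current execution plans: docs/plans/
- Generated schema reference: docs/schema/
- Product specs: docs/specs/
- How to run tests and linters: docs/reference/ci.md
```

The entry point stays small and stable; everything volatile lives one link deeper, which is what lets agents load only the context a task actually needs.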

Automated linters and CI jobs verify that the knowledge base stays up‑to‑date, cross‑linked, and correctly structured. A dedicated “doc‑gardening” agent scans for stale documentation, opens corrective PRs, and ensures the repository reflects the current codebase.
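One mechanical core of such a doc-gardening pass is checking that repo-relative links in the docs still point at files that exist. The sketch below is an assumption about how such a check might work; the real agent presumably layers staleness heuristics and PR creation on top.

```typescript
// Flag markdown links whose repo-relative targets no longer exist.
const LINK_RE = /\]\(([^)]+)\)/g; // markdown link targets: [text](target)

function brokenLinks(docText: string, existingPaths: Set<string>): string[] {
  const broken: string[] = [];
  for (const match of docText.matchAll(LINK_RE)) {
    const target = match[1];
    if (target.startsWith("http")) continue; // only check repo-relative links
    if (!existingPaths.has(target)) broken.push(target);
  }
  return broken;
}
```

Run in CI over every doc file, a check like this turns “keep the knowledge base cross-linked” from a convention into a failing build.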

Strict Architecture and “Taste”

The team enforced invariants rather than micromanaging implementation. For example, agents must resolve data shapes at boundaries but are not forced to use a specific library (they often prefer Zod). The architecture imposes a strict layer order (Types → Config → Repo → Service → Runtime → UI) with a single explicit interface for cross‑cutting concerns called Providers. Custom linters (generated by Codex) and structural tests mechanically enforce these constraints.
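The heart of such a structural lint is small. The sketch below checks one import edge against the layer order above; it is an invented illustration (the article does not show the linter), and walking real source files to extract the edges is omitted.

```typescript
// A module may import only from its own layer or layers closer to Types;
// importing "upward" (e.g. Config importing from UI) is a violation.
const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"] as const;
type Layer = (typeof LAYERS)[number];

function lintImport(importer: Layer, imported: Layer): string | null {
  const from = LAYERS.indexOf(importer);
  const to = LAYERS.indexOf(imported);
  // Higher index = further from Types; importing a higher layer is illegal.
  return to > from
    ? `${importer} may not import from ${imported} (layer order violation)`
    : null;
}
```

Because the rule is a one-line comparison over an explicit ordering, an agent can both obey it and regenerate the linter itself without ambiguity.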

This disciplined architecture, usually only adopted after a team reaches hundreds of engineers, is presented as an early prerequisite for agent‑first development: constraints keep speed high and prevent architectural drift.

Throughput Changes Merge Philosophy

Higher agent throughput rendered many traditional engineering safeguards ineffective. PR lifecycles are short; occasional test failures are resolved by re‑runs rather than blocking progress. In an environment where agents can process far more changes than humans can review, the cost of errors is low, but the cost of waiting is high, so the team prefers rapid iteration.

What “Agent‑Generated” Means

- Product code and tests
- CI configuration and release tooling
- Internal developer tools
- Documentation and design history
- Review comments and replies
- Scripts that manage the repository itself
- Production dashboard definition files

Humans remain involved, primarily translating user feedback into acceptance criteria and validating outcomes. When agents encounter a roadblock, the team treats it as a signal to add missing tools, guidance, or constraints, which the agents then implement themselves.

Increasing Autonomy

Eventually Codex achieved end‑to‑end ownership of a change, following a ten‑step workflow:

1. Validate the current repository state
2. Reproduce a reported bug
3. Record a video of the failure
4. Implement a fix
5. Verify the fix by running the application
6. Record a second video demonstrating the solution
7. Open a pull request
8. Respond to agent and human feedback
9. Detect and fix build failures
10. Merge the change only when judgment is required
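The workflow above can be sketched as a pipeline of named steps that halts at the first failure so a human (or another agent) can intervene. The step bodies here are stubs, an assumption about structure only; the real steps drive the browser, CI, and git.

```typescript
type StepResult = { ok: boolean; note?: string };
type Step = { name: string; run: () => StepResult };

// Run steps in order; stop at the first failure and report where we halted.
function runWorkflow(steps: Step[]): { completed: string[]; haltedAt?: string } {
  const completed: string[] = [];
  for (const step of steps) {
    const result = step.run();
    if (!result.ok) return { completed, haltedAt: step.name };
    completed.push(step.name);
  }
  return { completed };
}
```

Halting with a named step, rather than throwing, matters for agent-first pipelines: the halt point itself becomes a prompt describing what capability is still missing.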

The authors caution that this level of autonomy depends heavily on the repository’s specific structure and tooling and should not be assumed to generalize without similar investment.

Entropy and Garbage Collection

Fully autonomous agents eventually reproduce suboptimal patterns, causing drift. Initially, humans manually cleaned “AI residue” weekly, but this proved unscalable. The team encoded “golden principles” into the repo and introduced a continuous cleanup loop: shared utility packages replace ad‑hoc helpers, and type‑safe SDKs replace YOLO‑style data detection.

Background Codex tasks periodically scan for deviations, update quality grades, and open targeted refactor PRs—most of which are reviewed and merged automatically within a minute. This process resembles garbage collection, treating technical debt as high‑interest loans that are repaid incrementally.
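A minimal sketch of such a pass might grade modules and queue a bounded batch of refactor PRs, worst offenders first. The grading signal and batch limit below are invented for illustration; the article does not describe the actual heuristic.

```typescript
type Module = { path: string; deviations: number }; // deviations from the "golden principles"

// Pick the worst offenders, capped per pass so cleanup stays incremental,
// like a garbage collector repaying technical debt in small collections.
function refactorQueue(modules: Module[], maxPRs = 3): string[] {
  return modules
    .filter((m) => m.deviations > 0)
    .sort((a, b) => b.deviations - a.deviations) // worst first
    .slice(0, maxPRs)                            // bounded batch per pass
    .map((m) => m.path);
}
```

Bounding each pass is the point of the garbage-collection analogy: debt is repaid continuously in small, reviewable increments instead of in rare, risky rewrites.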

Ongoing Learnings

While the strategy has worked well internally at OpenAI, the long‑term evolution of architectural coherence in a fully agent‑generated system remains uncertain. The team continues to explore where human judgment adds the most value and how to encode that judgment into the system.

Ultimately, software construction still requires discipline, but the discipline now resides more in supporting structures—tools, abstractions, and feedback loops—than in the code itself.

Tags: code generation, automation, AI coding, software engineering, continuous integration, agent-based development
Written by Qborfy AI

A knowledge base that logs daily experiences and learning journeys, sharing them with you to grow together.