Can AI Agents Build a Million‑Line Codebase at One‑Tenth the Effort?
The article details how a three‑engineer team used OpenAI's Codex agents to generate an entire production‑ready software stack—including over a million lines of code, 1,500 pull requests, and a full CI/CD pipeline—in roughly one‑tenth the effort of manual coding, while describing the architectural, operational, and organizational adjustments required for such agent‑first development.
Over five months the team built and shipped an internal beta of a software product without a single line of human‑written code. Codex generated everything, from application logic, tests, CI configuration, documentation, and observability to internal developer tools, at an estimated one‑tenth the effort of traditional development.
Starting from an Empty Repository
The first commit appeared in late August 2025. Using a small set of existing templates, Codex CLI (powered by GPT‑5) generated the repository layout, CI settings, formatting rules, package manager configuration, and the application framework. Even the initial AGENTS.md file, which describes how agents should operate in the repo, was produced by Codex.
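The article doesn't reproduce that first AGENTS.md, but a hypothetical sketch shows the kind of operating instructions such a file carries; the commands and conventions below are assumptions, not the team's actual setup:

```markdown
# AGENTS.md

## Working in this repo
- Install with `pnpm install`; validate changes with `pnpm lint && pnpm test`.
- Formatting is enforced in CI: run `pnpm format`, never hand-format.
- Every change ships with tests and a green `pnpm build` before opening a PR.
- Scaffold new packages from the templates under `templates/`.
```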
Within five months the repository grew to about one million lines of code, covering product code, tests, CI pipelines, internal tools, and documentation. Approximately 1,500 pull requests were opened and merged, driven by a three‑person engineering team, yielding an average throughput of 3.5 PRs per engineer per day. When the team expanded to seven engineers, throughput continued to increase.
Engineers’ Role Shift
Because no code was written manually, engineers focused on system design, architecture, and leverage points. Early progress lagged not due to Codex’s capabilities but because the environment lacked clear specifications, tooling, and abstractions. The team’s primary task became helping agents accomplish useful work by breaking high‑level goals into smaller modules (design, code, review, test) and prompting agents to build each piece.
When agents stalled, engineers asked, “What capability is still missing, and how can we make it clear and enforceable for the agent?” Interaction was almost entirely through prompts: describing a task, letting the agent open a PR, having the agent self‑review, requesting additional agent reviews, and iterating until the PR satisfied all reviewers—a process the authors call the Ralph Wiggum loop [3].
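The article describes the loop only in prose; a hypothetical TypeScript sketch makes its shape concrete. Every function here is an illustrative stub, not a real Codex or GitHub API:

```typescript
// Hypothetical sketch of the "Ralph Wiggum loop": describe a task, let the
// agent open a PR, gather agent reviews, and iterate until everyone approves.
interface Review { approved: boolean; comments: string[] }

async function runCodexTask(prompt: string): Promise<void> {
  console.log(`[codex] working on: ${prompt.split("\n")[0]}`);
}

async function openPullRequest(branch: string): Promise<number> {
  console.log(`[github] opened PR from ${branch}`);
  return 1234; // placeholder PR number
}

let round = 0;
async function requestAgentReviews(pr: number): Promise<Review[]> {
  // Simulated reviewers: block once, then approve, so this demo terminates.
  round += 1;
  return round < 2
    ? [{ approved: false, comments: ["Add a regression test for the new path."] }]
    : [{ approved: true, comments: [] }];
}

async function ralphWiggumLoop(task: string, branch: string): Promise<number> {
  await runCodexTask(`${task}\nSelf-review the diff before finishing.`);
  const pr = await openPullRequest(branch);
  for (;;) {
    const reviews = await requestAgentReviews(pr);
    if (reviews.every((r) => r.approved)) return pr; // all reviewers satisfied
    const feedback = reviews.flatMap((r) => r.comments).join("\n");
    await runCodexTask(`Address review feedback on PR #${pr}:\n${feedback}`);
  }
}

ralphWiggumLoop("Fix flaky session-timeout test", "agent/session-timeout").then(
  (pr) => console.log(`PR #${pr} approved by all reviewers`)
);
```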
Making the Application Agent‑Readable
As agent throughput grew, human QA became the bottleneck. The team exposed UI, logs, and metrics in a form directly consumable by Codex. By integrating the Chrome DevTools Protocol, agents could capture DOM snapshots, screenshots, and navigation, allowing them to reproduce bugs, verify fixes, and reason about UI behavior.
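The article doesn't name the tooling behind that integration. A minimal sketch of the idea using Playwright, which drives Chromium over the Chrome DevTools Protocol; the URL and output paths are assumptions:

```typescript
// Minimal sketch: capture a screenshot plus a structured DOM snapshot over
// the Chrome DevTools Protocol so an agent can inspect UI state.
import { chromium } from "playwright";
import { writeFileSync } from "node:fs";

async function captureUiState(url: string): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Screenshot the agent can attach to a bug report or PR.
  await page.screenshot({ path: "ui-state.png", fullPage: true });

  // Raw CDP session for a DOM snapshot the agent can reason over as data.
  const cdp = await page.context().newCDPSession(page);
  const snapshot = await cdp.send("DOMSnapshot.captureSnapshot", {
    computedStyles: ["display", "visibility"],
  });
  writeFileSync("dom-snapshot.json", JSON.stringify(snapshot));

  await browser.close();
}

captureUiState("http://localhost:3000");
```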
The observability stack (logs, metrics, traces) was presented to agents via local LogQL and PromQL queries, enabling prompts such as “ensure service starts within 800 ms” or “no user journey span exceeds two seconds.” Individual Codex runs on a single task often lasted over six hours, typically spanning a human’s sleep period.
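Such prompts bottom out in ordinary queries an agent can run and assert on. A hypothetical sketch against the Prometheus HTTP API, with the metric name and endpoint as assumptions (a LogQL check against Loki would look analogous):

```typescript
// Hypothetical check behind "ensure service starts within 800 ms": query a
// local Prometheus for the startup-latency p99 and fail if it exceeds budget.
const PROM_URL = "http://localhost:9090/api/v1/query";
const QUERY =
  'histogram_quantile(0.99, sum(rate(service_startup_duration_seconds_bucket[15m])) by (le))';

async function checkStartupBudget(budgetSeconds: number): Promise<void> {
  const res = await fetch(`${PROM_URL}?query=${encodeURIComponent(QUERY)}`);
  const body: any = await res.json();
  // Instant-vector result shape: [{ metric: {...}, value: [timestamp, "0.42"] }]
  const p99 = parseFloat(body.data.result[0]?.value[1] ?? "NaN");
  if (!(p99 <= budgetSeconds)) {
    throw new Error(`startup p99 is ${p99}s, over the ${budgetSeconds}s budget`);
  }
  console.log(`startup p99 ${p99}s is within the ${budgetSeconds}s budget`);
}

checkStartupBudget(0.8);
```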
Turning the Repository into a Knowledge System
Instead of a monolithic AGENTS.md, the team treated it as a table of contents pointing to a structured docs/ directory. The hierarchy includes design docs, execution plans, generated schema, product specs, and references. This progressive disclosure lets agents start from a small, stable entry point and navigate deeper as needed.
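Concretely, the layout might look something like this; the directory names are assumptions, since the article lists the categories but not the exact structure:

```text
AGENTS.md            <- small, stable entry point: a table of contents into docs/
docs/
  design/            <- design docs
  plans/             <- execution plans
  schema/            <- generated schema
  product/           <- product specs
  reference/         <- references
```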
Automated linters and CI jobs verify that the knowledge base stays up‑to‑date, cross‑linked, and correctly structured. A dedicated “doc‑gardening” agent scans for stale documentation, opens corrective PRs, and ensures the repository reflects the current codebase.
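One cheap, mechanically checkable slice of that upkeep is dead-link detection. A hypothetical doc-lint in TypeScript, assuming markdown docs with relative links under docs/:

```typescript
// Hypothetical doc-lint: fail CI if any markdown doc links to a file that no
// longer exists, one concrete way to keep the knowledge base cross-linked.
import { existsSync, readFileSync, readdirSync, statSync } from "node:fs";
import { dirname, join, resolve } from "node:path";

function* markdownFiles(dir: string): Generator<string> {
  for (const entry of readdirSync(dir)) {
    const full = join(dir, entry);
    if (statSync(full).isDirectory()) yield* markdownFiles(full);
    else if (full.endsWith(".md")) yield full;
  }
}

const broken: string[] = [];
for (const doc of markdownFiles("docs")) {
  // Match relative markdown links like [text](../design/auth.md), skipping URLs.
  for (const m of readFileSync(doc, "utf8").matchAll(/\]\((?!https?:)([^)#]+)/g)) {
    const target = resolve(dirname(doc), m[1]);
    if (!existsSync(target)) broken.push(`${doc} -> ${m[1]}`);
  }
}

if (broken.length > 0) {
  console.error(`Stale links:\n${broken.join("\n")}`);
  process.exit(1);
}
```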
Strict Architecture and “Taste”
The team enforced invariants rather than micromanaging implementation. For example, agents must resolve data shapes at boundaries but are not forced to use a specific library (they often prefer Zod). The architecture imposes a strict layer order (Types → Config → Repo → Service → Runtime → UI) with a single explicit interface for cross‑cutting concerns called Providers. Custom linters (generated by Codex) and structural tests mechanically enforce these constraints.
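The article doesn't show the Providers interface itself; a hypothetical sketch conveys the intent, with all names invented for illustration:

```typescript
// Hypothetical sketch of the Providers pattern: every cross-cutting concern
// (logging, time, feature flags) flows through one explicit, typed interface.
export interface Providers {
  logger: { info(msg: string, fields?: Record<string, unknown>): void };
  clock: { now(): Date };
  flags: { isEnabled(flag: string): boolean };
}

// A Service-layer function receives Providers explicitly; it may depend on
// Types/Config/Repo below it, never on Runtime or UI above it.
export async function archiveStaleSessions(p: Providers): Promise<void> {
  if (!p.flags.isEnabled("session-archival")) return;
  p.logger.info("archiving stale sessions", { at: p.clock.now().toISOString() });
  // ... Repo-layer calls would go here ...
}
```

Because the seam is a single typed interface, a generated linter can simply forbid lower layers from importing loggers, clocks, or flag clients directly.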
This disciplined architecture, usually only adopted after a team reaches hundreds of engineers, is presented as an early prerequisite for agent‑first development: constraints keep speed high and prevent architectural drift.
Throughput Changes Merge Philosophy
Higher agent throughput rendered many traditional engineering safeguards ineffective. PR lifecycles are short; occasional test failures are resolved by re‑runs rather than blocking progress. In an environment where agents can process far more changes than humans can review, the cost of errors is low, but the cost of waiting is high, so the team prefers rapid iteration.
What “Agent‑Generated” Means
- Product code and tests
- CI configuration and release tooling
- Internal developer tools
- Documentation and design history
- Review comments and replies
- Scripts that manage the repository itself
- Production dashboard definition files
Humans remain involved, primarily translating user feedback into acceptance criteria and validating outcomes. When agents encounter a roadblock, the team treats it as a signal to add missing tools, guidance, or constraints, which the agents then implement themselves.
Increasing Autonomy
Eventually Codex achieved end‑to‑end control of a change, from report to merge, following a ten‑step workflow:
1. Validate the current repository state
2. Reproduce a reported bug
3. Record a video of the failure
4. Implement a fix
5. Verify the fix by running the application
6. Record a second video demonstrating the solution
7. Open a pull request
8. Respond to agent and human feedback
9. Detect and fix build failures
10. Merge the change, deferring to humans only when judgment is required
The authors caution that this level of autonomy depends heavily on the repository’s specific structure and tooling and should not be assumed to generalize without similar investment.
Entropy and Garbage Collection
Fully autonomous agents eventually reproduce suboptimal patterns, causing drift. Initially, humans manually cleaned “AI residue” weekly, but this proved unscalable. The team encoded “golden principles” into the repo and introduced a continuous cleanup loop: shared utility packages replace ad‑hoc helpers, and type‑safe SDKs replace YOLO‑style data detection.
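A before-and-after sketch of what one such cleanup replaces; this is illustrative code, not from the article, and Zod appears in the "after" only because the article notes agents often prefer it:

```typescript
import { z } from "zod";

// Before: an ad-hoc helper that YOLO-reads unknown data. It works until the
// payload shape drifts, then fails deep inside the UI.
function getUserNameUnsafe(payload: unknown): string {
  return (payload as any).user.name;
}

// After: a shared, type-safe SDK function backed by a schema. Shape drift now
// fails loudly at the boundary with a precise error.
const UserPayload = z.object({ user: z.object({ name: z.string() }) });

export function getUserName(payload: unknown): string {
  return UserPayload.parse(payload).user.name;
}
```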
Background Codex tasks periodically scan for deviations, update quality grades, and open targeted refactor PRs—most of which are reviewed and merged automatically within a minute. This process resembles garbage collection, treating technical debt as high‑interest loans that are repaid incrementally.
Ongoing Learnings
While the strategy has worked well internally at OpenAI, the long‑term evolution of architectural coherence in a fully agent‑generated system remains uncertain. The team continues to explore where human judgment adds the most value and how to encode that judgment into the system.
Ultimately, software construction still requires discipline, but the discipline now resides more in supporting structures—tools, abstractions, and feedback loops—than in the code itself.