How We Built a Full‑Scale Product Using Only Codex‑Generated Code

Over five months, the team built an internally used product from an empty Git repository, writing every line of application logic, tests, CI configuration, documentation, and tooling with OpenAI's Codex. The effort was roughly one-tenth that of manual coding, and the process surfaced new engineering roles and workflows.


Starting from an Empty Repository

The first commit was made in late August 2025. The initial scaffold—including repository layout, CI setup, formatting rules, package manager configuration and application framework—was generated by Codex CLI using GPT‑5 guided by a small set of existing templates. Even the AGENTS.md file that tells the agents how to work was written by Codex. No human‑written code existed at the start.

After five months the repository had grown to about one million lines of code covering application logic, infrastructure, tools, documentation and internal developer utilities. A three-person engineering team opened and merged approximately 1,500 pull requests, averaging 3.5 PRs per engineer per day. When the team expanded to seven engineers, throughput increased further. The product is now used daily by hundreds of internal users.

Redefining the Engineer’s Role

Without manual coding, engineers focused on system design, architecture and leverage. Early progress lagged not because Codex lacked ability, but because the environment lacked clear specifications, tools, abstractions and internal structure. Engineers broke high-level goals into smaller units of work (design, code, review, test), prompted the agent to build each one, and used the resulting artifacts to unlock more complex tasks.

Human interaction was almost entirely through prompts: engineers described tasks, ran the agent, and let it open a pull request. To push a PR to completion, they instructed Codex to self-review, request additional agent reviews, respond to feedback, and repeat until all reviewers were satisfied, creating an "infinite review loop". Humans could review PRs but were not required to; most review work was handled agent-to-agent. A minimal sketch of such a loop appears below.
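
As a rough illustration only: the sketch below assumes the Codex CLI's non-interactive `codex exec` mode, and `fetchReviews` is a hypothetical stand-in for the code host's review API; the article does not describe the team's actual tooling at this level of detail.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const sh = promisify(execFile);

type Review = { reviewer: string; approved: boolean; comments: string[] };

// Assumption: the Codex CLI's non-interactive mode takes a prompt string.
async function runCodex(prompt: string): Promise<void> {
  await sh("codex", ["exec", prompt]);
}

// Hypothetical stand-in: in practice this would query GitHub/GitLab reviews.
async function fetchReviews(prNumber: number): Promise<Review[]> {
  return [];
}

// The "infinite review loop": self-review, gather agent reviews, address
// feedback, and repeat until every reviewer has approved.
async function reviewLoop(prNumber: number): Promise<void> {
  await runCodex(`Self-review PR #${prNumber} and fix any issues you find.`);
  for (;;) {
    await runCodex(`Request agent reviews on PR #${prNumber}.`);
    const pending = (await fetchReviews(prNumber)).filter((r) => !r.approved);
    if (pending.length === 0) return; // all reviewers satisfied
    const feedback = pending.flatMap((r) => r.comments).join("\n");
    await runCodex(`Address this review feedback on PR #${prNumber}:\n${feedback}`);
  }
}
```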

Improving Application Readability for the Agent

As code throughput grew, the bottleneck became human QA capacity. The team made the UI, logs and metrics directly readable by Codex. For example, the application can be started from a git worktree, letting Codex launch an isolated instance for each change. Chrome DevTools Protocol was integrated into the runtime, and skills were added for DOM snapshots, screenshots and navigation, allowing Codex to reproduce errors, verify fixes and reason about UI behavior.
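
A minimal sketch of that kind of integration, using the Chrome DevTools Protocol via the `chrome-remote-interface` package; the debugging port and app URL are assumptions, not the team's setup:

```typescript
import CDP from "chrome-remote-interface";
import { writeFile } from "node:fs/promises";

// Assumes Chrome was launched with --remote-debugging-port=9222 and the
// app instance under test is served at the (hypothetical) URL below.
async function snapshotApp(url = "http://localhost:3000"): Promise<void> {
  const client = await CDP({ port: 9222 });
  const { Page, DOMSnapshot } = client;
  try {
    await Page.enable();
    await Page.navigate({ url });
    await Page.loadEventFired();

    // Screenshot: lets the agent "see" the rendered UI.
    const { data } = await Page.captureScreenshot({ format: "png" });
    await writeFile("app.png", Buffer.from(data, "base64"));

    // DOM snapshot: a machine-readable view of structure and visibility.
    const snapshot = await DOMSnapshot.captureSnapshot({
      computedStyles: ["display", "visibility"],
    });
    await writeFile("dom-snapshot.json", JSON.stringify(snapshot));
  } finally {
    await client.close();
  }
}
```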

Observability tools were treated similarly: logs, metrics and traces are presented to Codex via a temporary local observability stack that disappears after a task finishes. Codex can query logs with LogQL and metrics with PromQL, enabling prompts such as "ensure service starts within 800 ms" or "no span in the four key user journeys exceeds two seconds".
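
Concretely, prompts like those can compile down to checks of the following shape. The Loki and Prometheus HTTP endpoints are the standard ones, but the ports, labels and metric name are illustrative assumptions:

```typescript
const LOKI = "http://localhost:3100"; // assumed local, task-scoped stack
const PROM = "http://localhost:9090";

async function checkJourneySpans(): Promise<void> {
  // LogQL: pull the service's recent startup lines (labels are illustrative).
  const end = Date.now() * 1e6; // Loki timestamps are in nanoseconds
  const start = end - 5 * 60 * 1e9;
  const logs = await fetch(
    `${LOKI}/loki/api/v1/query_range?` +
      new URLSearchParams({
        query: `{service="app"} |= "started in"`,
        start: `${start}`,
        end: `${end}`,
      })
  ).then((r) => r.json());
  console.log("startup log streams:", logs.data.result.length);

  // PromQL: slowest span across the four key journeys (metric name assumed).
  const metrics = await fetch(
    `${PROM}/api/v1/query?` +
      new URLSearchParams({
        query: `max(trace_span_duration_seconds{journey=~"signup|login|search|checkout"})`,
      })
  ).then((r) => r.json());
  const maxSeconds = Number(metrics.data.result[0]?.value?.[1] ?? 0);
  if (maxSeconds > 2) {
    throw new Error(`slowest journey span is ${maxSeconds}s, exceeding 2s`);
  }
}
```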

Using the Repository as a Knowledge Base

Context management proved to be a major challenge. A single large AGENTS.md file was ineffective: it consumed scarce context, became stale, and was hard to verify. The team switched to a short, ~100‑line AGENTS.md that acted as a map pointing to deeper information stored in a structured docs/ directory.

```
AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND-ARCHITECTURE.md
└── ...
```

Design documents are indexed with validation status and core principles for agent‑first operation. Architecture documents provide top‑level domain and package layering maps. Plans are version‑controlled artifacts that guide agents without external context.

Enforcing Architecture and Taste

To keep a fully agent‑generated codebase coherent, the team enforced invariants via custom linters and structural tests rather than micromanaging implementation. For example, Codex must parse data shapes at boundaries but may choose any library (Zod was preferred but not mandated).
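
For instance, a boundary parse with Zod (the schema itself is illustrative, not taken from the team's codebase) looks like this:

```typescript
import { z } from "zod";

// Every payload crossing a domain boundary is parsed, never blindly cast,
// so malformed data fails loudly at the edge rather than deep inside.
const OnboardingEvent = z.object({
  userId: z.string().uuid(),
  step: z.enum(["invited", "activated", "completed"]),
  occurredAt: z.coerce.date(),
});
type OnboardingEvent = z.infer<typeof OnboardingEvent>;

export function parseOnboardingEvent(raw: unknown): OnboardingEvent {
  // Throws a descriptive ZodError on shape mismatch. The enforced invariant
  // is that this parse happens at the boundary, not which library performs it.
  return OnboardingEvent.parse(raw);
}
```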

Each business domain follows a strict layer order (Types → Config → Repo → Service → Runtime → UI) with cross‑cutting concerns (auth, connectors, telemetry, feature flags) entering through a single Providers interface. Violations are automatically rejected by the linter, turning architectural rules into speed multipliers for agents.
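
A structural test enforcing that layer order might look like the sketch below; the `src/<domain>/<layer>/` file convention and the import-matching regex are assumptions about how such a rule could be written, not the team's actual linter:

```typescript
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Lowest to highest: a layer may import from itself or from lower layers only.
const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"];

function checkDomain(domainDir: string): string[] {
  const violations: string[] = [];
  for (const [rank, layer] of LAYERS.entries()) {
    for (const file of walk(join(domainDir, layer))) {
      const source = readFileSync(file, "utf8");
      // Assumed import style: ... from "<domain>/<layer>/<module>".
      for (const m of source.matchAll(/from\s+"[^"]*\/(\w+)\//g)) {
        if (LAYERS.indexOf(m[1]) > rank) {
          violations.push(`${file}: ${layer} must not import from ${m[1]}`);
        }
      }
    }
  }
  return violations;
}

function walk(dir: string): string[] {
  try {
    return readdirSync(dir, { withFileTypes: true }).flatMap((e) =>
      e.isDirectory() ? walk(join(dir, e.name)) : [join(dir, e.name)]
    );
  } catch {
    return []; // not every domain has every layer
  }
}
```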

Throughput Changes Merge Philosophy

High agent throughput made traditional merge gates impractical. Pull‑request lifecycles are short; flaky tests are rerun rather than blocking progress. In this regime, error‑correction cost is low while waiting cost is high, justifying a more aggressive merge policy.

What "Agent‑Generated" Really Means

All artifacts—product code, tests, CI pipelines, internal tools, documentation, design history, evaluation tools, review comments, repository scripts, and dashboard definitions—are produced by Codex. Humans remain in the loop for higher‑level intent, feedback handling, and occasional judgment calls.

Increasing Autonomy

As more development steps (testing, verification, review, feedback handling, recovery) were encoded, the repository crossed a threshold where Codex could drive an end‑to‑end feature. Given a prompt, the agent can verify repository state, reproduce a bug, record a failure video, implement a fix, validate it by running the app, record a solution video, open a PR, respond to feedback, fix build failures, involve humans only when necessary, and finally merge the change.

Entropy and Garbage Collection

Fully autonomous agents also reproduce existing suboptimal patterns, leading to drift. Initially, humans spent about 20 % of each Friday cleaning up "AI residue". The team encoded a set of "golden principles" into the repo and built a cyclic cleanup process: mechanical rules scan for drift, update quality grades, and open targeted refactor PRs, most of which are reviewed and merged within a minute.
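
To make the mechanical side concrete, a drift rule might look like the following sketch; the specific rule (routing cross-cutting imports through Providers), the grading scheme and all names here are hypothetical:

```typescript
import { readFileSync } from "node:fs";

type Finding = { file: string; rule: string; detail: string };

// Hypothetical rule: cross-cutting concerns must enter through Providers,
// so direct imports of these modules count as drift.
const DIRECT_IMPORTS = /from\s+"[^"]*\/(auth|telemetry|feature-flags)\//g;

function scanForDrift(files: string[]): Finding[] {
  const findings: Finding[] = [];
  for (const file of files) {
    for (const m of readFileSync(file, "utf8").matchAll(DIRECT_IMPORTS)) {
      findings.push({
        file,
        rule: "providers-only",
        detail: `direct import of ${m[1]}; route it through Providers`,
      });
    }
  }
  return findings;
}

// Each cycle updates a quality grade; past a threshold, a targeted refactor
// PR would be opened for the agent to execute (not shown here).
function grade(findings: Finding[]): "A" | "B" | "C" {
  return findings.length === 0 ? "A" : findings.length < 5 ? "B" : "C";
}
```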

Open Questions

The approach has worked well internally at OpenAI, but the long‑term evolution of architectural coherence in a wholly agent‑generated system remains unknown. The team is still learning where human judgment adds the most value and how to encode it, as well as how the system will evolve as model capabilities improve.

Discipline now resides more in supporting structures—tools, abstractions, feedback loops—than in hand‑written code. The biggest current challenges are designing environments, feedback mechanisms and control systems that enable agents to build and maintain large, reliable software at scale.
