Can AI Generate Full Repositories from a README? Inside Microsoft’s RepoGenesis Benchmark

RepoGenesis, a new ACL 2026 benchmark introduced by Microsoft Research, evaluates whether large‑language‑model agents can turn a structured README into a complete, deployable microservice repository, measuring Pass@1, API coverage and deployment success across 106 Python and Java projects.


Background and Goal

Large‑language‑model (LLM) code generation has progressed from writing isolated snippets to tackling real‑world engineering tasks. The most challenging step is the "0‑to‑1" problem: given a requirement document (README), can an AI produce a fully functional, deployable codebase?

RepoGenesis Benchmark

Microsoft Research proposes RepoGenesis, the first end‑to‑end benchmark for multi‑language, repository‑level microservice generation. The benchmark supplies a structured README as input and expects the model or agent to output an entire repository—including source files, configuration, and dependency declarations—that passes a black‑box test suite.
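
To give a feel for the black‑box protocol, the snippet below sketches the kind of HTTP‑level check a test suite might run against a generated service. The endpoint paths, port, and payload are hypothetical placeholders; RepoGenesis ships its own test suite per repository.

```python
# Sketch of a black-box check against a generated microservice assumed
# to be already running at BASE_URL. All names below are illustrative,
# not taken from the benchmark itself.
import requests

BASE_URL = "http://localhost:8000"  # hypothetical deployment address

def test_create_then_fetch():
    # The harness never inspects the generated source, only the API surface.
    created = requests.post(f"{BASE_URL}/items", json={"name": "demo"})
    assert created.status_code == 201
    item_id = created.json()["id"]

    fetched = requests.get(f"{BASE_URL}/items/{item_id}")
    assert fetched.status_code == 200
    assert fetched.json()["name"] == "demo"
```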

The dataset comprises 106 repositories (60 Python, 46 Java) spanning 18 domains and 11 frameworks, with a total of 1,258 APIs and 2,335 test cases. A Verified subset of 30 repositories (6 real GitHub projects + 24 expert‑curated) is used for evaluation, while a Train subset of 76 repositories provides training and trajectory‑distillation data.

Evaluation Protocol

Three metrics are reported simultaneously; a toy aggregation sketch follows the list:

Pass@1: functional correctness under black‑box testing.

API Coverage (AC): proportion of required interfaces implemented.

Deployment Success Rate (DSR): ability of the generated repository to be built and run.
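
To make the three definitions concrete, here is a toy aggregation over per‑repository results. The field names, and the reading of Pass@1 as "the single generated attempt passes the entire suite," are assumptions for illustration; the paper's exact definitions govern.

```python
# Toy aggregation of the three RepoGenesis-style metrics over
# per-repository results. Schema is assumed, not the paper's.
def aggregate(results):
    """results: list of dicts, one per generated repository, e.g.
    {"passed_tests": 10, "total_tests": 12,
     "implemented_apis": 9, "required_apis": 11,
     "deployed": True}
    """
    n = len(results)
    # Pass@1 (one plausible reading): the single attempt passes all tests.
    pass_at_1 = sum(r["passed_tests"] == r["total_tests"] for r in results) / n
    # API Coverage: mean fraction of required interfaces implemented.
    ac = sum(r["implemented_apis"] / r["required_apis"] for r in results) / n
    # Deployment Success Rate: the repository builds and starts.
    dsr = sum(r["deployed"] for r in results) / n
    return {"Pass@1": pass_at_1, "AC": ac, "DSR": dsr}

print(aggregate([
    {"passed_tests": 12, "total_tests": 12,
     "implemented_apis": 11, "required_apis": 11, "deployed": True},
    {"passed_tests": 7, "total_tests": 12,
     "implemented_apis": 8, "required_apis": 11, "deployed": False},
]))  # {'Pass@1': 0.5, 'AC': ~0.86, 'DSR': 0.5}
```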

To ensure rigorous assessment, the authors adopt an ACL‑style review‑rebuttal process: blind multi‑model evaluation, Area Chair intervention on large disagreements, and iterative refinement until a predefined inter‑rater agreement threshold (Krippendorff's α ≈ 0.69) is reached.
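
For readers unfamiliar with the agreement statistic, the sketch below computes Krippendorff's α for nominal labels from scratch. It is a generic reference implementation, not the authors' evaluation code.

```python
# Minimal Krippendorff's alpha for nominal labels: alpha = 1 - D_o/D_e,
# built from the standard coincidence-matrix construction.
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of labels per item, holding whatever ratings the
    different reviewers assigned (missing ratings simply omitted)."""
    coincidence = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating contributes no pairable evidence
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)
    n_c = Counter()  # per-label marginal totals
    for (a, _b), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in coincidence.items() if a != b) / n
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Three reviewers labelling three repositories:
print(krippendorff_alpha_nominal(
    [["ok", "ok", "fail"], ["ok", "ok"], ["fail", "fail", "fail"]]
))  # 0.5625
```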

Results

Compared with existing code‑generation benchmarks (HumanEval, SWE‑Bench, ClassEval), RepoGenesis expands the evaluation scope to the repository level and focuses on REST‑style web microservices in Python and Java.

Key findings include:

Maximum API coverage reaches ~73.91%.

Deployment success can hit 100% under certain IDE‑model configurations.

Even the strongest systems (Copilot + Claude) achieve only ~23.67% Pass@1 for Python and ~21.45% for Java.

Failure analysis shows three dominant error categories: cross‑file consistency (~50.2%), architectural coherence (~26.0%), and dependency management (~23.8%). Java projects exhibit a higher dependency‑related failure rate (44.7%).

GenesisAgent Extension

Building on MS‑Agent, the team created GenesisAgent, distilled from 16,396 high‑quality instruction‑following samples generated by the verified pipeline. Fine‑tuning Qwen‑3‑8B on these samples yields GenesisAgent‑8B, which matches GPT‑5 mini on the three metrics, demonstrating the benchmark's value as a training signal.
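
The article does not describe the sample schema, but trajectory distillation typically flattens an agent's tool‑use trace into chat‑style supervision. The sketch below shows one hypothetical shape for such a record; every field name here is an assumption.

```python
# Hypothetical trajectory-to-SFT conversion; the real released format
# (if any) may differ in every detail.
import json

def trajectory_to_sft_sample(readme_text, steps):
    """Flatten one agent trajectory (action/observation steps taken while
    building a repository) into a chat-style supervised sample."""
    messages = [{"role": "user", "content": readme_text}]
    for step in steps:
        messages.append({"role": "assistant", "content": step["action"]})
        messages.append({"role": "tool", "content": step["observation"]})
    return json.dumps({"messages": messages})

sample = trajectory_to_sft_sample(
    "# Inventory service\nImplement POST /items and GET /items/{id}.",
    [{"action": "write app.py with the two endpoints",
      "observation": "app.py created; server starts on :8000"}],
)
```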

Limitations

RepoGenesis focuses on REST‑style microservices written in Python or Java and assumes well‑structured READMEs; real‑world ambiguous or evolving specifications are not fully represented. The benchmark primarily measures pass/fail outcomes, leaving readability, long‑term maintainability, and engineering best practices unquantified.

Conclusion

RepoGenesis reframes code generation research by providing a reproducible, comparable, and improvable testbed for the critical "document‑to‑repository" step, encouraging the next generation of LLM agents to address full‑stack software engineering challenges.
