How Evo AI Created the World’s First Fully‑Designed Phage Genome
Researchers at the Arc Institute and Stanford used their Evo genome language models to design entire viral genomes, producing the first AI‑generated ΦX174‑like phage genomes. This summary covers the annotation pipeline, fine‑tuning, experimental validation, and evolutionary insights.
Background
The Arc Institute and Stanford University released Evo 1 in 2024 and Evo 2 in early 2025. Evo 2 is a generative AI model trained on genomic sequence that can model and design DNA, RNA and protein sequences across all domains of life, extending the capabilities of the earlier Evo 1 model.
Design Target: Bacteriophage ΦX174
ΦX174 is a 5,386‑nt single‑stranded DNA phage that encodes 11 overlapping genes. Its compact size keeps DNA synthesis affordable, while the overlapping‑gene architecture provides a stringent test of genome‑scale design: a single nucleotide change can alter more than one protein. Historically it was the first DNA genome to be sequenced (1977) and one of the first genomes to be chemically synthesized (2003), making it an ideal benchmark for AI‑driven genome engineering.
Custom Gene‑Annotation Pipeline
Standard gene‑prediction tools identify only 7 of the 11 overlapping genes in ΦX174. To evaluate thousands of AI‑generated sequences, the authors built a pipeline that (1) scans all open reading frames, (2) performs homology searches against a curated phage protein database, and (3) retains any ORF that matches at least one known ΦX174 protein. This approach recovers all 11 genes and imposes a quality filter requiring a minimum of seven native ΦX174 proteins in each candidate genome.
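The ORF scan and the final quality filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `min_codons` cutoff, and the treatment of the genome as a linear sequence are assumptions, and the homology search (step 2) is left to an external tool whose result is represented here as a set of matched protein names.

```python
from typing import List, Set, Tuple

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# The 11 proteins encoded by wild-type PhiX174.
PHIX174_PROTEINS = {"A", "A*", "B", "C", "D", "E", "F", "G", "H", "J", "K"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[str, int, int, int]]:
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based on
    the scanned strand; `end` is exclusive and includes the stop codon.
    Only the first ATG after the previous stop is reported per ORF.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


def passes_quality_filter(matched: Set[str], min_native: int = 7) -> bool:
    """Keep a candidate genome only if enough native proteins were matched.

    `matched` holds the protein names returned by a homology search
    (hypothetical input; the search itself is not implemented here).
    """
    return len(matched & PHIX174_PROTEINS) >= min_native
```

Because each frame is scanned independently, overlapping genes in different frames come out as separate ORFs. A gene that begins at an internal in-frame ATG (as A* does within A) would require reporting every start codon, which this sketch does not do.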
Fine‑Tuning Evo for Phage Generation
Although the base Evo model was pre‑trained on more than 2 million viral genomes, it lacked the precision to produce ΦX174‑like genomes. The team therefore performed supervised fine‑tuning on 14,466 Microviridae sequences (the phage family that includes ΦX174), enabling the model to generate variants that closely resemble ΦX174 while preserving functional constraints.
Genome Generation and Multi‑Tier Screening
Thousands of candidate genomes were sampled from the fine‑tuned model and passed through a three‑tier screening system:
Sequence quality: verification of a complete ORF set and the absence of frameshifts.
Host specificity: prediction that the phage can infect the non‑pathogenic E. coli C strain but not the other bacterial strains tested.
Evolutionary novelty: measurement of nucleotide divergence from known ΦX174 isolates, to ensure that candidates explore new sequence space.
Candidates that satisfied all three criteria, combining predicted target‑host specificity with significant divergence from natural phages, advanced to synthesis and experimental testing.
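The three tiers can be pictured as successive filters over a list of candidates. The sketch below is an assumption-laden illustration: the `Candidate` fields, the `screen` function, and the 95% identity cutoff are hypothetical and are not taken from the study, and `percent_identity` expects two sequences that have already been aligned to equal length with `-` for gaps.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Candidate:
    name: str
    has_complete_orfs: bool          # tier 1: sequence quality
    predicted_hosts: FrozenSet[str]  # tier 2: host specificity
    identity_to_nearest: float       # tier 3: percent identity to closest natural genome


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns; a gap opposite a base counts as a mismatch."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # column carries no information
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def screen(candidates: List[Candidate],
           target_host: str = "E. coli C",
           max_identity: float = 95.0) -> List[Candidate]:
    """Keep candidates that pass all three tiers (the cutoff is a hypothetical value)."""
    return [
        c for c in candidates
        if c.has_complete_orfs
        and c.predicted_hosts == frozenset({target_host})
        and c.identity_to_nearest < max_identity
    ]
```

Applying the cheapest check first keeps the more expensive host and divergence evaluations to a smaller pool, although the order does not change which candidates survive.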
Experimental Validation Workflow
Each candidate genome was synthesized by Gibson assembly, transformed into competent E. coli C cells, and screened in 96‑well plates using a growth‑inhibition assay. Optical density at 600 nm (OD₆₀₀) was monitored for 2–3 hours; a rapid decline indicated successful infection and lysis. Of the 285 assembled genomes, 16 caused growth inhibition. These 16 were confirmed by sequencing, amplified, and subjected to further phenotypic characterization.
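One simple way to turn an OD₆₀₀ time series into a lysis call is to look for a sharp drop from the running peak. The sketch below is only an illustration of that idea: the function name and the 30% drop threshold are assumptions and are not parameters reported by the study.

```python
from typing import Sequence


def shows_lysis(od_series: Sequence[float], min_drop_fraction: float = 0.3) -> bool:
    """Return True if OD600 falls at least `min_drop_fraction` below its running peak."""
    if not od_series:
        return False
    peak = od_series[0]
    for od in od_series[1:]:
        peak = max(peak, od)
        if peak > 0 and (peak - od) / peak >= min_drop_fraction:
            return True
    return False


# Hit rate implied by the counts given in the text.
assembled, functional = 285, 16
print(f"hit rate: {100 * functional / assembled:.1f}%")  # hit rate: 5.6%
```

A well whose culture keeps growing never triggers the call, while one that grows and then clears does. In practice a threshold like this would be calibrated against wild-type ΦX174 and no-phage control wells.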
Key Findings
Functional genomes carried 67–392 novel mutations relative to their closest natural counterpart.
Design Evo‑Φ2147 contained 392 mutations and shared 93% nucleotide identity with the NC51 phage, meeting criteria for a new species.
Thirteen genomes possessed mutations absent from any known natural sequence, demonstrating the model’s ability to explore previously unsampled evolutionary space.
Design Evo‑Φ36 incorporated the DNA‑packaging J protein from the distant G4 phage. Cryo‑EM revealed a distinct orientation of the shorter G4 J protein within the capsid, illustrating coordinated compensatory mutations that preserve functionality.
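As a rough consistency check on the figures above, mutation counts can be converted into approximate nucleotide identity. The sketch assumes that every mutation is a single‑nucleotide substitution and that the compared genomes are about as long as wild‑type ΦX174 (5,386 nt); neither assumption comes from the study, so the results are only indicative.

```python
PHIX174_LENGTH = 5386  # wild-type genome length in nucleotides


def approx_identity(mutations: int, genome_length: int = PHIX174_LENGTH) -> float:
    """Approximate percent identity, treating each mutation as one substituted base."""
    if genome_length <= 0 or not 0 <= mutations <= genome_length:
        raise ValueError("mutations must lie between 0 and genome_length")
    return 100.0 * (1 - mutations / genome_length)


for n in (67, 392):
    print(f"{n} mutations -> ~{approx_identity(n):.1f}% identity")
# 67 mutations -> ~98.8% identity
# 392 mutations -> ~92.7% identity
```

Under these assumptions, 392 mutations correspond to roughly 92.7% identity, in line with the reported 93% for Evo‑Φ2147.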
Conclusions
The study shows that genome‑scale language models, when fine‑tuned and guided by rigorous annotation and screening pipelines, can capture evolutionary constraints and enable the design of functional whole genomes. As model capabilities improve and DNA synthesis becomes cheaper, AI‑driven genome engineering is poised to expand biotechnological applications and fundamental research.
Preprint: https://www.biorxiv.org/content/10.1101/2025.09.12.675911v1
Source: ScienceAI
From reading genomes to writing them, and now to designing them, biological research is opening a new chapter.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.