EDEN Models Leverage a Million Species and 100‑Billion‑Scale Gene Data to Reach SOTA Genome & Protein Prediction
The EDEN series of foundation models, trained on the massive BaseData metagenomic dataset covering over one million species and 9.7 trillion nucleotides, achieves state‑of‑the‑art genome and protein prediction while enabling functional recombinase design, antimicrobial peptide generation, and synthetic microbiome construction with minimal task‑specific data.
BaseData Dataset: A High‑Quality, Large‑Scale Genomic Benchmark
The research uses the BaseData dataset, which goes far beyond the scale of traditional biological databases, providing 9.7 trillion nucleotide tokens that cover more than 1 million new species and 100 billion new genes. Unlike indiscriminate collections, BaseData is enriched with environmental metagenomes, phages, and mobile genetic elements, capturing evolutionary signals such as phage‑host interactions and horizontal gene transfer.
Compared with the widely used OpenGenome‑2 (OG2), BaseData’s median contig length reaches 18.6 kbp (OG2: 4.0 kbp) and each assembly contains significantly more genes, offering richer long‑range context for model learning.
Scaling Experiments Demonstrate a Quality‑Aware Scaling Law
Training models of equal size on BaseData and OG2 shows that, under the same compute budget, BaseData‑trained models reduce perplexity faster, confirming a "quality‑aware scaling law": data quality shifts the loss‑versus‑compute curve, not just its endpoint. Large models (e.g., 7B parameters) fully exploit BaseData's long‑range information, ultimately outperforming their OG2‑trained counterparts.
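The claim rests on the standard power‑law picture of loss versus compute, where the exponent can be read off a log‑log fit. The sketch below is a toy illustration only; the constants and "training runs" are invented, not taken from the paper.

```python
import numpy as np

def fit_scaling_exponent(compute, loss):
    """Estimate alpha in loss ~ A * compute**(-alpha).

    On log-log axes a power law is a straight line, so a linear
    fit recovers the exponent as the negative slope.
    """
    slope, _ = np.polyfit(np.log(compute), np.log(loss), 1)
    return -slope

# Synthetic, noiseless "runs": loss falls as a power law in compute.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 5.0 * compute ** -0.05

alpha = fit_scaling_exponent(compute, loss)
print(round(alpha, 3))  # recovers the exponent 0.05
```

A higher‑quality dataset would show up in such a fit as a larger prefactor drop or a steeper exponent at the same compute, which is the comparison the BaseData‑versus‑OG2 experiment makes.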
EDEN Model Family: Architecture and Training
EDEN models span 1B to 28B parameters and adopt a decoder‑only Transformer architecture proven in large language models, specifically following the Llama 3.1 design. The flagship EDEN‑28B contains 48 layers, a hidden dimension of 6,144, 48 attention heads, SwiGLU activation, and RoPE positional encoding. Tokenization operates at single‑nucleotide resolution with a 512‑token vocabulary.
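At single‑nucleotide resolution, tokenization is essentially a per‑base lookup. The paper does not publish its 512‑entry vocabulary, so the toy encoder below (four bases plus a few assumed special tokens) only illustrates the idea; the token names and IDs are hypothetical.

```python
# Hypothetical single-nucleotide tokenizer: one token per base.
SPECIAL = ["<pad>", "<bos>", "<eos>", "<unk>"]  # assumed special tokens
BASES = ["A", "C", "G", "T"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + BASES)}

def encode(seq: str) -> list[int]:
    """Map a DNA string to token IDs, one ID per nucleotide."""
    unk = VOCAB["<unk>"]
    return [VOCAB["<bos>"]] + [VOCAB.get(b, unk) for b in seq.upper()] + [VOCAB["<eos>"]]

print(encode("ACGT"))  # [1, 4, 5, 6, 7, 2]
```

A real 512‑entry vocabulary has room for IUPAC ambiguity codes and many more special tokens, but the per‑base mapping is the same.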
Although the context window is set to 8,192 tokens, the model reliably generates and stitches together coherent genomic sequences exceeding 13,000 base pairs, preserving correct gene order, reading frames, and regulatory‑element structure. Training was performed on 1,008 NVIDIA H100 GPUs.
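The reported hyperparameters can be collected into a Llama‑style configuration object. The field names below are conventional choices for such configs, not the authors' code; the values are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass
class Eden28BConfig:
    """EDEN-28B hyperparameters as reported; field names are illustrative."""
    n_layers: int = 48
    hidden_dim: int = 6144
    n_heads: int = 48
    vocab_size: int = 512       # single-nucleotide tokens
    context_len: int = 8192
    activation: str = "swiglu"
    pos_encoding: str = "rope"

cfg = Eden28BConfig()
print(cfg.hidden_dim // cfg.n_heads)  # per-head dimension: 128
```

The 128‑dimensional heads and RoPE positions match common Llama‑family conventions, which is consistent with the stated Llama 3.1 lineage.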
Pre‑training and Fine‑tuning Paradigm
EDEN follows a "pre‑train → fine‑tune" workflow. In the first stage, the model learns universal biological design principles from BaseData, internalizing knowledge about protein folding, metabolic pathway assembly, and other design rules. In the second stage, lightweight fine‑tuning on a small, high‑quality task‑specific dataset enables rapid adaptation to distinct therapeutic design tasks.
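The value of this two‑stage workflow is that most learning happens once, on the broad corpus, and each task needs only a few adaptation steps. The numpy sketch below illustrates that logic with least‑squares gradient descent on synthetic data; it is a conceptual analogy, not the authors' training code.

```python
import numpy as np

# Toy two-stage "pre-train then fine-tune" on synthetic linear data.
# Both tasks share the same underlying weights, so a few fine-tuning
# steps from the pre-trained solution beat the same budget from scratch.
rng = np.random.default_rng(0)

def gd(w, X, y, steps, lr=0.1):
    """Full-batch gradient descent on mean squared error."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

w_true = rng.normal(size=8)
X_big = rng.normal(size=(1000, 8)); y_big = X_big @ w_true      # broad corpus
X_small = rng.normal(size=(16, 8)); y_small = X_small @ w_true  # small task set

w_pre = gd(np.zeros(8), X_big, y_big, steps=500)       # stage 1: pre-train
w_ft = gd(w_pre, X_small, y_small, steps=5)            # stage 2: fine-tune
w_scratch = gd(np.zeros(8), X_small, y_small, steps=5) # no pre-training

err = lambda w: float(np.linalg.norm(w - w_true))
print(err(w_ft) < err(w_scratch))  # True: pre-training does the heavy lifting
```

The same asymmetry is what the paper reports: thousands of GPUs for pre‑training, but only a small high‑quality dataset per therapeutic design task.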
Experimental Validation Across Four Biological Scales
AI‑Programmable Gene Insertion (aiPGI): By fine‑tuning on millions of LSR‑attachment‑site pairs, the EDEN‑LSR model generated functional large serine recombinases for ten disease‑related loci and four safe‑harbor sites. The functional hit rate reached 53.6%; 50% of the designed enzymes achieved therapeutic‑level CAR gene insertion in primary human T cells, with some variants attaining up to 40% integration efficiency.
Bridge Recombinases (BR): The EDEN‑BR model, fine‑tuned on millions of genomic regions containing bridge recombinases, produced 49 candidate sequences. Two candidates (DF3843 and DF3881) displayed clear recombinase activity despite sharing just 85% and 65.8% sequence identity with their closest known BRs and <35% identity with the reference ISCro4, yet they matched the reference protein's 3‑D structure, showing the model captures core structural logic rather than merely imitating sequences.
Antimicrobial Peptide (AMP) Design: Using a context‑aware fine‑tuning strategy, EDEN generated a library of 33 peptides. Laboratory testing showed 97% of the sequences possessed antimicrobial activity, with top candidates achieving micromolar minimum inhibitory concentrations against multidrug‑resistant Gram‑negative bacteria such as Acinetobacter baumannii. The designed peptides exhibited low similarity to existing databases, confirming genuine de novo design capability.
Synthetic Microbiome Construction: After fine‑tuning on gut‑microbiome data, EDEN generated a synthetic metagenome containing over 90,000 species and totaling roughly one gigabase. 99% of the species were correctly classified as gut‑associated, and the generated metabolic pathways aligned with known cross‑species interactions. The model also recreated host‑integrated prophage structures, demonstrating capture of fine‑grained host‑virus interaction logic.
Implications and Broader Context
These four cross‑scale experiments collectively demonstrate that a single foundation model, pre‑trained on unified evolutionary data, can serve as a universal biological design engine, requiring only minimal task‑specific data to drive innovations from molecular engineering to ecosystem‑level synthesis. The work is reported in the pre‑print "Designing AI‑programmable therapeutics with the EDEN family of foundation models" (bioRxiv DOI: https://doi.org/10.64898/2026.01.12.699009).
Recent advances such as AlphaFold 3, NVIDIA’s BioNeMo, and Ginkgo Bioworks’ automated platforms illustrate a broader trend of AI‑driven synthetic biology, where data‑rich, high‑capacity models transform biology from descriptive science into an engineering discipline capable of addressing global health, environmental, and resource challenges.
HyperAI Super Neural