How French Researchers Used Deep Learning to Predict 2.39 Million Anti‑Phage Proteins and Map Bacterial Immunity
A French team at the Pasteur Institute built three complementary deep‑learning models (ALBERT_DF, ESM_DF, and GeneCLR_DF) to predict anti‑phage proteins at genome scale. The strongest of the three, GeneCLR_DF, reached 99% precision and 92% recall. Its predictions comprise roughly 2.39 million candidate proteins and 23,000 operon families, most of them novel, dramatically expanding the known bacterial antiviral repertoire.
Background and Motivation
Bacteria and bacteriophages are locked in an ongoing arms race, with phages outnumbering bacteria roughly ten‑to‑one. While over 250 anti‑phage systems have been experimentally validated, traditional experimental and computational methods miss many potential defenses hidden in bacterial genomes.
Previous studies noted recurring domain patterns and enrichment of defense genes in "defense islands" or prophage regions, suggesting that systematic genome‑wide pattern mining could reveal unknown systems.
Model Development
The Pasteur Institute researchers created three complementary deep‑learning models:
ALBERT_DF: learns from local genomic context by treating protein families as words and gene neighborhoods as sentences, using an ALBERT‑style architecture.
ESM_DF: leverages a protein‑language model to capture residue‑level co‑variation and long‑range sequence relationships.
GeneCLR_DF: combines sequence and genomic‑context embeddings via contrastive learning, aligning the two representations for each gene.
These models address the limitations of the traditional "defense‑score" method, which requires at least five homologs per family and overlooks defenses outside typical islands.
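The contrastive alignment behind GeneCLR_DF can be illustrated with a small sketch. The article says only that sequence and genomic‑context embeddings are aligned via contrastive learning; the InfoNCE loss, the cosine similarity, the temperature value, and every name below are assumptions chosen for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(seq_embs, ctx_embs, temperature=0.1):
    """Mean InfoNCE loss (assumed objective, not confirmed by the article).

    Gene i's sequence embedding is pulled toward gene i's context embedding
    and pushed away from the context embeddings of the other genes.
    """
    n = len(seq_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(seq_embs[i], ctx_embs[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy 2-D embeddings for two genes (hypothetical values).
seq = [[1.0, 0.0], [0.0, 1.0]]
ctx_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each context matches its own gene
ctx_swapped = [[0.1, 0.9], [0.9, 0.1]]   # contexts paired with the wrong gene

loss_aligned = info_nce(seq, ctx_aligned)
loss_swapped = info_nce(seq, ctx_swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each gene's two views agree and high when they are mismatched, which is the property that lets a model trained this way score a gene from both its sequence and its neighborhood.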
Dataset Construction
Using DefenseFinder v1.3 and PadLoc, the team scanned 32,798 RefSeq bacterial genomes (≈1.23 × 10⁸ proteins). DefenseFinder identified 521,360 proteins (0.4%) and PadLoc 805,357 proteins (0.65%) as known anti‑phage components.
To train the models, they built the Gembase_DF dataset focused on Actinobacteria: 10,796 genomes, 4.2 × 10⁶ protein families clustered, with a vocabulary of the 524,288 most common families covering ~89% of proteins.
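The "families as words" idea and the coverage figure can be made concrete with a toy example. This is a minimal sketch under stated assumptions: the family names, the tiny vocabulary size, and the `unk_id` convention are hypothetical, and the real pipeline's tokenization details are not described in the article.

```python
from collections import Counter

def build_vocab(genomes, vocab_size):
    """Keep the vocab_size most frequent protein families.

    Returns the family -> token id mapping and the fraction of all
    proteins whose family is in the vocabulary (the coverage).
    """
    counts = Counter(family for genome in genomes for family in genome)
    vocab = {family: idx
             for idx, (family, _) in enumerate(counts.most_common(vocab_size))}
    total = sum(counts.values())
    covered = sum(counts[family] for family in vocab)
    return vocab, covered / total

def encode(genome, vocab, unk_id=-1):
    """Turn a gene neighborhood (a 'sentence') into token ids."""
    return [vocab.get(family, unk_id) for family in genome]

# Each genome is an ordered list of protein-family labels (hypothetical data).
genomes = [
    ["famA", "famB", "famC", "famA"],
    ["famA", "famB", "famD"],
    ["famB", "famE", "famA"],
]

vocab, coverage = build_vocab(genomes, vocab_size=2)
tokens = encode(genomes[0], vocab)
print(coverage, tokens)
```

Families outside the vocabulary collapse to a single unknown token, which is why a fixed vocabulary trades coverage for model size.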
For ESM_DF and GeneCLR_DF, the Gembase_DF dataset used the 521,360 DefenseFinder‑annotated proteins as positives, 1.16 × 10⁸ highly conserved core genes and 1.4 × 10⁷ mobile‑element genes as negatives, and retained the remaining proteins as unlabeled candidates.
Data splits were constructed so that all proteins from the same defense system fell into the same fold, and MMseqs2 removed residual cross‑fold homology to prevent information leakage.
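A grouped split of this kind can be sketched as follows. The greedy balancing strategy, the function name, and the toy protein and system identifiers are assumptions for illustration; the MMseqs2 homology filtering step is not reproduced here.

```python
from collections import defaultdict

def grouped_folds(protein_to_system, n_folds):
    """Assign whole defense systems to folds so no system is split.

    Greedy heuristic (an assumption, not the authors' procedure):
    place the largest systems first, each into the currently smallest fold.
    """
    systems = defaultdict(list)
    for protein, system in protein_to_system.items():
        systems[system].append(protein)

    folds = [[] for _ in range(n_folds)]
    # Sort by descending size, then by name for a deterministic order.
    for system in sorted(systems, key=lambda s: (-len(systems[s]), s)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(systems[system])
    return folds

# Hypothetical proteins labeled with the defense system they belong to.
protein_to_system = {
    "p1": "sysA", "p2": "sysA", "p3": "sysA",
    "p4": "sysB", "p5": "sysB",
    "p6": "sysC",
}

folds = grouped_folds(protein_to_system, n_folds=2)
print(folds)
```

Keeping each system in one fold means a model is never evaluated on a protein whose operon partners it saw during training.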
Benchmark Results
On a unified benchmark, GeneCLR_DF achieved the best performance: 99% precision and 92% recall. ALBERT_DF predicted 1,930 candidate families, 33% overlapping with defense‑score results; experimental validation of ten candidates in Streptomyces showed six providing >100‑fold protection against 12 phages.
Six ESM_DF candidates were validated experimentally in E. coli, confirming both known domain variants and novel DUF7946‑type proteins and showing that the model captures function beyond strict sequence homology.
GeneCLR_DF distinguished defense from non‑defense proteins with a clear score separation, assigning high scores to reverse‑transcriptase, CBASS, and Thoeris branches, while ESM‑650M_DF identified only a subset.
At a threshold of –0.74, GeneCLR_DF reached 99% precision and 92.4% recall; at the same precision, ESM_DF recalled only 58%. With a 1% false‑positive rate, GeneCLR_DF retrieved 94% of known families (vs. 35% for ESM‑650M_DF and 5% for defense‑score), and 56% of the families were found by GeneCLR_DF alone.
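For readers less familiar with threshold‑based metrics, here is a minimal sketch of how precision and recall follow from a score cutoff. The scores and labels are invented toy values, and treating scores at or above the threshold as positive predictions is an assumption, since the article does not state the direction of the score.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy model scores with ground-truth labels (1 = defense, 0 = non-defense).
scores = [1.2, 0.3, -0.5, -0.9, -1.4, -2.0]
labels = [1, 1, 1, 1, 0, 0]

precision, recall = precision_recall(scores, labels, threshold=-0.74)
print(precision, recall)
```

Lowering the threshold recovers more true defense proteins (higher recall) at the risk of admitting false positives (lower precision), which is why the comparison above fixes precision at 99% before comparing recall across models.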
Large‑Scale Prediction and Biological Insights
Applying GeneCLR_DF to 32,000+ bacterial genomes yielded ~2.39 million predicted anti‑phage proteins. On average, 1.5% of genes in a typical genome participate in antiviral defense. Over 85% of predicted protein families had no prior immunity annotation, and ~23,000 operon families were defined, most being novel.
Operon‑level analysis showed the median proportion of defense genes rising from 0.46% to 1.53% of a genome. Approximately 23.5% of predicted defenses reside on mobile‑genetic‑element boundaries, and 47.1% of satellite elements were predicted to encode defense functions.
GeneCLR_DF expanded defense‑related Pfam families from 934 to 3,154 (≈15% of all Pfam). More than 400,000 predicted families lack any Pfam annotation; <5% appear in DefenseFinder, and >3,500 operon families consist solely of proteins with no known domains, highlighting a vast unexplored molecular space.
Limitations and Future Directions
The original defense‑score approach cannot handle families with fewer than five homologs and misses defenses outside typical islands. ALBERT_DF’s reliance on discrete “word” vocabularies limits cross‑species generalization, while ESM_DF still depends partly on sequence similarity.
GeneCLR_DF’s complementary design mitigates these issues, but further model scaling and incorporation of additional genomic signals could improve detection of highly divergent systems.
Broader Impact
Similar strategies are emerging, such as MIT’s DefensePredictor, which integrates protein‑language models with genomic context to identify ~82% of novel systems across ~17,000 prokaryotic genomes. In industry, companies like Locus Biosciences and Micreos are leveraging machine‑learning‑guided phage engineering for therapeutic and food‑safety applications, illustrating the translational potential of computational anti‑phage discovery.
The study, titled "Protein and genomic language models uncover the unexplored diversity of bacterial immunity" (Science), demonstrates how deep‑learning‑driven systematic mining can dramatically accelerate the mapping of bacterial antiviral defenses.
HyperAI Super Neural
Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
