LucaOne: Unified Nucleic Acid & Protein Language Model Surpasses Other Models

Researchers present LucaOne, a Transformer‑based foundation model that unifies DNA/RNA and protein sequences using a 39‑token vocabulary, rotary positional encoding, and molecule‑type embeddings, and demonstrate through extensive multi‑task benchmarks that it outperforms domain‑specific models across seven biological tasks.

Network Intelligence Research Center (NIRC)

Model Architecture

The study introduces LucaOne, a Transformer‑based biological foundation model that unifies nucleic acid (DNA/RNA) and protein sequences. It employs a shared 39‑token vocabulary covering the four nucleotides and the twenty standard amino acids. To handle long sequences, rotary positional encoding replaces absolute positional encoding, and pre‑layer normalization improves training stability. The architecture stacks twenty Transformer encoder blocks with a hidden dimension of 2560, totaling 1.8 billion parameters and supporting input sequences of up to 1,280 tokens. A molecule‑type embedding (0 for nucleic acids, 1 for proteins) distinguishes the two sequence modalities.
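The core idea of the shared vocabulary plus molecule-type tagging can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the toy vocabulary below contains only specials and the residue letters (the real 39-token table also includes ambiguity codes and further special symbols), and all function and variable names are assumptions.

```python
# Hypothetical sketch of LucaOne-style unified tokenization: nucleotides and
# amino acids share one token table, and a parallel molecule-type id
# (0 = nucleic acid, 1 = protein) tags every position so the model can tell
# the two modalities apart. Names and the toy vocabulary are assumptions.

NUCLEOTIDES = list("ACGT")
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Shared vocabulary: overlapping letters (A, C, G, T are both nucleotide and
# amino-acid symbols) appear exactly once, as in a unified token table.
VOCAB = SPECIALS + sorted(set(NUCLEOTIDES) | set(AMINO_ACIDS))
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(seq: str, molecule: str):
    """Return (token_ids, molecule_type_ids) for one sequence."""
    mol_id = 0 if molecule == "nucleic_acid" else 1
    ids = [TOKEN_TO_ID["[CLS]"]] + [TOKEN_TO_ID[c] for c in seq.upper()]
    return ids, [mol_id] * len(ids)

dna_ids, dna_types = encode("ACGT", "nucleic_acid")
prot_ids, prot_types = encode("MKV", "protein")
```

Because the letters are shared, the same token id for "A" is produced for both a DNA adenine and a protein alanine; it is the molecule-type embedding added to the token embedding that lets the model disambiguate them.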

Training Strategy

The authors assembled a large‑scale training dataset spanning 169,000 species and adopted a semi‑supervised learning scheme that blends self‑supervised and supervised signals. The primary pre‑training task is masked language modeling over both nucleic acids and proteins; eight additional annotation‑driven supervised tasks cover genomic region identification (e.g., CDS, introns), protein functional‑site prediction, and domain recognition. After processing 3.695 billion tokens, the model captures not only statistical sequence patterns but also biological semantics, spontaneously aligning the nucleic‑acid and protein embeddings of the same gene in latent space.
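The masked-language-modeling corruption step at the heart of the self-supervised signal can be sketched as follows. This uses the standard BERT-style 80/10/10 recipe as an assumption (the summary does not give LucaOne's exact ratios), and all names are hypothetical.

```python
import random

# Hypothetical sketch of the MLM corruption step: a fraction of tokens is
# selected, and each selected token is replaced with [MASK] (80%), a random
# token (10%), or kept unchanged (10%). The ratios are the standard BERT
# recipe, assumed here rather than taken from the paper.

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """Return (corrupted_ids, labels); labels are -100 at unmasked positions."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            labels.append(tid)                           # predict the original
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(mask_id)                # 80%: [MASK]
            elif roll < 0.9:
                corrupted.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                corrupted.append(tid)                    # 10%: unchanged
        else:
            labels.append(-100)                          # ignored by the loss
            corrupted.append(tid)
    return corrupted, labels
```

Because the vocabulary is shared, the same corruption routine serves both modalities; the supervised annotation tasks then add extra prediction heads on top of the same encoder.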

Multi‑Task Evaluation

LucaOne was evaluated on seven downstream biological computation tasks. On the nucleic‑acid taxonomic classification benchmark GenusTax, it achieved 5% higher accuracy than DNABert2; on non‑coding RNA family classification (ncRNAFam) it improved accuracy by 2.6%. On cross‑modal tasks such as central‑dogma validation and ncRNA‑protein interaction prediction, the unified model outperformed the combination of DNABert2 and ESM2‑3B, confirming the benefit of joint multimodal training. The model also generalized well to low‑homology sequences, attaining 100% accuracy on the influenza antigen prediction task (InfA).
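A common way to run such downstream evaluations is to freeze the pre-trained encoder, pool its per-token embeddings into one vector per sequence, and fit a light classifier on top. The sketch below illustrates that pattern with a toy nearest-centroid classifier; it is an assumed evaluation setup, not the benchmark code, and all names are hypothetical.

```python
import numpy as np

# Hypothetical downstream-evaluation sketch: mean-pool frozen per-token
# embeddings into a fixed-size vector per sequence, then classify with a
# tiny stand-in model (nearest class centroid). The real benchmarks would
# use the frozen LucaOne encoder and a trained head instead.

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse (seq_len, dim) token embeddings to a single (dim,) vector."""
    return token_embeddings.mean(axis=0)

def nearest_centroid_predict(train_X, train_y, test_X):
    """Assign each test vector to the class with the closest mean vector."""
    classes = sorted(set(train_y))
    centroids = {
        c: train_X[[i for i, y in enumerate(train_y) if y == c]].mean(axis=0)
        for c in classes
    }
    return [min(classes, key=lambda c: np.linalg.norm(x - centroids[c]))
            for x in test_X]
```

The appeal of this setup is that a single frozen embedding serves every task; only the small head differs between GenusTax, ncRNAFam, and the cross-modal benchmarks.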

Conclusion and Future Directions

The research establishes a new standard for multimodal biological sequence analysis by providing a pre‑trained embedding that supports diverse tasks while reducing reliance on large labeled datasets. The authors suggest extending the framework to incorporate additional modalities such as chromatin states and DNA methylation, and to design more efficient cross‑modal attention mechanisms for capturing complex inter‑modal interactions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

transformer, multimodal, bioinformatics, foundation model, DNA, protein
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
