How scLong’s Billion‑Parameter Model Reads the Whole Single‑Cell Transcriptome

The scLong foundation model, trained on 48 million cells and ~28,000 genes per cell, integrates full‑transcriptome expression with Gene Ontology knowledge to outperform existing methods on genetic perturbation prediction, chemical response prediction, cancer drug response prediction, gene‑regulatory network inference, and batch integration.


Background

Single‑cell transcriptomics aims to infer cell states and regulatory relationships from gene‑expression profiles, and to predict cellular responses to genetic or chemical perturbations. Existing foundation models typically restrict attention to a few thousand highly expressed genes to reduce computational cost, discarding the majority of low‑ or zero‑expression genes and offering no systematic way to integrate external biological knowledge such as the Gene Ontology (GO).

Model Overview (scLong)

scLong is a 1‑billion‑parameter foundation model pretrained on ~48 million human cells (~1,618 public datasets covering >50 tissue types). It models the full human transcriptome (~27,874 genes per cell), including protein‑coding and non‑coding genes, and incorporates GO annotations as structured priors.

Gene‑as‑Token Representation

Each cell is treated as a long sentence whose “words” each pair a gene identifier with its expression value. Two embedding modules map the gene identifier to a biologically aware gene embedding and the numeric expression value to a vector; the sum of these two vectors forms the initial token representation.

token_i = embed_gene(gene_id_i) + embed_expression(expr_i)
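A minimal PyTorch sketch of this token construction, assuming an embedding table for gene identities and a small MLP for the scalar expression value (module and parameter names here are illustrative, not taken from the paper):

import torch
import torch.nn as nn

class GeneTokenEmbedding(nn.Module):
    """token_i = embed_gene(gene_id_i) + embed_expression(expr_i)"""
    def __init__(self, n_genes: int, d_model: int):
        super().__init__()
        self.gene_embed = nn.Embedding(n_genes, d_model)   # learned gene-identity embedding
        self.expr_embed = nn.Sequential(                   # maps a scalar expression to a vector
            nn.Linear(1, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, gene_ids, expr):
        # gene_ids: (batch, n_genes) long; expr: (batch, n_genes) float
        return self.gene_embed(gene_ids) + self.expr_embed(expr.unsqueeze(-1))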

Dual‑Encoder Architecture

High‑expression branch: a larger Performer encoder processes genes with high read counts.

Low‑expression branch: a smaller Performer encoder processes low‑ and zero‑expression genes.

Outputs of both branches are merged by a full‑length Performer that attends over the entire gene sequence, preserving genome‑wide context while keeping computational cost tractable.
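The routing logic can be sketched as follows, using PyTorch's standard TransformerEncoder as a stand-in for the Performer's linear-attention blocks; the split point k, layer counts, and dimensions are illustrative assumptions, not the paper's configuration:

import torch
import torch.nn as nn

d = 512
layer = lambda: nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
enc_high = nn.TransformerEncoder(layer(), num_layers=12)  # deep branch for expressed genes
enc_low  = nn.TransformerEncoder(layer(), num_layers=2)   # shallow branch for low/zero genes
enc_full = nn.TransformerEncoder(layer(), num_layers=4)   # genome-wide merge pass

def encode_cell(tokens, expr, k=2048):
    # tokens: (B, G, d) initial gene tokens; expr: (B, G) raw expression values
    order = expr.argsort(dim=1, descending=True)              # highest-expression genes first
    tokens = torch.gather(tokens, 1, order.unsqueeze(-1).expand_as(tokens))
    merged = torch.cat([enc_high(tokens[:, :k]),              # heavy compute on high counts
                        enc_low(tokens[:, k:])], dim=1)       # cheap compute on the long tail
    return enc_full(merged)                                   # attend over all ~28k genes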

GO‑Graph Integration

A gene graph is constructed where an edge connects two genes if they share GO terms in any of the three ontologies (Biological Process, Molecular Function, Cellular Component). A Graph Convolutional Network (GCN) is applied to this graph to produce enriched gene embeddings that encode functional similarity.

# Pseudo-code for GO graph construction
for g in genes:
    go_terms[g] = go_annotations(g)            # GO terms from all three ontologies

for g_i, g_j in all_pairs(genes):
    # connect genes whose GO annotation sets are sufficiently similar
    if similarity(go_terms[g_i], go_terms[g_j]) > threshold:
        gene_graph.add_edge(g_i, g_j)

# A GCN over this graph yields functionally enriched gene embeddings
gene_emb = GCN(gene_graph, init_emb)
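For concreteness, a single GCN propagation step with the standard symmetric normalization looks like this in PyTorch (a generic sketch; scLong's exact layer configuration is not specified here):

import torch

def gcn_layer(adj, x, weight):
    # adj: (G, G) 0/1 gene adjacency; x: (G, d_in) embeddings; weight: (d_in, d_out)
    a = adj + torch.eye(adj.size(0))          # add self-loops
    d_inv_sqrt = a.sum(dim=1).rsqrt()         # D^{-1/2}
    a_hat = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
    return torch.relu(a_hat @ x @ weight)     # propagate over neighbors, then transform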

Pre‑training Objective

scLong uses a BERT‑style masked language modeling objective adapted to expression data. Random subsets of expression values are masked (replaced with a special token or set to zero) and the model learns to reconstruct the original values, encouraging it to capture gene‑gene dependencies and functional context.
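In code, the objective reduces to something like the following sketch; the paper's exact masking rate, mask token, and loss function may differ:

import torch
import torch.nn.functional as F

def masked_expression_loss(model, gene_ids, expr, mask_frac=0.15):
    mask = torch.rand_like(expr) < mask_frac          # choose positions to corrupt
    corrupted = expr.masked_fill(mask, 0.0)           # zero stands in for the mask token
    pred = model(gene_ids, corrupted)                 # (B, G) reconstructed expression
    return F.mse_loss(pred[mask], expr[mask])         # score only the masked positions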

Downstream Evaluation

Genetic Perturbation Prediction

Benchmark: Norman dataset (gene knockout/overexpression perturbations). scLong achieved:

Pearson correlation = 0.625 (Seen 0/1 scenario), surpassing GEARS (0.561) and other baselines.

Mean‑squared error = 0.170 (Seen 0/2 scenario), better than Geneformer, scGPT, scFoundation, UCE.

Higher accuracy in identifying synergistic and suppressor gene interactions.

Chemical Perturbation Prediction

Benchmark: L1000 subset. scLong outperformed Geneformer, scGPT, scFoundation, UCE, and task‑specific DeepCE across RMSE, Spearman/Pearson correlations, and Top‑100 precision, indicating superior ability to forecast transcriptional responses to novel compounds.
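As an illustration of the Top-100 precision metric, here is one common formulation: the overlap between the 100 most up-regulated genes in the prediction and in the ground truth. The benchmark's exact definition may differ:

import numpy as np

def top_k_precision(pred, true, k=100):
    # pred, true: (n_genes,) predicted and measured expression changes
    top_pred = set(np.argsort(pred)[-k:])     # indices of the k largest predicted changes
    top_true = set(np.argsort(true)[-k:])
    return len(top_pred & top_true) / k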

Cancer Drug Response Prediction

Benchmark: DeepCDR dataset. scLong reached Pearson = 0.878, exceeding Geneformer (0.852), scFoundation (0.867), DeepCDR (0.837), and a linear baseline (0.746). In out‑of‑distribution drug‑combination tests, scLong obtained AUROC = 0.652, outperforming competing models.

Gene Regulatory Network (GRN) Inference

Using learned gene‑similarity embeddings, scLong achieved an AUPR ratio of 1.35 (AUPR relative to that of a random predictor), markedly higher than Geneformer, scGPT, scFoundation, UCE, DeepSEM, GENIE3, and GO‑only baselines.
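A sketch of this evaluation, assuming edges are ranked by cosine similarity of the learned gene embeddings and the AUPR is normalized by a random predictor's AUPR (which equals the positive-edge rate); the benchmark's exact protocol may differ:

import numpy as np
from sklearn.metrics import average_precision_score

def aupr_ratio(gene_emb, true_adj):
    # gene_emb: (G, d) learned embeddings; true_adj: (G, G) 0/1 ground-truth edges
    e = gene_emb / np.linalg.norm(gene_emb, axis=1, keepdims=True)
    scores = e @ e.T                               # cosine similarity as edge score
    iu = np.triu_indices_from(scores, k=1)         # each undirected pair once
    aupr = average_precision_score(true_adj[iu], scores[iu])
    return aupr / true_adj[iu].mean()              # random predictor's AUPR = positive rate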

Batch Integration (Zero‑Shot)

On a pancreas dataset without any fine‑tuning, scLong obtained batch ASW = 0.96, surpassing Raw, HVG, scVI, and other foundation models, demonstrating strong transferability.
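For context, batch ASW is derived from the silhouette width computed on batch labels. A simplified sketch following the common scIB-style rescaling, where 1.0 means batches are perfectly mixed (the full scIB metric averages this within each cell-type group):

import numpy as np
from sklearn.metrics import silhouette_score

def batch_asw(cell_emb, batch_labels):
    # cell_emb: (n_cells, d) model embeddings; batch_labels: (n_cells,) batch IDs
    s = silhouette_score(cell_emb, batch_labels)   # high |s| = batches well separated (bad)
    return 1.0 - abs(s)                            # rescale so 1.0 = fully mixed batches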

Ablation Studies

Two key components were removed in separate experiments:

Omitting the low‑expression branch caused a measurable drop in all downstream metrics.

Removing the GO graph (i.e., training without functional priors) also reduced performance.

These results confirm that full‑gene coverage and explicit GO integration are essential for scLong’s gains.

Implications

scLong’s ability to model the complete transcriptome and embed structured biological knowledge enables:

Accurate prediction of transcriptional effects of single‑gene or combinatorial perturbations, reducing the need for extensive wet‑lab screens.

Rapid in‑silico assessment of chemical and cancer‑drug responses, supporting drug discovery and precision medicine.

Improved reconstruction of gene regulatory networks that reflect both data‑driven co‑expression and known functional relationships.

Robust, batch‑agnostic cell representations useful for integrating heterogeneous single‑cell datasets.

Overall, scLong demonstrates that scaling foundation models to billions of parameters while incorporating full‑genome context and domain‑specific priors yields a versatile tool for systems biology and biomedical AI.

Reference: https://www.nature.com/articles/s41467-026-69102-y

Tags: bioinformatics, foundation model, gene ontology, scLong, single-cell
Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.