Columbia & Stanford Launch Squidiff: Diffusion Model for Transcriptome Simulation
Squidiff, a conditional diffusion framework co‑developed by Columbia and Stanford, predicts transcriptional responses across cell differentiation, gene and drug perturbations, and radiation exposure, outperforming prior models and enabling more precise and spatially aware biomedical research.
Background and Motivation
Cellular systems are complex dissipative entities far from chemical equilibrium, and understanding how heterogeneous cell populations collectively respond to external stimuli remains a core challenge. While single‑cell RNA‑seq reveals cellular heterogeneity, accurately reconstructing whole‑transcriptome trajectories after perturbations is still difficult.
Dataset Construction
The authors assembled a comprehensive multi‑scenario dataset covering simulated and real experiments for cell differentiation, gene perturbation, drug treatment, and vascular‑organoid radiation response. All data underwent uniform quality control: cells with >20% mitochondrial reads or <1,000 detected genes were removed, low‑expression genes were filtered, doublets and stress‑related genes were excluded in some scenarios, and log‑normalization corrected sequencing depth differences to ensure cross‑dataset comparability.
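The quality-control rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: the `MT-` prefix for mitochondrial genes (the usual human gene-symbol convention), the `target_sum` of 10,000 counts per cell, and the function name `qc_and_lognorm` are all assumptions made for the example. Only the 20% and 1,000-gene thresholds come from the text.

```python
import math

def qc_and_lognorm(cells, max_mito_frac=0.20, min_genes=1000, target_sum=1e4):
    """Filter cells and log-normalize the survivors.

    cells: list of dicts mapping gene symbol -> raw count for one cell.
    Returns a list of dicts mapping gene symbol -> log1p(depth-scaled count).
    """
    kept = []
    for counts in cells:
        total = sum(counts.values())
        if total == 0:
            continue  # empty barcode, nothing to normalize
        # Assumption: mitochondrial genes are identified by the "MT-" prefix.
        mito = sum(c for gene, c in counts.items() if gene.startswith("MT-"))
        n_detected = sum(1 for c in counts.values() if c > 0)
        if mito / total > max_mito_frac or n_detected < min_genes:
            continue  # fails the QC thresholds quoted in the article
        # Scale every cell to the same depth, then take log(1 + x).
        scale = target_sum / total
        kept.append({gene: math.log1p(c * scale) for gene, c in counts.items()})
    return kept
```

Scaling each cell to a common total before the log transform is what removes sequencing-depth differences; the filtering of low-expression genes, doublets, and stress-related genes mentioned above is not covered by this sketch.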
Simulated data were generated with the Splatter tool using a hierarchical gamma‑Poisson distribution to mimic scRNA‑seq variance. Real differentiation data comprised 4,800 cells from human iPSC to endoderm (days 0‑3); the model was trained on days 0 and 3 and tested on days 1‑2, using the top 203 highly variable genes. Gene‑perturbation data came from a CRISPR screen in K562 cells (≈10,000 cells) with single and double knock‑outs of ZBTB25 and PTPN12. Drug‑treatment data included multiple cell lines exposed to six chemotherapeutics and melanoma drug‑combination responses, integrating SMILES strings, dosage, and fingerprint features. The vascular‑organoid (BVO) dataset was built in‑house from iPSC‑derived endothelial, mural, and fibroblast cells, sampled at day 5 after neutron or photon irradiation and day 11 for scRNA‑seq, yielding ~60,000 cells across 72 organoids with ELISA‑measured inflammatory markers.
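To make the gamma‑Poisson idea concrete, here is a simplified sketch of how such counts can be drawn. It is not Splatter itself (an R package with additional hierarchy levels for library size, outliers, and dropout): each gene's Poisson rate is drawn from a gamma distribution around a mean, and the count is then drawn from a Poisson with that rate. The function names, the `dispersion` parameter, and its default value are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small means only."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_counts(n_cells, gene_means, dispersion=0.5, seed=0):
    """Draw an n_cells x n_genes count matrix from a gamma-Poisson mixture.

    gene_means must be strictly positive. For each entry the Poisson rate is
    gamma-distributed with mean mu and variance mu**2 * dispersion, so the
    counts are overdispersed relative to a plain Poisson.
    """
    rng = random.Random(seed)
    shape = 1.0 / dispersion
    counts = []
    for _ in range(n_cells):
        row = []
        for mu in gene_means:
            rate = rng.gammavariate(shape, mu / shape)  # mean = shape * scale = mu
            row.append(sample_poisson(rate, rng))
        counts.append(row)
    return counts
```

The extra gamma layer is what reproduces the higher-than-Poisson variance typical of scRNA‑seq counts: the marginal variance of each gene is mu + mu² × dispersion rather than mu.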
Model Architecture
Squidiff combines a semantic encoder and a conditional DDIM diffusion module. The encoder, a multi‑layer perceptron (MLP), maps scRNA‑seq profiles into a low‑dimensional semantic space (Z_sem) that captures cell‑type and perturbation information. For drug scenarios, the encoder incorporates a re‑calibrated functional fingerprint (r_FCFP) that embeds 2,048‑dimensional molecular vectors. An adapter module accepts SMILES strings and dosage to fuse chemical information with biological context.
The conditional DDIM pairs a forward and a reverse process. The forward process gradually adds Gaussian noise to the original gene‑expression vector (x₀) over 1,000 steps, separating cell types into Gaussian‑like clusters while preserving Z_sem‑driven biological variation. The reverse process uses a noise‑prediction network (ε_θ), conditioned on a sinusoidal embedding of the time step t and on Z_sem, to denoise x_T back to a biologically meaningful transcriptome.
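The mechanics of the two diffusion directions can be illustrated with the standard DDPM/DDIM formulas. The sketch below is generic and is not taken from the Squidiff code: the linear noise schedule (1e‑4 to 0.02), the transformer-style time embedding, and every function name are assumptions; only the 1,000‑step count comes from the text. The learned network ε_θ is not implemented; `ddim_step` simply takes its output as the argument `eps_pred`.

```python
import math
import random

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, abar_t, noise):
    """Forward process in closed form: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*noise."""
    a, s = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [a * x + s * e for x, e in zip(x0, noise)]

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    # Estimate the clean expression vector implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the earlier (less noisy) level.
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of step t; dim is assumed to be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
```

A quick sanity check of the formulas: if `eps_pred` is the exact noise that was added by `q_sample`, a single `ddim_step` with `abar_prev=1.0` returns the original x₀. In a trained model the prediction comes from ε_θ(x_t, t, Z_sem), and sampling walks through a decreasing sequence of time steps.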
Training Procedure
Training minimizes a noise‑prediction loss using the Adam optimizer (learning rate 1×10⁻⁴) on GPU. Time steps and semantic variables are jointly modulated, enabling the model to generate continuous cellular trajectories.
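The summary does not spell out the loss. Assuming it takes the standard "simple" noise‑prediction form used for DDPM/DDIM models, extended with the semantic condition, it would read as follows, where \bar{\alpha}_t is the cumulative product of (1 − β_s) up to step t and the β_s are the noise‑schedule variances:

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
    \left[ \left\lVert \epsilon
      - \epsilon_\theta\!\left(x_t,\ t,\ Z_{\mathrm{sem}}\right)
    \right\rVert_2^2 \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```

Under this assumption, Z_sem is produced by the semantic encoder from the same cell, so the encoder and the noise‑prediction network can be optimized together with the one objective.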
Performance Advantages
Compared with traditional variational autoencoders, Squidiff does not assume a Gaussian prior, captures complex expression patterns via fine‑grained denoising, and improves F1 scores for rare cell types (<5% abundance) by 27%. A novel “gradient interpolation” strategy linearly combines semantic variables in latent space to produce continuous differentiation paths, revealing transient states such as mid‑endoderm precursors that conventional models miss.
Two latent‑variable manipulation modes are provided: “addition” adds a perturbation direction ΔZ_sem to a baseline representation, shifting the gene‑expression distribution; “interpolation” linearly interpolates between vectors to generate smooth intermediate states.
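Both manipulation modes reduce to simple vector arithmetic on Z_sem. The sketch below shows that arithmetic only; the function names and the `scale` parameter are hypothetical, and in practice each resulting vector would be passed to the conditional DDIM to generate the corresponding transcriptome.

```python
def add_direction(z_base, delta_z, scale=1.0):
    """'Addition' mode: shift a baseline latent along a perturbation direction."""
    return [z + scale * d for z, d in zip(z_base, delta_z)]

def interpolate(z_start, z_end, n_points):
    """'Interpolation' mode: n_points evenly spaced latents from z_start to z_end."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    weights = [i / (n_points - 1) for i in range(n_points)]
    return [[(1 - w) * a + w * b for a, b in zip(z_start, z_end)]
            for w in weights]

# Example: three latents between a day-0-like and a day-3-like representation.
path = interpolate([0.0, 0.0], [2.0, 4.0], 3)
# path == [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
```

One plausible way to obtain ΔZ_sem for the addition mode, stated here as an assumption, is the difference between the mean latent of perturbed cells and the mean latent of control cells.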
Multi‑Scenario Validation
Cell‑differentiation prediction: Using the iPSC‑to‑endoderm dataset, Squidiff trained on days 0 and 3 accurately reconstructed the intermediate states at days 1‑2, correctly down‑regulating the pluripotency marker NANOG, up‑regulating the endoderm factor GATA6, and transiently expressing the mesoderm marker DBX1. The generated transcriptomes aligned closely with the true developmental trajectory.
Gene and drug perturbation prediction: In the K562 double‑knockout experiment, Squidiff predicted non‑additive effects without prior knowledge, outperforming existing methods. For drug experiments, the model inferred combination‑drug synergy from single‑drug data and identified a specific effect of panobinostat on tumor cells. Using the SMILES‑based adapter, Squidiff predicted responses to the unseen drug sglt1 with performance comparable to specialized models.
Blood‑vessel organoid (BVO) modeling: Trained only on day‑0 and day‑11 iPSC‑derived BVO data, Squidiff reproduced differentiation trajectories of endothelial, fibroblast, and mural cells and uncovered an intermediate mural‑to‑endothelial transition missed by traditional approaches. Gene‑expression changes matched known developmental patterns.
Radiation damage and protection: With training limited to endothelial cells, Squidiff accurately forecasted radiation effects on all cell types, showing higher sensitivity in early‑development cells. In G‑CSF protection simulations, the model revealed pathway‑specific protective mechanisms: angiogenesis activation in fibroblasts, apoptosis inhibition in endothelial cells, and genome‑stability enhancement in mural cells. Experimental validation confirmed a significant reduction in cell death after G‑CSF treatment.
Implications
These systematic experiments demonstrate that Squidiff reliably predicts transcriptional changes across diverse biological scenarios, captures transient cellular states, and generalizes to unseen perturbations, providing a powerful computational tool for precision and regenerative medicine.
HyperAI Super Neural
Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
