MIT’s APOLLO Framework Breaks Limits, Separating Shared and Modality‑Specific Cell Signals

MIT and ETH Zurich introduce APOLLO, a deep‑learning autoencoder that learns a partially overlapping latent space to explicitly disentangle shared and modality‑specific information in multimodal single‑cell datasets, demonstrating superior cell‑type classification, superior cross‑modal prediction, and new insights into protein localization across sequencing and imaging data.

Rapid advances in single‑cell technologies—such as multiplexed imaging, scRNA‑seq, scATAC‑seq, and protein abundance detection—enable panoramic observation of individual cells across transcriptional, chromatin, protein, and morphological dimensions. Integrating these high‑throughput multimodal data promises deeper insight into cellular heterogeneity and disease mechanisms.

Current integration strategies suffer two major drawbacks. The dominant approach analyzes each modality separately and then aligns results, which is inefficient and fails to capture deep cross‑modal relationships. Alternative methods that embed all modalities into a single latent space often conflate shared information with modality‑specific signals, obscuring each modality’s unique contribution.

To address these issues, a joint MIT–ETH Zurich team proposes APOLLO (Autoencoder with a Partially Overlapping Latent space learned through Latent Optimization). APOLLO explicitly models shared and modality‑specific information by learning a latent space where only a subset of dimensions is aligned across modalities, while the remaining dimensions retain modality‑specific representations.

The architecture equips each modality with its own autoencoder; encoders and decoders are tailored to the data type (e.g., convolutional networks for imaging, fully connected networks for gene expression). The latent space is divided into a large shared subspace and smaller modality‑specific subspaces. Training proceeds in two steps: (1) jointly optimize all decoders together with the per‑cell latent codes themselves (the latent optimization of the method's name) to reconstruct the inputs, optionally adding extra decoders that map the shared subspace to each modality for cross‑modal prediction; (2) train modality‑specific encoders to map raw data into their respective latent subspaces by minimizing mean‑squared error, so the model can infer latent codes for unseen samples.
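
A minimal PyTorch sketch of this two‑step scheme is shown below. It is illustrative only: the layer sizes, latent dimensions, and module names (dec_rna, enc_rna, etc.) are assumptions made for the example, not the authors' released implementation, and random tensors stand in for real scRNA‑seq/scATAC‑seq inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions, not values from the paper).
N_CELLS, D_SHARED, D_SPEC = 1_000, 32, 8
D_RNA, D_ATAC = 2_000, 5_000  # per-modality input dimensions

# Step 1: per-cell latent codes are free parameters ("latent optimization").
# The shared block is common to both modalities; each modality appends its
# own specific block, so the two latent spaces only partially overlap.
z_shared    = nn.Parameter(0.01 * torch.randn(N_CELLS, D_SHARED))
z_rna_spec  = nn.Parameter(0.01 * torch.randn(N_CELLS, D_SPEC))
z_atac_spec = nn.Parameter(0.01 * torch.randn(N_CELLS, D_SPEC))

dec_rna  = nn.Sequential(nn.Linear(D_SHARED + D_SPEC, 256), nn.ReLU(), nn.Linear(256, D_RNA))
dec_atac = nn.Sequential(nn.Linear(D_SHARED + D_SPEC, 256), nn.ReLU(), nn.Linear(256, D_ATAC))

opt1 = torch.optim.Adam([z_shared, z_rna_spec, z_atac_spec,
                         *dec_rna.parameters(), *dec_atac.parameters()], lr=1e-3)

x_rna, x_atac = torch.randn(N_CELLS, D_RNA), torch.randn(N_CELLS, D_ATAC)  # toy data
for _ in range(100):
    opt1.zero_grad()
    loss = (F.mse_loss(dec_rna(torch.cat([z_shared, z_rna_spec], 1)), x_rna)
            + F.mse_loss(dec_atac(torch.cat([z_shared, z_atac_spec], 1)), x_atac))
    loss.backward()
    opt1.step()

# Step 2: with the learned codes fixed, fit per-modality encoders that map
# raw data to those codes (MSE), enabling inference on unseen cells.
enc_rna = nn.Sequential(nn.Linear(D_RNA, 256), nn.ReLU(), nn.Linear(256, D_SHARED + D_SPEC))
opt2 = torch.optim.Adam(enc_rna.parameters(), lr=1e-3)
target = torch.cat([z_shared, z_rna_spec], 1).detach()
for _ in range(100):
    opt2.zero_grad()
    F.mse_loss(enc_rna(x_rna), target).backward()
    opt2.step()
```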

APOLLO’s performance was evaluated on several public multimodal single‑cell datasets. For sequencing data, paired SHARE‑seq scRNA‑seq and scATAC‑seq measurements were used to test whether the model could separate gene activity captured jointly by both modalities from activity captured by a single modality. A CITE‑seq dataset from mouse spleen and lymph node (two wild‑type samples) provided paired scRNA‑seq and surface‑protein abundance, allowing assessment of cell‑type discrimination and batch‑effect separation. For imaging, a human peripheral‑blood mononuclear cell (PBMC) dataset comprising 40 patients and 32,345 cells was collected with two antibody panels, enabling analysis of chromatin structure, protein localization, and morphological features.

On the sequencing benchmarks, APOLLO successfully identified shared gene activity while preserving modality‑specific signals, leading to a marked increase in cell‑type classification accuracy compared with methods that lack a dedicated specific subspace. In the CITE‑seq experiment, the model separated biological cell‑type variation into the shared space and isolated batch effects into the RNA‑specific space, a capability absent in existing integrators.

Imaging experiments showed that APOLLO could accurately reconstruct images of cells from patients not seen during training. When tasked with predicting unmeasured proteins from chromatin images, APOLLO outperformed conventional image‑inpainting techniques, and downstream phenotype classification using predicted protein images achieved accuracy comparable to that using real images, with CD3 prediction performing best.
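
The inference path for such cross‑modal prediction follows directly from the architecture: only the shared latent block carries cross‑modal information, so a chromatin image is encoded, its modality‑specific dimensions dropped, and the shared code decoded into the unmeasured protein channel. The sketch below is a hedged illustration; enc_chromatin, dec_protein, and the shared‑dimension count are hypothetical stand‑ins consistent with the training sketch above.

```python
import torch

@torch.no_grad()
def predict_protein(chromatin_img, enc_chromatin, dec_protein, d_shared=32):
    """Cross-modal prediction sketch: chromatin image -> protein image.

    enc_chromatin and dec_protein stand in for a trained image encoder and
    an extra decoder that maps the shared subspace to protein images.
    """
    z = enc_chromatin(chromatin_img)   # shared dims followed by image-specific dims
    z_shared = z[:, :d_shared]         # modality-specific dims carry no cross-modal signal
    return dec_protein(z_shared)       # shared-space decoder yields the unmeasured protein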

Latent‑space analysis revealed biologically meaningful partitioning: the RNA‑specific space was enriched for cell‑cycle genes, the ATAC‑specific space for chromatin‑accessibility regions, and the shared space for known transcription factors and regulatory pathways. In imaging, the shared space captured chromatin morphology (e.g., nuclear area, heterochromatin volume), while protein‑specific spaces captured features such as γH2AX foci counts. Ablation studies confirmed that removing modality‑specific features significantly reduced classification performance, validating the disentanglement.
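
The spirit of that ablation is easy to reproduce: zero out the modality‑specific block of the latent codes and compare downstream classification accuracy. The sketch below is an assumption about the analysis, using a scikit‑learn classifier rather than the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ablation_scores(z_full: np.ndarray, labels: np.ndarray, d_shared: int):
    """Cell-type classification with vs. without modality-specific dims."""
    z_ablated = z_full.copy()
    z_ablated[:, d_shared:] = 0.0  # remove the modality-specific block
    clf = LogisticRegression(max_iter=1000)
    full    = cross_val_score(clf, z_full,    labels, cv=5).mean()
    ablated = cross_val_score(clf, z_ablated, labels, cv=5).mean()
    return full, ablated  # a drop after ablation means the specific dims carry signal
```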

Model robustness was further demonstrated on five simulated datasets with known latent structures; APOLLO maintained stable performance regardless of the dependency between shared and specific features. On real data, explicit learning of partially shared information enabled precise cross‑modal predictions, such as inferring protein expression from chromatin images.

Overall, APOLLO provides a general deep‑learning framework that learns a partially shared latent space to disentangle and interpret multimodal single‑cell data, facilitating mechanistic discovery and downstream applications. The method was described in the Nature Computational Science paper “Partially shared multi‑modal embedding learns holistic representation of cell state” (2025). Related multimodal integration efforts such as scMTR‑seq (six‑modality histone‑modification profiling) and CellFuse (supervised contrastive learning for limited feature overlap) are discussed as complementary advances in the field.

Figure: APOLLO two‑step training process
Figure: Differentially expressed genes in each latent space
Figure: APOLLO applied to the CITE‑seq dataset
Figure: APOLLO simulates protein subcellular localization
Tags: deep learning, latent space, bioinformatics, autoencoder, single-cell, multimodal integration
Written by HyperAI Super Neural, deconstructing the sophistication and universality of technology and covering cutting-edge AI for Science case studies.