scSiameseClu Sets New SOTA on Unsupervised Single‑Cell Clustering Across 7 Datasets
The paper introduces scSiameseClu, a Siamese clustering framework that combines dual augmentation, siamese fusion, and optimal‑transport clustering to overcome representation collapse in scRNA‑seq data, and demonstrates state‑of‑the‑art performance on seven diverse single‑cell datasets and downstream annotation tasks.
Background and Motivation
Traditional bulk RNA‑Seq averages gene expression across cells, masking rare cell types. Single‑cell RNA‑Seq (scRNA‑seq) captures full transcriptomes of individual cells, but the data are noisy, sparse, and high‑dimensional, causing representation collapse in existing deep‑learning and graph‑neural‑network methods such as scNAME and scGNN.
scSiameseClu Framework
scSiameseClu is a Siamese clustering framework composed of three modules:
Dual Augmentation Module: adds Gaussian noise to gene expression and applies edge perturbation plus graph diffusion to the cell adjacency matrix, improving robustness and generalization.
Siamese Fusion Module: aligns and merges the augmented gene‑expression and graph embeddings via a cross‑correlation refinement and adaptive information fusion strategy; includes a propagation regularization term constrained by Jensen‑Shannon divergence to mitigate over‑smoothing.
Optimal Transport Clustering: computes similarity between cells and cluster centroids with a Student’s t‑distribution, then refines the predicted distribution using the Sinkhorn algorithm, guaranteeing balanced clusters and avoiding collapse.
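As a rough illustration of the Dual Augmentation Module, the sketch below perturbs an expression matrix and a cell graph with NumPy. The function names, the noise scale, the edge drop rate, and the choice of personalized PageRank as the diffusion operator are all assumptions made for illustration; the summary does not state which diffusion or which hyperparameters the authors use.

```python
import numpy as np

def augment_expression(x, sigma=0.01, rng=None):
    """Add zero-mean Gaussian noise to a cells x genes expression matrix."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)

def perturb_edges(adj, drop_rate=0.1, rng=None):
    """Randomly drop a fraction of edges from a symmetric 0/1 adjacency matrix."""
    rng = np.random.default_rng() if rng is None else rng
    upper = np.triu(adj, k=1)                    # work on each undirected edge once
    keep = rng.random(upper.shape) >= drop_rate  # True where the edge survives
    upper = upper * keep
    return upper + upper.T                       # restore symmetry

def ppr_diffusion(adj, alpha=0.2):
    """Personalized PageRank diffusion (assumed choice): alpha * (I - (1 - alpha) * A_hat)^-1."""
    n = adj.shape[0]
    a = adj + np.eye(n)                          # self-loops keep every degree >= 1
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_hat = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]  # symmetric normalization
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * a_hat)
```

Calling `augment_expression` twice with different random draws, and `perturb_edges` followed by `ppr_diffusion` twice, would yield the two augmented views that a Siamese encoder pair consumes. The dense matrix inverse is only practical for small graphs; a truncated power iteration would be the usual substitute at scale.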
The overall architecture is illustrated below.
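The Optimal Transport Clustering step can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the degrees of freedom `nu`, the iteration count, and the plain Sinkhorn-Knopp scaling toward equal cluster mass are all illustrative choices.

```python
import numpy as np

def student_t_assignments(z, centroids, nu=1.0):
    """Soft assignment q[i, j] of cell i to cluster j using a Student's t kernel."""
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def sinkhorn_balance(q, n_iters=200, eps=1e-12):
    """Rescale q so each row sums to 1 and each cluster receives about n / k total mass."""
    n, k = q.shape
    p = q.copy()
    for _ in range(n_iters):
        p *= (n / k) / (p.sum(axis=0, keepdims=True) + eps)  # equalize cluster mass
        p /= p.sum(axis=1, keepdims=True) + eps              # make each row a distribution
    return p
```

Because every cluster is pushed toward the same total mass, a degenerate solution in which all cells fall into one cluster cannot satisfy the column constraint, which is the intuition behind using optimal transport to avoid collapse.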
Dataset Preparation
Seven real scRNA‑seq datasets (3 mouse, 4 human) were pre‑processed by removing genes expressed in fewer than three cells, normalizing the counts, log‑transforming them (logTPM), and selecting highly variable genes based on mean‑variance thresholds. The datasets span tissues such as retina, lung, liver, kidney, and pancreas.
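The preprocessing steps above can be approximated with the NumPy sketch below. It makes several assumptions: the function name and parameters are hypothetical, normalization is a simple per-cell scaling to one million counts (true TPM would also need gene lengths), and highly variable genes are chosen as the top `n_top_genes` by dispersion rather than by the thresholds the authors applied.

```python
import numpy as np

def preprocess(counts, min_cells=3, n_top_genes=2000, scale=1e6):
    """counts: cells x genes raw count matrix.

    Returns the log-normalized matrix restricted to the selected genes,
    plus the column indices of those genes in the original matrix.
    """
    # 1. Drop genes expressed in fewer than `min_cells` cells.
    expressed = (counts > 0).sum(axis=0) >= min_cells
    gene_idx = np.flatnonzero(expressed)
    x = counts[:, gene_idx].astype(float)

    # 2. Scale each cell to `scale` total counts, then apply log(1 + x).
    lib = x.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0                      # avoid dividing by zero for empty cells
    x = np.log1p(x / lib * scale)

    # 3. Rank genes by dispersion (variance / mean) and keep the most variable.
    mean = np.maximum(x.mean(axis=0), 1e-12)
    dispersion = x.var(axis=0) / mean
    top = np.argsort(dispersion)[::-1][:n_top_genes]
    top.sort()                               # preserve original gene order
    return x[:, top], gene_idx[top]
```

In practice a toolkit such as Scanpy or Seurat would typically handle these steps; the sketch only makes the order of operations explicit.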
Benchmark Evaluation
scSiameseClu was compared against nine SOTA baselines (traditional clustering, deep‑NN, and GNN methods) on the seven datasets using three clustering metrics: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). On all three metrics scSiameseClu achieved the highest scores and showed stable performance across datasets. A visual comparison on a human liver dataset shows clearly separated clusters with sharp boundaries.
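To make two of these metrics concrete, here is a standard-library sketch of ACC and ARI using their usual definitions. The function names are illustrative, and the ACC search is brute force over label permutations, which only suits a small number of clusters; the Hungarian algorithm is the usual choice in practice. NMI is omitted for brevity.

```python
from collections import Counter
from itertools import permutations
from math import comb

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings from predicted clusters to true labels."""
    true_labels = sorted(set(y_true))
    pred_labels = sorted(set(y_pred))
    overlap = Counter(zip(y_pred, y_true))
    # Pad with None so every predicted cluster can be mapped, even if unmatched.
    candidates = true_labels + [None] * max(0, len(pred_labels) - len(true_labels))
    best = 0
    for perm in permutations(candidates, len(pred_labels)):
        hits = sum(overlap[(p, t)] for p, t in zip(pred_labels, perm))
        best = max(best, hits)
    return best / len(y_true)

def adjusted_rand_index(y_true, y_pred):
    """Adjusted Rand Index computed from pair counts in the contingency table."""
    n = len(y_true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(y_true, y_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(y_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = sum_a * sum_b / comb(n, 2)    # pair agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Both functions are invariant to how cluster IDs are numbered, which matters because an unsupervised method has no way to know the label names used in the ground truth.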
Downstream Cell‑type Annotation
On a human pancreas dataset, scSiameseClu’s clusters were annotated with Seurat‑identified marker genes. The top‑50 marker genes from each cluster matched the gold‑standard cell types with >90% similarity, demonstrating accurate cell‑type recovery.
Ablation Study
On the Shekhar mouse retina dataset, removing any one of the three key loss components (the Siamese Fusion Module (SFM) loss, the zero‑inflated negative binomial (ZINB) loss, and the Optimal Transport Clustering (OTC) loss) reduced performance, confirming that each contributes. Further ablation of sub‑components within the Siamese Fusion Module (cell‑related refinement, latent‑related refinement, propagation regularization, reconstruction loss) also caused degradation, indicating that the full combination is necessary for optimal results.
Conclusion
By integrating dual augmentation, siamese fusion, and optimal‑transport clustering, scSiameseClu effectively mitigates representation collapse and preserves cellular heterogeneity, achieving SOTA unsupervised clustering on diverse scRNA‑seq datasets and supporting accurate downstream biological analyses.
The method was presented at IJCAI 2025 and the preprint is available on arXiv.
