scSiameseClu Sets New SOTA on Unsupervised Single‑Cell Clustering Across 7 Datasets

The paper introduces scSiameseClu, a Siamese clustering framework that combines dual augmentation, siamese fusion, and optimal‑transport clustering to overcome representation collapse in scRNA‑seq data, and demonstrates state‑of‑the‑art performance on seven diverse single‑cell datasets and downstream annotation tasks.

HyperAI Super Neural

Background and Motivation

Traditional bulk RNA‑Seq averages gene expression across cells, masking rare cell types. Single‑cell RNA‑Seq (scRNA‑seq) captures the transcriptome of each individual cell, but the data are noisy, sparse, and high‑dimensional. Under these conditions, existing deep‑learning and graph‑neural‑network methods such as scNAME and scGNN are prone to representation collapse, in which the embeddings of different cells become nearly indistinguishable.

scSiameseClu Framework

scSiameseClu is a Siamese clustering framework composed of three modules:

Dual Augmentation Module: adds Gaussian noise to the gene‑expression matrix and applies edge perturbation plus graph diffusion to the cell adjacency matrix, improving robustness and generalization.

Siamese Fusion Module: aligns and merges the augmented gene‑expression and graph embeddings through cross‑correlation refinement and an adaptive information‑fusion strategy. It also includes a propagation regularization term, constrained by Jensen‑Shannon divergence, to mitigate over‑smoothing.

Optimal Transport Clustering: computes the similarity between cells and cluster centroids with a Student’s t‑distribution, then refines the predicted distribution using the Sinkhorn algorithm, which enforces balanced cluster assignments and helps avoid collapse.
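The optimal‑transport step can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names, the `alpha` degrees‑of‑freedom parameter, the uniform target marginals, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def student_t_similarity(z, centroids, alpha=1.0):
    """Soft assignment q[i, j] of cell i to centroid j via a Student's t kernel."""
    # Squared Euclidean distance between every embedding and every centroid.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so it is a probability distribution over clusters.
    return q / q.sum(axis=1, keepdims=True)

def sinkhorn_balance(q, n_iters=100):
    """Sinkhorn scaling toward uniform marginals (assumed target, not from the paper).

    Alternately rescales columns and rows so that each cluster receives
    about 1/K of the total mass and each cell carries exactly 1/N of it.
    """
    n, k = q.shape
    p = q / q.sum()
    for _ in range(n_iters):
        p *= (1.0 / k) / p.sum(axis=0, keepdims=True)  # column sums -> 1/K
        p *= (1.0 / n) / p.sum(axis=1, keepdims=True)  # row sums -> 1/N
    return p * n  # rescale so each row sums to 1 again

# Toy example with random embeddings and centroids (illustrative data only).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 4))
centers = rng.normal(size=(3, 4))
q_soft = student_t_similarity(embeddings, centers)
p_balanced = sinkhorn_balance(q_soft, n_iters=200)
```

In this sketch, the column rescaling is what prevents a single cluster from absorbing most of the cells: after the iterations converge, every column of `p_balanced` sums to roughly N/K.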

The overall architecture is illustrated below.

scSiameseClu architecture overview

Dataset Preparation

Seven real scRNA‑seq datasets (3 mouse, 4 human) were pre‑processed by removing genes expressed in fewer than three cells, normalizing the counts, applying a log transform (log TPM), and selecting highly variable genes based on mean‑variance thresholds. The datasets span tissues such as retina, lung, liver, kidney, and pancreas.
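A minimal NumPy sketch of this kind of pre‑processing pipeline is shown below. Only the "fewer than three cells" filter comes from the description above; the function name, the target sum, the use of dispersion (variance divided by mean) as the variability measure, and every threshold value are placeholders assumed for illustration.

```python
import numpy as np

def preprocess_counts(counts, min_cells=3, target_sum=1e4,
                      min_mean=0.0125, max_mean=3.0, min_disp=0.5):
    """Filter, normalize, log-transform, and select variable genes.

    counts: (cells x genes) array of raw, non-negative counts.
    Returns the processed matrix and the original indices of the kept genes.
    All threshold defaults are illustrative assumptions.
    """
    counts = np.asarray(counts, dtype=float)

    # 1. Drop genes expressed (count > 0) in fewer than `min_cells` cells.
    expressed = (counts > 0).sum(axis=0) >= min_cells
    x = counts[:, expressed]

    # 2. Scale every cell to the same total count.
    totals = x.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    x = x / totals * target_sum

    # 3. Log-transform: log(1 + x) keeps zeros at zero.
    x = np.log1p(x)

    # 4. Keep genes whose mean and dispersion pass the thresholds.
    mean = x.mean(axis=0)
    dispersion = x.var(axis=0) / np.maximum(mean, 1e-12)
    variable = (mean >= min_mean) & (mean <= max_mean) & (dispersion >= min_disp)

    gene_idx = np.flatnonzero(expressed)[variable]
    return x[:, variable], gene_idx
```

In practice, single‑cell toolkits provide tested versions of each of these steps, so a hand‑rolled function like this is mainly useful for seeing what the pipeline does to the matrix.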

Overview of the 7 scRNA‑seq datasets

Benchmark Evaluation

scSiameseClu was compared against nine SOTA baselines (traditional clustering, deep‑NN, and GNN methods) on the seven datasets using three clustering metrics: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). On all three metrics scSiameseClu achieved the highest scores and showed stable performance across datasets. A visual comparison on a human liver dataset shows that scSiameseClu produces clearly separated clusters with sharp boundaries.
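Of the three metrics, ACC is the least obvious to compute, because cluster IDs are arbitrary and must first be matched to the ground‑truth labels. A common way to do this, sketched below, is to find the best one‑to‑one mapping with the Hungarian algorithm. The article does not say how the authors computed ACC, so the function name and this matching approach are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Re-index labels to 0..K-1 so they can be used as table indices.
    true_ids, t = np.unique(y_true, return_inverse=True)
    pred_ids, p = np.unique(y_pred, return_inverse=True)

    # Contingency table: rows = predicted clusters, columns = true classes.
    table = np.zeros((pred_ids.size, true_ids.size), dtype=np.int64)
    np.add.at(table, (p, t), 1)

    # Hungarian algorithm: pick the mapping that maximizes matched cells.
    rows, cols = linear_sum_assignment(table, maximize=True)
    return table[rows, cols].sum() / y_true.size
```

NMI and ARI do not depend on how clusters are labelled and need no matching step; scikit‑learn exposes them as `normalized_mutual_info_score` and `adjusted_rand_score` in `sklearn.metrics`.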

Visualization of scSiameseClu vs baselines on human liver data

Downstream Cell‑type Annotation

On a human pancreas dataset, the clusters produced by scSiameseClu were annotated using marker genes identified with Seurat. The top 50 marker genes of each cluster matched the gold‑standard cell types with more than 90% similarity, indicating accurate cell‑type recovery.
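The article does not specify which similarity measure was used. One simple possibility, assumed here purely for illustration, is the fraction of a cluster's top‑ranked markers that also appear in a reference marker set for each cell type. The function name and the gene lists in the usage example are hypothetical.

```python
def marker_overlap(cluster_markers, reference_markers, top_n=50):
    """Score a cluster against each reference cell type.

    cluster_markers: genes ranked from strongest to weakest marker.
    reference_markers: dict mapping cell type -> set of known marker genes.
    Returns (best_cell_type, scores), where each score is the fraction of
    the cluster's top-N markers found in that cell type's reference set.
    """
    top = list(cluster_markers)[:top_n]
    if not top:
        raise ValueError("cluster_markers is empty")
    scores = {
        cell_type: sum(gene in genes for gene in top) / len(top)
        for cell_type, genes in reference_markers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative input only; not data from the paper.
reference = {"beta": {"INS", "IAPP", "MAFA"}, "alpha": {"GCG", "TTR"}}
best_type, overlap = marker_overlap(["INS", "IAPP", "MAFA", "GCG"], reference)
```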

Overlap of differential genes with gold‑standard cell types

Ablation Study

On the Shekhar mouse retina dataset, removing each of the three key components (SFM loss, ZINB loss, OTC loss) reduced performance, confirming their contributions. Further ablation of sub‑components within the Siamese Fusion Module (cell‑related refinement, latent‑related refinement, propagation regularization, reconstruction loss) also caused degradation, indicating that the full combination is necessary for optimal results.

Ablation results on the Shekhar retina dataset
Detailed component ablation on the Shekhar retina dataset

Conclusion

By integrating dual augmentation, siamese fusion, and optimal‑transport clustering, scSiameseClu effectively mitigates representation collapse and preserves cellular heterogeneity, achieving SOTA unsupervised clustering on diverse scRNA‑seq datasets and supporting accurate downstream biological analyses.

The method was presented at IJCAI 2025 and the preprint is available on arXiv.
