
scSiameseClu Sets New SOTA on Unsupervised Single‑Cell Clustering Across 7 Datasets

The paper introduces scSiameseClu, a Siamese clustering framework that combines dual augmentation, siamese fusion, and optimal‑transport clustering to overcome representation collapse in scRNA‑seq data, and demonstrates state‑of‑the‑art performance on seven diverse single‑cell datasets and downstream annotation tasks.

HyperAI Super Neural

Sep 15, 2025


Background and Motivation

Traditional bulk RNA‑Seq averages gene expression across cells, masking rare cell types. Single‑cell RNA‑Seq (scRNA‑seq) captures full transcriptomes of individual cells, but the data are noisy, sparse, and high‑dimensional, causing representation collapse in existing deep‑learning and graph‑neural‑network methods such as scNAME and scGNN.

scSiameseClu Framework

scSiameseClu is a Siamese clustering framework composed of three modules:

Dual Augmentation Module: adds Gaussian noise to gene expression and applies edge perturbation plus graph diffusion to the cell adjacency matrix, improving robustness and generalization.

Siamese Fusion Module: aligns and merges the augmented gene‑expression and graph embeddings via a cross‑correlation refinement and adaptive information fusion strategy; includes a propagation regularization term constrained by Jensen‑Shannon divergence to mitigate over‑smoothing.

Optimal Transport Clustering: computes similarity between cells and cluster centroids with a Student’s t‑distribution, then refines the predicted distribution using the Sinkhorn algorithm, guaranteeing balanced clusters and avoiding collapse.
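To make the optimal-transport step more concrete, here is a minimal NumPy sketch of a Student's t soft assignment followed by Sinkhorn balancing. It is an illustration under stated assumptions, not the authors' implementation: the function names, the kernel construction from the soft assignments, the uniform cluster marginal, and the parameters alpha, eps, and n_iter are all hypothetical choices made for this example.

```python
import numpy as np


def student_t_assignment(z, centroids, alpha=1.0):
    """Soft assignment q[i, j] of cell i to cluster j with a Student's t kernel."""
    # Squared Euclidean distance between each embedding and each centroid.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def sinkhorn_balance(q, n_iter=50, eps=0.1):
    """Rescale q so that every cluster receives roughly the same total mass."""
    n, k = q.shape
    # Entropic-OT kernel built from the soft assignments (assumption: cost is
    # -log q). eps is the regularization strength; smaller eps gives sharper
    # assignments.
    log_kernel = np.log(q + 1e-12) / eps
    p = np.exp(log_kernel - log_kernel.max())
    for _ in range(n_iter):
        p *= (1.0 / k) / p.sum(axis=0, keepdims=True)  # column sums -> 1/k
        p *= (1.0 / n) / p.sum(axis=1, keepdims=True)  # row sums -> 1/n
    return p * n  # each row sums to 1 again


# Toy usage on random embeddings (not real scRNA-seq data).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 16))
centroids = z[rng.choice(200, size=5, replace=False)]
p = sinkhorn_balance(student_t_assignment(z, centroids))
print(p.sum(axis=0))  # per-cluster mass; approaches 200 / 5 as the iterations converge
```

The alternating row and column rescaling is what keeps any single cluster from absorbing all cells: a column that starts with too much mass is scaled down at every iteration, which is the balancing effect the module relies on to avoid collapse.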

The overall architecture is illustrated below.

Dataset Preparation

Seven real scRNA‑seq datasets (3 mouse, 4 human) were pre‑processed in four steps: genes expressed in fewer than three cells were removed, expression values were normalized and log‑transformed (log‑TPM), and highly variable genes were selected using mean and variance thresholds. The datasets span tissues such as retina, lung, liver, kidney, and pancreas.
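The preprocessing steps can be sketched in plain NumPy as follows. This is a simplified illustration rather than the paper's pipeline: the function name, the per-cell scaling target of one million, and the top-N dispersion ranking (used here as a stand-in for the mean-variance thresholds, whose exact values are not given in this article) are assumptions.

```python
import numpy as np


def preprocess(counts, min_cells=3, target_sum=1e6, n_top_genes=2000):
    """counts: (cells x genes) array of raw, non-negative expression counts."""
    counts = np.asarray(counts, dtype=float)

    # 1. Drop genes detected (count > 0) in fewer than `min_cells` cells.
    keep = (counts > 0).sum(axis=0) >= min_cells
    x = counts[:, keep]

    # 2. Library-size normalization: scale every cell to `target_sum` in total.
    lib = x.sum(axis=1, keepdims=True)
    x = x / np.where(lib > 0, lib, 1.0) * target_sum

    # 3. Log-transform, log(1 + x), to compress the dynamic range.
    x = np.log1p(x)

    # 4. Keep the most variable genes, ranked by dispersion (variance / mean).
    #    Assumption: a top-N ranking replaces explicit thresholds.
    mean = x.mean(axis=0)
    dispersion = x.var(axis=0) / np.maximum(mean, 1e-12)
    top = np.sort(np.argsort(dispersion)[::-1][:n_top_genes])

    # Return the filtered matrix and the original column index of each kept gene.
    return x[:, top], np.flatnonzero(keep)[top]
```

In practice a toolkit such as Scanpy or Seurat would typically handle these steps; the sketch only shows the order of operations and what each one does to the matrix.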

Benchmark Evaluation

scSiameseClu was compared against nine SOTA baselines (traditional clustering, deep‑NN, and GNN methods) on the seven datasets using three clustering metrics: Accuracy (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). In all three metrics scSiameseClu achieved the highest scores and showed stable performance across datasets. Visual comparison on a human liver dataset reveals clearly separated clusters with sharp boundaries.
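Of the three metrics, ACC is the least obvious to compute because cluster IDs are arbitrary and must first be matched to the ground-truth labels. A common way to do this, assumed here and not confirmed as the paper's exact procedure, is to find the best one-to-one mapping with the Hungarian algorithm. The helper name `clustering_accuracy` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster IDs to labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_ids, t = np.unique(y_true, return_inverse=True)
    pred_ids, p = np.unique(y_pred, return_inverse=True)

    # Contingency table: w[i, j] = cells in predicted cluster i with true label j.
    w = np.zeros((pred_ids.size, true_ids.size), dtype=np.int64)
    np.add.at(w, (p, t), 1)

    # Hungarian algorithm: pick the mapping that maximizes matched cells.
    rows, cols = linear_sum_assignment(w, maximize=True)
    return w[rows, cols].sum() / y_true.size


# Same partition with permuted cluster IDs still scores as a perfect match.
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```

NMI and ARI are invariant to label permutation and need no matching step; scikit-learn provides them as `normalized_mutual_info_score` and `adjusted_rand_score` in `sklearn.metrics`.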

Visualization of scSiameseClu vs baselines on human liver data

Downstream Cell‑type Annotation

On a human pancreas dataset, scSiameseClu’s clusters were annotated with Seurat‑identified marker genes. The top‑50 marker genes from each cluster matched the gold‑standard cell types with >90 % similarity, demonstrating accurate cell‑type recovery.
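One simple way to quantify how well a cluster's markers match a reference cell type is the fraction of its top-N markers that also appear in the reference list. The article does not say which similarity measure the authors used, so the overlap fraction below, the function names, and the toy gene lists are all assumptions for illustration.

```python
def marker_overlap(cluster_markers, reference_markers, top_n=50):
    """Fraction of each cluster's top-N markers found in each reference list."""
    overlaps = {}
    for cluster, genes in cluster_markers.items():
        top = set(genes[:top_n])
        overlaps[cluster] = {
            cell_type: len(top & set(ref)) / max(len(top), 1)
            for cell_type, ref in reference_markers.items()
        }
    return overlaps


def best_match(overlaps):
    """Assign each cluster the reference cell type with the highest overlap."""
    return {c: max(scores, key=scores.get) for c, scores in overlaps.items()}


# Toy input: short hand-written lists, not results from the paper.
clusters = {"c0": ["INS", "IAPP", "MAFA"], "c1": ["GCG", "TTR", "ARX"]}
reference = {"beta": ["INS", "IAPP", "MAFA", "NKX6-1"], "alpha": ["GCG", "ARX", "IRX2"]}
print(best_match(marker_overlap(clusters, reference)))
```

With real data, the per-cluster lists would come from a differential-expression test (for example, Seurat's marker detection), ranked so that the first 50 entries are the strongest markers.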

Overlap of differential genes with gold‑standard cell types

Ablation Study

On the Shekhar mouse retina dataset, removing any one of the three key loss terms (the Siamese Fusion Module loss, SFM; the zero‑inflated negative binomial loss, ZINB; and the Optimal Transport Clustering loss, OTC) reduced performance, confirming that each contributes. Further ablation of sub‑components within the Siamese Fusion Module (cell‑related refinement, latent‑related refinement, propagation regularization, reconstruction loss) also degraded results, indicating that the full combination is needed for the best performance.

Ablation results on the Shekhar retina dataset

Detailed component ablation on the Shekhar retina dataset

Conclusion

By integrating dual augmentation, siamese fusion, and optimal‑transport clustering, scSiameseClu effectively mitigates representation collapse and preserves cellular heterogeneity, achieving SOTA unsupervised clustering on diverse scRNA‑seq datasets and supporting accurate downstream biological analyses.

The method was presented at IJCAI 2025 and the preprint is available on arXiv.
