How ACLNet Boosts Skeleton-Based Action Recognition with Affinity Contrastive Learning

ACLNet, an Affinity Contrastive Learning Network introduced by researchers from the Chinese Academy of Sciences, BUPT and Moonshot AI, tackles the ambiguity of skeleton‑based human activity recognition by modeling inter‑class structural similarities and intra‑class margins, achieving state‑of‑the‑art results on NTU‑RGB+D, Kinetics‑Skeleton, FineGYM and other benchmarks.

Background

Skeleton‑based human activity recognition is a hot topic in computer vision because skeleton data are computationally efficient and robust to lighting and background changes. However, skeleton models often struggle to differentiate very similar actions (e.g., reading vs. writing) due to the lack of object and fine‑grained shape information.

Motivation

Existing contrastive learning approaches for skeleton recognition treat all inter‑class pairs equally: they simply pull together samples of the same class and push apart samples of different classes. This leads to two major problems:

Ignoring inter‑class structural commonalities: actions with highly similar motion patterns (e.g., drinking vs. reading) are forced apart, preventing the model from learning subtle discriminative cues.

Intra‑class outliers: variations in camera angle or motion amplitude create hard positive samples that can be confused with negative samples, causing noisy clustering.

ACLNet Overview

The proposed Affinity Contrastive Learning Network (ACLNet) introduces two core strategies: inter‑class affinity contrastive learning and intra‑class marginal contrastive learning.

1. Architecture and Pipeline

Input: A raw skeleton sequence of T frames, J joints, each with C coordinate features.

Backbone & Projection: A Graph Convolutional Network (GCN) extracts spatio‑temporal features, which are then projected into a 256‑dimensional contrastive feature space (sketched in the code after this list).

Output: A classification head predicts action labels, while the affinity contrastive loss shapes the feature distribution.
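
The post does not include code, but the pipeline above maps naturally onto a few lines of PyTorch. The sketch below assumes a generic GCN backbone producing one feature vector per sequence; the class name, the two‑layer projector, and feat_dim are illustrative assumptions, while the 256‑dimensional projection and the dual classification/contrastive heads follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACLNetPipeline(nn.Module):
    """Backbone -> 256-d projection for the contrastive loss, plus a
    classification head, mirroring the pipeline described above.
    (Illustrative sketch; not the authors' reference implementation.)"""

    def __init__(self, backbone: nn.Module, feat_dim: int,
                 num_classes: int, proj_dim: int = 256):
        super().__init__()
        self.backbone = backbone                       # GCN: (N, C, T, J) -> (N, feat_dim)
        self.projector = nn.Sequential(                # maps into the contrastive space
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)                       # spatio-temporal features
        z = F.normalize(self.projector(feats), dim=1)  # unit-norm 256-d embedding
        logits = self.classifier(feats)                # action-label predictions
        return logits, z
```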

2. Motion Family Discovery

ACLNet computes an Affinity Similarity score in two steps:

Direct association: Use the confusion matrix to identify class pairs that are frequently misclassified.

Indirect association: If classes A and B are both often confused with class C, they share a hidden structural similarity.

Classes with high affinity are grouped into a Motion Family. During training, the model performs targeted contrastive optimization within each family, encouraging finer discrimination among similar motions.
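
As a rough illustration of how such an affinity score could be computed, the NumPy sketch below derives the direct term from symmetrized confusion rates and the indirect term from shared confusions, then groups classes into families. The weight alpha, the threshold, and the connected‑components grouping are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def affinity_scores(conf: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Affinity between classes from a confusion matrix (rows = true class).
    `alpha` weights direct vs. indirect association (illustrative assumption)."""
    P = conf / conf.sum(axis=1, keepdims=True)  # row-normalized error rates
    np.fill_diagonal(P, 0.0)                    # keep confusions only
    direct = P + P.T                            # A <-> B directly confused
    indirect = P @ P.T                          # A and B both confused with some C
    return alpha * direct + (1.0 - alpha) * indirect

def motion_families(conf: np.ndarray, thresh: float = 0.1) -> list:
    """Group classes into Motion Families: connected components of the
    thresholded affinity graph (grouping rule is an assumption)."""
    adj = affinity_scores(conf) > thresh
    n, seen, families = adj.shape[0], set(), []
    for start in range(n):
        if start in seen:
            continue
        family, stack = set(), [start]
        while stack:                            # depth-first flood fill
            i = stack.pop()
            if i in family:
                continue
            family.add(i)
            stack.extend(j for j in range(n) if adj[i, j] and j not in family)
        seen |= family
        families.append(sorted(family))
    return families
```

With families in hand, contrastive batches can be drawn within each family, which is what makes the optimization "targeted" in the sense described above.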

3. Family‑Aware Temperature Schedule (FAT)

The temperature τ in the contrastive loss is dynamically adjusted based on family size. Small families use a low temperature (e.g., 0.1) to amplify hard negatives, while large families use a higher temperature to maintain stable clustering. This adaptive schedule acts like “personalized tutoring” for the model.
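
A minimal sketch of such a schedule follows, assuming a simple linear ramp between a low and a high temperature; the 0.1 floor comes from the text above, while tau_max and max_size are illustrative assumptions rather than the paper's exact values.

```python
def family_temperature(family_size: int, tau_min: float = 0.1,
                       tau_max: float = 0.5, max_size: int = 20) -> float:
    """Family-aware temperature: small families get a low tau (sharpens the
    softmax and amplifies hard negatives); large families get a higher tau
    (softer gradients, stabler clusters). The linear ramp and the tau_max /
    max_size defaults are illustrative assumptions."""
    frac = min(family_size, max_size) / max_size  # 0..1 relative family size
    return tau_min + frac * (tau_max - tau_min)
```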

4. Intra‑Class Marginal Contrastive Loss

To handle hard positive samples, ACLNet adds a margin term m that enforces a minimum separation between positive and negative pairs, even when a hard positive looks very similar to a negative. This improves robustness against noisy intra‑class variations.
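
One common way to realize such a margin is an additive‑margin variant of the supervised contrastive loss, sketched below: subtracting m from positive‑pair similarities forces each positive to beat the negatives by at least m before the softmax. This is a plausible instantiation under that assumption, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def marginal_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                              tau: float = 0.1, m: float = 0.2) -> torch.Tensor:
    """Supervised contrastive loss with an additive margin m on positive pairs,
    so hard positives must still beat every negative by a margin.
    z: (N, D) L2-normalized embeddings; labels: (N,) class ids.
    The additive-margin form and the default m are illustrative assumptions."""
    sim = z @ z.T                                    # cosine similarities
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)                            # drop self-pairs
    logits = (sim - m * pos) / tau                   # margin applied to positives only
    off_diag = 1 - torch.eye(len(z), device=z.device)
    denom = (torch.exp(logits) * off_diag).sum(1, keepdim=True)
    log_prob = logits - torch.log(denom)
    # average log-likelihood over each anchor's positives (0 if it has none)
    loss = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```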

Experimental Results

ACLNet was evaluated on six mainstream benchmarks covering action recognition, gait recognition, and person re‑identification:

NTU RGB+D 60/120: X‑Sub accuracies of 93.6% and 90.7%, respectively, setting a new state of the art.

Kinetics‑Skeleton: Top‑1 accuracy of 52.1%, a clear gain over the previous best, DS‑GCN.

FineGYM: 96.0% accuracy, demonstrating exceptional fine‑grained discrimination.

CASIA‑B (gait): 88.5% average accuracy.

Person Re‑ID (N‑N setting): 82.8% accuracy.

All results were reproduced with a single RTX 3090 GPU, and the code is publicly available on GitHub.

Insights from Ablation Studies

1. What Do Motion Families Look Like?

Visualization shows that actions such as “reading” and “wearing a jacket” share similar hand‑arm trajectories, and ACLNet groups them into the same family, enabling the model to focus on subtle differences.

2. Hyper‑Parameter Sensitivity

The margin m and loss weight λ are critical. Experiments indicate that setting λ = 0.1 yields the best trade‑off between discrimination and generalization.

3. Robustness to Missing Limbs

When simulating severe occlusions (e.g., missing both hands), ACLNet retains 79.6% accuracy, far surpassing the classic MS‑G3D baseline (17.1%).

4. Cracking the Hardest Cases: Highly Similar Actions

Classes that traditionally cause large errors—such as “sneezing/coughing”, “reading”, and “typing”—see the biggest performance jumps, thanks to the affinity‑driven separation of their feature clusters.

Conclusion

ACLNet demonstrates that contrastive learning for skeleton data should go beyond a simple "pull‑close / push‑away" paradigm. By incorporating affinity information and explicit margin constraints, the model captures subtle motion nuances far more reliably, offering a strong, extensible baseline for future research in action, gait, and biometric recognition.

Figures: ACLNet architecture diagram · Performance comparison on NTU RGB+D · Kinetics‑Skeleton results · FineGYM performance · Motion family visualization · Hyper‑parameter sensitivity · Robustness to missing limbs · Class‑wise improvement
Tags: graph convolutional network · state-of-the-art · affinity contrastive learning · human activity analysis · skeleton action recognition