Interaction-aware Spatio-Temporal Pyramid Attention Networks for Action Classification

Researchers introduce an Interaction‑aware Spatio‑Temporal Pyramid Attention network that embeds a PCA‑guided loss to capture complementary multi‑scale features, enabling end‑to‑end video action classification with state‑of‑the‑art accuracy on UCF101, HMDB51, Charades and internal datasets.


Researchers from Meitu Cloud Vision and the Institute of Automation, Chinese Academy of Sciences propose a novel self‑attention mechanism that incorporates interaction awareness among local features. By embedding this mechanism into a convolutional neural network (CNN), they build an end‑to‑end architecture for video action classification.

Background

In deep CNNs, adjacent spatial positions in a feature map often have highly correlated channel features because their receptive fields overlap. Conventional self‑attention computes a weight for each local feature independently, ignoring these strong correlations.
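
To make the limitation concrete, here is a minimal sketch of conventional self‑attention over local channel features. The flattened `(N, C)` layout and the single learned score vector `w` are illustrative assumptions, not the paper's exact formulation; the point is that each position's score depends only on its own feature.

```python
import numpy as np

def conventional_attention(features, w):
    """Conventional self-attention: one scalar score per local feature.

    features: (N, C) array of N local channel features (assumed layout).
    w: (C,) learned projection mapping each feature to a score.
    Each score depends only on its own position, so correlations
    between neighbouring positions are ignored.
    """
    scores = features @ w                      # (N,) independent scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over positions
    return weights @ features                  # (C,) attended feature

rng = np.random.default_rng(0)
feats = rng.standard_normal((49, 64))          # e.g. a 7x7 map, 64 channels
w = rng.standard_normal(64)
out = conventional_attention(feats, w)         # -> shape (64,)
```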

The authors draw inspiration from Principal Component Analysis (PCA), which extracts the main components of global features and reduces redundancy. They use PCA to guide the design of a loss function that encourages the attention module to capture complementary information across scales.

Core Idea

The proposed Interaction‑aware Spatio‑Temporal Pyramid Attention layer first downsamples feature maps from different CNN layers to a unified scale using a sampling function R. Attention is then applied to the local channel features at each scale to extract key features. A fusion function aggregates the multi‑scale features, and attention scores weight the aggregated representation. The overall architecture is illustrated in the figure below.
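
The pipeline above can be sketched in a few lines. This is a simplified stand‑in, not the paper's implementation: nearest‑neighbour resizing plays the role of the sampling function R, a per‑scale score vector plays the role of the attention sub‑module, and concatenation stands in for the fusion function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def resample(fmap, size):
    """Sampling function R (sketch): nearest-neighbour resize of an
    (H, W, C) feature map to (size, size, C)."""
    H, W, _ = fmap.shape
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return fmap[ys][:, xs]

def pyramid_attention(fmaps, ws, size=7):
    """Multi-scale pyramid attention sketch. Assumed layout: each fmap
    is (H, W, C); ws[i] is a hypothetical (C,) score vector per scale."""
    attended = []
    for fmap, w in zip(fmaps, ws):
        local = resample(fmap, size).reshape(-1, fmap.shape[-1])  # (size*size, C)
        a = softmax(local @ w)            # attention over this scale's positions
        attended.append(a @ local)        # (C,) key feature for this scale
    return np.concatenate(attended)       # fusion by concatenation (simplified)

rng = np.random.default_rng(0)
fmaps = [rng.standard_normal((14, 14, 32)),   # two CNN layers at
         rng.standard_normal((7, 7, 32))]     # different spatial scales
ws = [rng.standard_normal(32), rng.standard_normal(32)]
fused = pyramid_attention(fmaps, ws)          # -> shape (64,)
```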

The loss function combines a classification loss with a PCA‑based term that penalizes redundancy among local features, constraining the multi‑scale pyramid attention to focus on diverse, complementary information.
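
The following is one illustrative way such a PCA‑inspired redundancy term could look; it is a stand‑in for the paper's exact loss, not a reproduction of it. It penalizes off‑diagonal entries of the Gram matrix of the per‑scale attended features, pushing the scales toward near‑orthogonal (complementary) directions, and adds a standard cross‑entropy term; the 0.1 weight is an arbitrary assumption.

```python
import numpy as np

def redundancy_penalty(F):
    """PCA-inspired decorrelation term (illustrative).

    F: (S, C) matrix, one row per pyramid scale. Penalises pairwise
    cosine similarity between scales, so redundant scales cost more."""
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)  # unit-normalise rows
    G = Fn @ Fn.T                                      # (S, S) cosine similarities
    off_diag = G - np.diag(np.diag(G))                 # zero out self-similarity
    return np.sum(off_diag ** 2)

def total_loss(F, logits, label, lam=0.1):
    """Cross-entropy classification loss plus the weighted redundancy
    term (the weighting scheme here is an assumption)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    ce = -np.log(p[label] + 1e-12)
    return ce + lam * redundancy_penalty(F)
```

Orthogonal per‑scale features incur zero penalty, while identical features are penalized, which is the behaviour the paper's PCA guidance is meant to encourage.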

Because the model’s parameters are independent of the number of input feature maps, it naturally extends to video‑level end‑to‑end training. The final network structure is shown below.
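The frame‑count independence can be checked with a toy attention module: the learned score vector has shape `(C,)`, so the identical parameters handle any number of frames T (the flattened `(T*N, C)` layout is an assumption for this sketch).

```python
import numpy as np

def temporal_attention(video_feats, w):
    """Attention over all spatio-temporal positions. w is (C,), so the
    same parameters work for any number of frames (sketch layout:
    video_feats is (T*N, C) with N local features per frame)."""
    scores = video_feats @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ video_feats                 # (C,) video-level feature

rng = np.random.default_rng(1)
w = rng.standard_normal(64)                # one parameter set...
for T in (8, 16, 32):                      # ...reused across frame counts
    feats = rng.standard_normal((T * 49, 64))
    assert temporal_attention(feats, w).shape == (64,)
```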

Results

The method was evaluated on Meitu's internal video dataset as well as the public benchmarks UCF101, HMDB51, and Charades. It achieved state‑of‑the‑art performance, as illustrated in the following figures.

Additional experiments on untrimmed video inputs demonstrated that the model can handle arbitrary numbers of frames while maintaining high classification accuracy.

The visualizations show precise localization of key actions within video frames.

Outlook

While the current implementation processes multiple sampled frames, its computational cost is relatively high. Future work will focus on reducing time complexity, especially for latency‑sensitive business scenarios, by optimizing core modules and decreasing the number of sampled frames without sacrificing accuracy.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

CNN · Attention Mechanism · action classification · spatio-temporal pyramid
Written by

Meitu Technology

Curating Meitu's technical expertise, valuable case studies, and innovation insights. We deliver quality technical content to foster knowledge sharing between Meitu's tech team and outstanding developers worldwide.
