
Efficient Spatiotemporal Self‑Attention Transformer (Patch Shift Transformer) for Video Action Recognition

This article introduces the Patch Shift Transformer (PST), a lightweight spatiotemporal self‑attention transformer that achieves competitive video action recognition on Kinetics‑400, Something‑Something v1/v2, and Diving48 without increasing computational cost or parameter count. It details the model's design, experiments, and speed advantages.

DataFunTalk

Efficient spatiotemporal modeling is a core challenge in video understanding. Directly extending the self‑attention of image Transformers to the temporal dimension incurs prohibitive computational and memory costs, while existing factorized approaches (e.g., ViViT, TimeSformer) reduce complexity at the expense of additional parameters.
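To make the cost gap concrete, here is a back‑of‑envelope count of attention‑score entries for joint spatiotemporal attention versus per‑frame spatial attention. The frame and patch counts below are illustrative assumptions, not figures from the paper:

```python
# Rough attention-cost comparison, counting entries of the
# query-key score matrix only. T frames, N patches per frame.
T, N = 8, 196  # e.g. 8 frames of 14x14 patches (assumed sizes)

joint = (T * N) ** 2   # full spatiotemporal attention: all TN tokens attend to all TN tokens
per_frame = T * N ** 2 # spatial-only attention: N tokens attend within each of T frames

# Joint attention is T times more expensive, which is why shift-based
# temporal mixing on top of per-frame attention is attractive.
print(joint // per_frame)  # -> 8
```

The ratio is exactly T, so the gap grows linearly with clip length; shift‑based designs keep the per‑frame cost while still mixing information across time.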

The paper proposes a simple yet effective Patch Shift Transformer (PST). By shifting a subset of patches from one frame to neighboring frames before the self‑attention operation, each frame’s attention simultaneously captures spatial and temporal information without any extra arithmetic; the operation is a pure memory move. The authors also explore a complementary channel‑shift strategy and alternate the two shifts within the network.
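The shift itself can be sketched in a few lines. The sketch below assumes patch embeddings laid out as (frames, patches, channels) and uses hypothetical index sets for which patches are shifted; the actual spatial shift pattern in the paper is a specific design choice not reproduced here:

```python
import numpy as np

def patch_shift(x, back_idx, fwd_idx):
    """Temporal patch shift: a pure memory move, no arithmetic.

    x        : patch embeddings of shape (T, N, C)
               (T frames, N patches per frame, C channels).
    back_idx : patch indices whose tokens are taken from the previous frame.
    fwd_idx  : disjoint patch indices taken from the next frame.
    Boundary frames are zero-padded where no neighbor exists.
    """
    out = x.copy()
    # patches at back_idx come from the previous frame
    out[1:, back_idx] = x[:-1, back_idx]
    out[0, back_idx] = 0
    # patches at fwd_idx come from the next frame
    out[:-1, fwd_idx] = x[1:, fwd_idx]
    out[-1, fwd_idx] = 0
    return out

# Toy example: 3 frames, 4 patches, 1 channel.
x = np.arange(3 * 4 * 1, dtype=float).reshape(3, 4, 1)
y = patch_shift(x, back_idx=[0], fwd_idx=[1])
# Patch 0 of frame 1 now holds patch 0 of frame 0;
# patches not in either index set are untouched.
```

After this rearrangement, ordinary per‑frame 2D self‑attention already attends over tokens from three consecutive frames, which is how PST gains temporal modeling without extra FLOPs or parameters.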

Extensive experiments on Something‑Something v1/v2, Kinetics‑400, and Diving48 demonstrate that PST attains state‑of‑the‑art accuracy while keeping FLOPs and parameter count comparable to 2D Swin. Ablation studies verify the contributions of the patch and channel shifts, and speed measurements show inference latency close to 2D Swin with far stronger spatiotemporal modeling, outperforming Video Swin in both runtime and memory usage. Visualizations illustrate that PST learns motion trajectories aligned with the underlying actions.

In summary, PST offers an efficient way to endow 2D vision Transformers with temporal modeling capability, delivering strong video action recognition results with minimal overhead.

transformer · video action recognition · spatiotemporal modeling · ECCV 2022 · patch shift
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
