Efficient Spatiotemporal Self‑Attention Transformer (Patch Shift Transformer) for Video Action Recognition
This article introduces Patch Shift Transformer (PST), a lightweight spatiotemporal self‑attention transformer that achieves competitive video action recognition on Kinetics‑400, Something‑Something V1/V2 (Sth‑v1/v2), and Diving48 without increasing computational cost or parameter count, and covers its design, experiments, and speed advantages.
Efficient spatiotemporal modeling is a core challenge in video understanding: directly extending the self‑attention of image Transformers to the temporal dimension incurs prohibitive computational and memory costs, and existing factorized approaches (e.g., ViViT, TimeSformer) reduce complexity at the expense of additional parameters.
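To see why joint spatiotemporal attention is prohibitive, a back‑of‑the‑envelope comparison helps. Self‑attention cost scales with the square of the sequence length, so attending jointly over all frames costs a factor of T more than attending within each frame separately. The concrete numbers below are illustrative assumptions, not figures from the paper:

```python
# Rough cost comparison for self-attention, whose FLOPs scale as
# (sequence length)^2 * embedding dim C.
T, N, C = 8, 196, 768   # e.g., 8 frames of 14x14 patches, ViT-Base width

joint = (T * N) ** 2 * C      # full spatiotemporal attention over T*N tokens
per_frame = T * N ** 2 * C    # spatial-only attention, run frame by frame

print(joint // per_frame)     # ratio is exactly T, i.e. 8 here
```

Per‑frame attention is what PST pays, since patch shifting adds no arithmetic on top of it.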
The paper proposes a simple yet effective Patch Shift Transformer (PST). By shifting a subset of patches from one frame to neighboring frames before the self‑attention operation, each frame’s attention simultaneously captures spatial and temporal information without any extra arithmetic; the operation is a pure memory move. The authors also explore a complementary channel‑shift strategy and alternate the two shifts within the network.
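The core patch‑shift idea can be sketched in a few lines. Below is a minimal NumPy illustration, not the paper's implementation: it splits patch tokens into three groups by index, borrows one group from the previous frame and one from the next, and leaves the rest in place. The actual PST selects patches by a spatial pattern within each frame rather than by a contiguous index split, and the `shift_ratio` value here is an assumption:

```python
import numpy as np

def patch_shift(x, shift_ratio=0.25):
    """Illustrative patch shift over a token tensor x of shape (T, N, C):
    T frames, N patch tokens per frame, C channels.

    A fraction of tokens is replaced by the corresponding tokens from
    neighboring frames, so subsequent per-frame spatial attention also
    mixes temporal information. The shift is a pure memory move.
    """
    T, N, C = x.shape
    n = int(N * shift_ratio)
    out = x.copy()
    # tokens [0, n): taken from the previous frame
    out[:, :n] = np.roll(x[:, :n], shift=1, axis=0)
    # tokens [n, 2n): taken from the next frame
    out[:, n:2 * n] = np.roll(x[:, n:2 * n], shift=-1, axis=0)
    # remaining tokens stay in the current frame
    return out
```

After this shuffle, an ordinary 2D attention block over each frame's N tokens sees a mix of past, present, and future patches, which is how PST gains temporal modeling at zero extra FLOPs.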
Extensive experiments on Sth‑v1, Sth‑v2, Kinetics‑400, and Diving48 demonstrate that PST attains state‑of‑the‑art accuracy while keeping the FLOPs and parameter count comparable to 2D Swin. Ablation studies verify the contribution of patch and channel shifts, and speed measurements show inference latency close to 2D Swin but with superior spatiotemporal modeling, outperforming Video‑Swin in both runtime and memory usage. Visualizations illustrate that PST learns motion trajectories aligned with the underlying actions.
In summary, PST offers an efficient way to endow 2D vision Transformers with temporal modeling capability, delivering strong video action recognition results with minimal overhead.
DataFunTalk