
Survey of Video Action Recognition Algorithms: 3D and 2D Convolutional Networks and Pre‑training

This survey reviews video action recognition, comparing 3D convolutional networks, which jointly model spatial‑temporal cues but are computationally heavy, with 2D‑based approaches such as TSM and TIN that embed temporal shifts efficiently. It also emphasizes how large‑scale pre‑training markedly improves performance despite limited labeled data.

NetEase Media Technology Team

Short videos dominate current multimedia traffic, and understanding video content is crucial for data distribution. Action recognition, a key direction of video understanding, requires modeling both spatial semantics and temporal dynamics, making it more complex than image recognition. This article reviews the evolution of video action recognition methods, focusing on two main families: 3D convolutional network architectures and 2D convolutional network architectures, both aiming to capture spatial and temporal information.

3D Convolutional Network Architectures – Historically, 3D CNNs have been a mainstream solution; representative works include C3D, I3D, and SlowFast (the earlier Two‑Stream Network, which feeds RGB frames and optical flow through parallel 2D CNNs, motivated much of this line). By extending convolution to the time dimension, 3D kernels such as those in C3D jointly process spatial and temporal cues. However, the added dimension dramatically increases parameter count and computation, and prevents direct reuse of pretrained 2D weights, which limits performance on small datasets. Subsequent improvements such as P3D and R(2+1)D factorize 3D convolutions into separate spatial and temporal components, reducing cost while preserving accuracy. I3D instead initializes 3D kernels by repeating pretrained 2D kernels along the time axis, achieving better results than random initialization. SlowFast separates the network into a low‑frame‑rate “slow” branch for spatial semantics and a high‑frame‑rate “fast” branch for rapid temporal changes; the fast branch is lightweight (its channel count is a fraction β < 1 of the slow branch’s) and its features are laterally fused into the slow branch to boost performance.
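The I3D-style "inflation" of pretrained 2D weights can be sketched in a few lines of PyTorch. This is a minimal illustration (the function name `inflate_2d_to_3d` is ours, not from any library): the 2D kernel is repeated along a new time axis and divided by the temporal extent, so a video whose frames are all identical produces the same activations as the original image network.

```python
import torch

def inflate_2d_to_3d(w2d: torch.Tensor, time_dim: int = 3) -> torch.Tensor:
    """Inflate a pretrained 2D conv kernel of shape (O, I, H, W) into a
    3D kernel of shape (O, I, T, H, W) by repeating it T times along the
    time axis and dividing by T, preserving activations on
    constant-in-time inputs (the I3D initialization trick)."""
    return w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim

# Example: inflate a 3x3 kernel with 3 input and 4 output channels
w2d = torch.randn(4, 3, 3, 3)
w3d = inflate_2d_to_3d(w2d, time_dim=3)
print(w3d.shape)  # torch.Size([4, 3, 3, 3, 3])
```

Summing the inflated kernel over its time axis recovers the original 2D kernel, which is exactly why the response to a temporally constant input is unchanged.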

2D Convolutional Network Architectures – To alleviate the heavy computation of 3D CNNs, 2D‑based methods model temporal information within a primarily 2D framework. Notable examples are TSM (Temporal Shift Module) and TIN (Temporal Interlacing Network). TSM shifts a portion of feature channels along the temporal axis at each layer, enabling implicit temporal modeling with negligible extra cost. TIN extends this idea by dynamically learning the shift offsets and weighting each shifted channel via shallow offset and weight networks, achieving a flexible temporal receptive field while keeping computational overhead comparable to pure 2D CNNs.
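The core TSM operation is simple enough to sketch directly. The following is an illustrative implementation (the function name and the `shift_div` parameter follow the convention of the TSM paper, where 1/8 of the channels shift in each direction): a slice of channels moves one step backward in time, another slice moves one step forward, and the rest stay in place, all with zero FLOPs beyond memory movement.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """TSM-style temporal shift on a feature map of shape (N, T, C, H, W).

    1/shift_div of the channels are shifted backward in time,
    another 1/shift_div forward; remaining channels are unchanged.
    Vacated time steps are zero-filled."""
    n, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                # shift toward earlier t
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift toward later t
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]            # untouched channels
    return out

# Example: 2 clips, 4 frames, 8 channels, 3x3 spatial features
x = torch.randn(2, 4, 8, 3, 3)
y = temporal_shift(x, shift_div=8)
print(y.shape)  # torch.Size([2, 4, 8, 3, 3])
```

Because each 2D conv after the shift now mixes features from neighboring frames, stacking such layers grows the temporal receptive field without any 3D convolutions; TIN replaces the fixed shift pattern here with learned offsets and per-channel weights.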

Model Pre‑training – Video action recognition also suffers from limited labeled data. Large‑scale weakly supervised pre‑training (e.g., using seed labels mined from social media) has been shown to substantially improve downstream performance. Experiments indicate that larger pre‑training datasets, comprehensive label coverage, and training on longer video clips all contribute positively, whereas overly large label vocabularies may yield diminishing returns. Pre‑training on video clips outperforms image‑level pre‑training for this task.

Conclusion – The survey highlights the trade‑offs between 3D and 2D convolutional approaches and underscores the importance of effective pre‑training. While challenges such as computational cost and data scarcity remain, video action recognition holds great potential for industrial applications, exemplified by its use in generating dynamic video covers for the NetEase News app.

computer vision, Deep Learning, pretraining, 2D convolutional networks, 3D convolutional networks, temporal modeling, video action recognition