
How Dynamic Scale Selection Boosts Real-Time Action Prediction

This article explains online action prediction and the challenges of early‑stage classification, then introduces a Scale Selection Network that dynamically chooses the temporal window at each time step. The network combines dilated convolutions with regression and classification sub‑networks and achieves state‑of‑the‑art results on two benchmark datasets.

Alibaba Cloud Developer

Jul 6, 2018


Online action prediction aims to classify an action before it has been fully performed, using only the part of the video observed so far. A practical method must be fast enough for real‑time use, work when only a small portion of the action has been seen (e.g., the first 10%), and handle unsegmented videos that may contain multiple action instances.

Traditional sliding‑window approaches either use a fixed window size or scan multiple scales repeatedly, which is inefficient for online prediction. A fixed window is also suboptimal because early stages of an action require a small window to avoid noise from previous actions, while later stages benefit from a larger window to cover more of the ongoing action.

The paper proposes a Scale Selection Network (SSNet) that dynamically selects the most appropriate temporal window at each time step. The network consists of three main components:

1. A temporal 1‑D convolutional backbone built with dilated convolutions, whose layers have hierarchical receptive fields (e.g., temporal ranges of 2, 4, 8, …).

2. A scale regression sub‑network that aggregates features from all convolutional layers and feeds them into a fully connected layer to estimate s, the temporal distance from the current frame back to the start of the action. This distance measures how much of the action has been observed so far and determines the suitable window scale.

3. A classification sub‑network that selects the convolutional layer whose receptive field best matches the estimated scale s, aggregates information from that layer and all lower layers (via skip connections), and feeds the combined features into another fully connected layer to predict the action class c.
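To make the backbone concrete, here is a minimal sketch of a stack of causal dilated 1‑D convolutions in plain Python. It is only an illustration under stated assumptions: the kernel size of 2, the doubling dilations, the scalar input signal, and the fixed weights (1.0, 1.0) are all chosen for readability, and the function names `dilated_conv1d` and `backbone` are hypothetical. A real network would learn multi‑channel filters and apply nonlinearities.

```python
def dilated_conv1d(x, weights, dilation):
    """Causal 1-D convolution with kernel size 2 over a list of floats.

    The output at time t mixes x[t] and x[t - dilation]; positions before
    the start of the sequence are treated as zero (left padding).
    """
    w_past, w_now = weights
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        out.append(w_past * past + w_now * x[t])
    return out


def backbone(x, num_layers):
    """Stack layers with dilations 1, 2, 4, ... and keep every layer's output."""
    outputs = []
    h = x
    for layer in range(num_layers):
        # Illustrative fixed weights; a trained model would learn these.
        h = dilated_conv1d(h, weights=(1.0, 1.0), dilation=2 ** layer)
        outputs.append(h)
    return outputs


# Feed an impulse at frame 0 and see how many later frames it influences
# at each layer: this is that layer's temporal range.
impulse = [1.0] + [0.0] * 7
layers = backbone(impulse, num_layers=3)
for depth, h in enumerate(layers):
    print(depth, sum(1 for v in h if v != 0.0))  # prints 0 2, then 1 4, then 2 8
```

Because every layer's output is kept, later stages can read features at any of the available temporal ranges without running the video through the network again at a different window size.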

The entire architecture is trained end‑to‑end, allowing the network to regress the optimal scale and predict the action class simultaneously.
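One common way to train two such outputs together is to add a classification loss and a regression loss into a single objective. The sketch below assumes cross‑entropy for the class and squared error for the scale, combined with a hypothetical weight `reg_weight`; the article does not state which losses or weighting the authors actually use, so treat this purely as an illustration of the idea.

```python
import math


def joint_loss(class_probs, true_class, pred_scale, true_scale, reg_weight=1.0):
    """Cross-entropy on the action class plus weighted squared error on the scale.

    class_probs: predicted probabilities over classes (should sum to 1).
    reg_weight:  assumed trade-off between the two terms (hypothetical).
    """
    cls_loss = -math.log(class_probs[true_class])
    reg_loss = (pred_scale - true_scale) ** 2
    return cls_loss + reg_weight * reg_loss


# A confident, correct class with a scale estimate that is off by 2 frames:
loss = joint_loss([0.9, 0.1], true_class=0, pred_scale=10.0, true_scale=8.0)
```

Minimizing a sum like this sends gradients from both sub‑networks into the shared convolutional layers, which is what lets scale regression and classification be learned at the same time.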

The dilated convolution design yields multiple receptive fields, enabling the network to adapt its temporal window dynamically as the action progresses.
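The growth of the receptive field can be checked with a little arithmetic. The sketch below assumes a kernel size of 2 and dilations that double at every layer (1, 2, 4, ...), which reproduces the 2, 4, 8, … progression mentioned above; the helper name `receptive_fields` is hypothetical.

```python
def receptive_fields(num_layers, kernel_size=2):
    """Receptive field, in frames, of each layer in a stack of dilated
    1-D convolutions whose dilation doubles per layer (1, 2, 4, ...)."""
    fields = []
    rf = 1  # a single frame before any convolution is applied
    for layer in range(num_layers):
        dilation = 2 ** layer
        # Each layer extends the window by (kernel_size - 1) * dilation frames.
        rf += (kernel_size - 1) * dilation
        fields.append(rf)
    return fields


print(receptive_fields(5))  # [2, 4, 8, 16, 32]
```

Under these assumptions the window doubles at every layer, so a handful of layers is enough to cover both very short and fairly long temporal windows.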

The scale regression sub‑network predicts s, which is then used to locate the most suitable convolutional layer. The classification sub‑network combines features from that layer and its predecessors to improve convergence and accuracy.
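The selection and aggregation steps can be sketched as follows. Two details here are assumptions rather than facts from the article: "best matches" is interpreted as the smallest absolute difference between a layer's receptive field and s, and the aggregation is a plain element‑wise sum of per‑layer feature vectors. The names `select_layer` and `aggregate` are hypothetical.

```python
def select_layer(scale, fields):
    """Index of the layer whose receptive field is closest to `scale`.

    Ties go to the lower (smaller-window) layer.
    """
    return min(range(len(fields)), key=lambda i: abs(fields[i] - scale))


def aggregate(layer_features, layer_idx):
    """Element-wise sum of the feature vectors of layers 0..layer_idx."""
    selected = layer_features[: layer_idx + 1]
    return [sum(values) for values in zip(*selected)]


fields = [2, 4, 8, 16, 32]           # receptive field of each layer
features = [[1.0, 0.0], [0.5, 0.5],  # toy 2-D feature vector per layer
            [0.0, 2.0], [3.0, 3.0], [9.0, 9.0]]

idx = select_layer(scale=7.0, fields=fields)  # picks the layer covering 8 frames
combined = aggregate(features, idx)           # uses layers 0, 1 and 2 only
print(idx, combined)  # 2 [1.5, 2.5]
```

Layers above the selected one are left out, so frames that precede the estimated start of the action contribute nothing through the wider windows.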

Experiments on two public datasets show that SSNet outperforms existing methods such as FS‑Net (fixed scale), ST‑LSTM, Attention Net, and JCR‑RNN, and its accuracy approaches that of a version using ground‑truth scales (SSNet‑GT).


