
Action Sensitivity Learning for Temporal Action Localization

The paper presents Action Sensitivity Learning (ASL), a framework that models frame‑wise importance at both class‑level (via learnable Gaussian distributions) and instance‑level (using quality scores), integrates these weights into classification and regression losses, adds a contrastive InfoNCE term, and achieves state‑of‑the‑art temporal action localization performance across six benchmark datasets.

DaTaobao Tech

The paper, accepted at ICCV 2023, introduces Action Sensitivity Learning (ASL) for Temporal Action Localization (TAL), which incorporates fine-grained frame-level importance into training.

Motivation: Different frames contribute unequally to action recognition and boundary localization; key frames are more informative than transition or blurry frames.

Method: ASL models frame importance (action sensitivity) at both the class and instance levels. Class‑level sensitivity is represented by a learnable Gaussian distribution per action class, with separate distributions for the classification (p_cls) and localization (p_loc) sub‑tasks. Instance‑level sensitivity is derived from a predicted quality score Q, computed from classification confidence and temporal overlap and supervised with an MSE loss. The combined sensitivity weights are inserted into the loss functions: Focal loss for classification and DIoU loss for boundary regression. An additional contrastive loss based on InfoNCE treats sensitivity‑enhanced features as positives and features of other actions or background as negatives.
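The two sensitivity signals can be sketched as follows. This is a minimal illustration in plain Python; the function names, the normalized‑position parameterization, and the multiplicative combination are assumptions for clarity, not the paper's exact formulation:

```python
import math

def class_sensitivity(rel_pos, mu, sigma):
    """Class-level sensitivity: a Gaussian over the frame's normalized
    position rel_pos in [0, 1] within the action instance. In ASL, mu and
    sigma would be learned per action class, with separate distributions
    for classification (p_cls) and localization (p_loc)."""
    return math.exp(-((rel_pos - mu) ** 2) / (2.0 * sigma ** 2))

def frame_weight(rel_pos, mu, sigma, quality):
    """Combine class-level sensitivity with the instance-level quality
    score Q (predicted from classification confidence and temporal
    overlap, supervised with an MSE loss in the paper). Multiplying the
    two signals here is an assumption for illustration."""
    return class_sensitivity(rel_pos, mu, sigma) * quality
```

Intuitively, a boundary-focused Gaussian (mu near 0 or 1) up-weights frames near the start or end of an instance for regression, while a center-focused one (mu near 0.5) favors the discriminative mid-action frames for classification.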

Overall loss: L = L_Focal + L_DIoU + λ · L_contrastive, where λ balances the contrastive term.
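A minimal sketch of how these terms combine, with the contrastive term written as a standard InfoNCE over scalar similarity scores (the temperature and λ values are placeholders, and the paper computes similarities over sensitivity‑enhanced features rather than raw scalars):

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """Standard InfoNCE: pull the positive (sensitivity-enhanced feature
    of the same action) toward the anchor, push away negatives drawn
    from other actions and background."""
    num = math.exp(pos_sim / temperature)
    den = num + sum(math.exp(s / temperature) for s in neg_sims)
    return -math.log(num / den)

def total_loss(l_focal, l_diou, l_contrastive, lam=0.3):
    """Overall objective L = L_Focal + L_DIoU + lam * L_contrastive.
    lam = 0.3 is a hypothetical default, not the paper's value."""
    return l_focal + l_diou + lam * l_contrastive
```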

Experiments: ASL is evaluated on six benchmarks, including MultiThumos, Charades, Ego4D‑Moment Query, Thumos‑14, and ActivityNet, covering densely‑labeled, egocentric, and single‑label settings. It consistently outperforms previous state of the art in average mAP, with notable gains on MultiThumos and Charades (see Table 1) and competitive results on Ego4D, Thumos‑14, and ActivityNet.

Ablation studies (Table 4) show that class‑level and instance‑level modeling each improve performance, and adding the contrastive loss yields the best results.

Conclusion: By learning and exploiting frame‑wise action sensitivity, ASL enhances TAL models and sets new state‑of‑the‑art results, benefiting short‑video content understanding for e‑commerce platforms.

computer vision, deep learning, Action Sensitivity Learning, Temporal Action Localization, video understanding
Written by

DaTaobao Tech

Official account of DaTaobao Technology
