
Action Sensitivity Learning for Temporal Action Localization

The paper presents Action Sensitivity Learning (ASL), a framework that models frame‑wise importance at both class‑level (via learnable Gaussian distributions) and instance‑level (using quality scores), integrates these weights into classification and regression losses, adds a contrastive InfoNCE term, and achieves state‑of‑the‑art temporal action localization performance across six benchmark datasets.

DaTaobao Tech

The paper, accepted at ICCV 2023, introduces Action Sensitivity Learning (ASL) for Temporal Action Localization (TAL), which incorporates fine-grained frame-level importance into training.

Motivation: Different frames contribute unequally to action recognition and boundary localization; key frames are more informative than transition or blurry frames.

Method: ASL models frame importance (action sensitivity) at both the class and instance levels. Class‑level sensitivity is represented by a learnable Gaussian distribution per action class, with separate distributions for the classification (p_cls) and localization (p_loc) sub‑tasks. Instance‑level sensitivity is derived from a predicted quality score Q, computed from classification confidence and temporal overlap and supervised with an MSE loss. The combined sensitivity weights are inserted into the loss functions: Focal loss for classification and DIoU loss for boundary regression. An additional contrastive loss based on InfoNCE treats sensitivity‑enhanced features as positives and features of other actions or background as negatives.
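The two sensitivity signals can be sketched as follows. This is a minimal illustration in plain Python; the function names, the normalized‑position parameterization, and the multiplicative combination are assumptions for clarity, not the paper's exact formulation:

```python
import math

def class_sensitivity(rel_pos, mu, sigma):
    """Class-level sensitivity: a Gaussian over the frame's normalized
    position rel_pos in [0, 1] within the action instance. In ASL, mu and
    sigma would be learned per action class, with separate distributions
    for classification (p_cls) and localization (p_loc)."""
    return math.exp(-((rel_pos - mu) ** 2) / (2.0 * sigma ** 2))

def frame_weight(rel_pos, mu, sigma, quality):
    """Combine class-level sensitivity with the instance-level quality
    score Q (predicted from classification confidence and temporal
    overlap, supervised with an MSE loss in the paper). Multiplying the
    two signals here is an assumption for illustration."""
    return class_sensitivity(rel_pos, mu, sigma) * quality
```

Intuitively, a boundary-focused Gaussian (mu near 0 or 1) up-weights frames near the start or end of an instance for regression, while a center-focused one (mu near 0.5) favors the discriminative mid-action frames for classification.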

Overall loss: L = L_Focal + L_DIoU + λ · L_contrastive, where λ balances the contrastive term.
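A minimal sketch of how these terms combine, with the contrastive term written as a standard InfoNCE over scalar similarity scores (the temperature and λ values are placeholders, and the paper computes similarities over sensitivity‑enhanced features rather than raw scalars):

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """Standard InfoNCE: pull the positive (sensitivity-enhanced feature
    of the same action) toward the anchor, push away negatives drawn
    from other actions and background."""
    num = math.exp(pos_sim / temperature)
    den = num + sum(math.exp(s / temperature) for s in neg_sims)
    return -math.log(num / den)

def total_loss(l_focal, l_diou, l_contrastive, lam=0.3):
    """Overall objective L = L_Focal + L_DIoU + lam * L_contrastive.
    lam = 0.3 is a hypothetical default, not the paper's value."""
    return l_focal + l_diou + lam * l_contrastive
```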

Experiments: ASL is evaluated on six benchmarks, including MultiThumos, Charades, Ego4D‑Moment Query, Thumos‑14, and ActivityNet, covering densely‑labeled, egocentric, and single‑label settings. It consistently outperforms previous state of the art in average mAP, with notable gains on MultiThumos and Charades (see Table 1) and competitive results on Ego4D, Thumos‑14, and ActivityNet.

Ablation studies (Table 4) show that class‑level and instance‑level modeling each improve performance, and adding the contrastive loss yields the best results.

Conclusion: By learning and exploiting frame‑wise action sensitivity, ASL enhances TAL models and sets new state‑of‑the‑art results, benefiting short‑video content understanding for e‑commerce platforms.

computer vision, deep learning, Action Sensitivity Learning, Temporal Action Localization, video understanding
Written by

DaTaobao Tech

Official account of DaTaobao Technology
