How Alibaba Cloud’s MMAI Team Dominated CVPR2021 Video Action Challenges

Alibaba Cloud’s Multimedia AI team won five first‑place titles and one runner‑up across six major video‑action challenges at CVPR2021, showcasing advanced transformer‑CNN hybrids, self‑supervised initialization, and spatio‑temporal relation modeling that now power their multimedia AI cloud products.

Alibaba Cloud Developer

CVPR2021 Highlights

CVPR2021 was held online from June 19‑25. Alibaba Cloud’s Multimedia AI (MMAI) team entered six challenge tracks across ActivityNet, AVA‑Kinetics, HACS, and EPIC‑Kitchens, and took five first‑place finishes and one second‑place finish, including back‑to‑back titles in ActivityNet and HACS.

Challenge Details and Results

ActivityNet – a large‑scale temporal action detection benchmark started in 2016. MMAI achieved an average mAP of 44.67 % and won the championship.

AVA‑Kinetics – focuses on spatio‑temporal atomic action localization. MMAI obtained a 40.67 % mAP, ranking first.

HACS – the largest temporal action detection challenge to date. MMAI won both the fully‑supervised and weakly‑supervised tracks with average mAPs of 44.67 % and 22.45 % respectively.

EPIC‑Kitchens – first‑person action understanding. MMAI achieved 16.11 % mAP and 48.5 % accuracy, earning a championship and a runner‑up.
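
The "average mAP" figures quoted for the temporal action detection tracks rest on temporal IoU (tIoU) between predicted and ground-truth segments. The sketch below shows how tIoU is computed and how per-threshold mAP values are averaged. It assumes the threshold grid 0.50 to 0.95 in steps of 0.05 that is commonly used for ActivityNet-style evaluation; the function names are illustrative and this is not the official evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Assumed threshold grid: 0.50, 0.55, ..., 0.95.
TIOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(pred, gt):
    """Thresholds at which `pred` overlaps `gt` enough to count as a match."""
    iou = temporal_iou(pred, gt)
    return [t for t in TIOU_THRESHOLDS if iou >= t]


def average_map(map_per_threshold):
    """Average mAP: the mean of the mAP values computed at each threshold."""
    return sum(map_per_threshold) / len(map_per_threshold)


if __name__ == "__main__":
    prediction, ground_truth = (12.0, 31.0), (10.0, 28.0)
    print(temporal_iou(prediction, ground_truth))
    print(thresholds_matched(prediction, ground_truth))
```

A detection that is only loosely aligned with the ground truth counts at the low thresholds but not the high ones, which is why average mAP rewards precise boundaries.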

Key Technical Innovations (EMC² Framework)

(1) Optimized Base Networks : An extensive study of Video Transformers (ViViT) and of combining Transformers with CNN backbones (SlowFast, CSN) produced stronger base models, reaching 48.5 % on EPIC‑Kitchens, 93.6 % on ActivityNet, and 96.1 % on HACS.

(2) Spatio‑Temporal Relation Modeling : Localization of humans and objects, extraction of their features, and integration with global video features using Transformer‑based relation modules, enabling robust interaction recognition.
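
The relation module in item (2) can be pictured as an actor feature attending over object features and a global video feature. The following is a minimal sketch of that idea using single-head scaled dot-product attention with no learned projections; the function `relate`, the toy feature vectors, and the residual fusion are assumptions made for illustration, not the team's actual architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relate(actor, context):
    """Fuse one actor feature with context features (objects, global video).

    The actor vector is the query; the context vectors serve as keys and
    values. Returns the fused feature and the attention weights.
    """
    scale = math.sqrt(len(actor))
    weights = softmax([dot(actor, c) / scale for c in context])
    attended = [
        sum(w * c[i] for w, c in zip(weights, context))
        for i in range(len(actor))
    ]
    # Residual connection keeps the actor's own appearance information.
    fused = [a + v for a, v in zip(actor, attended)]
    return fused, weights


if __name__ == "__main__":
    actor = [1.0, 0.0]                      # hypothetical person feature
    context = [
        [1.0, 0.0],                         # hypothetical object feature A
        [0.0, 1.0],                         # hypothetical object feature B
        [0.5, 0.5],                         # hypothetical global video feature
    ]
    fused, weights = relate(actor, context)
    print(weights)
    print(fused)
```

In a real model the queries, keys, and values would pass through learned linear layers and several attention heads; the sketch keeps only the weighting step that lets an actor focus on the most relevant object.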

(3) Action‑Proposal Relation Encoding : Dense action proposals are generated first, then encoded with self‑attention across proposals so that each one carries global temporal context; this contributed to the ActivityNet victory.
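
The two stages in item (3) can be sketched as follows: enumerate every candidate (start, end) span on a snippet grid, describe each span with a small feature vector, then let every proposal attend to all the others. The per-snippet `actionness` scores, the three-number proposal descriptor, and the projection-free attention are hypothetical simplifications chosen to keep the example short.

```python
import math


def dense_proposals(num_snippets, max_duration):
    """Enumerate every (start, end) span of at most max_duration snippets."""
    return [
        (start, start + length)
        for start in range(num_snippets)
        for length in range(1, max_duration + 1)
        if start + length <= num_snippets
    ]


def proposal_feature(actionness, span):
    """Toy descriptor: [mean score inside span, relative center, relative length]."""
    start, end = span
    n = len(actionness)
    inside = actionness[start:end]
    return [sum(inside) / len(inside), (start + end) / (2 * n), (end - start) / n]


def self_attention(features):
    """Single-head scaled dot-product self-attention without learned weights."""
    dim = len(features[0])
    scale = math.sqrt(dim)
    encoded = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in features]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output row is a weighted mix of all proposals: global context.
        encoded.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        )
    return encoded


if __name__ == "__main__":
    actionness = [0.1, 0.9, 0.8, 0.2]       # hypothetical per-snippet scores
    proposals = dense_proposals(len(actionness), max_duration=2)
    features = [proposal_feature(actionness, p) for p in proposals]
    for span, row in zip(proposals, self_attention(features)):
        print(span, [round(v, 3) for v in row])
```

Because every proposal sees every other proposal, overlapping or adjacent candidates can inform each other before they are scored, which is the "global context" the item refers to.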

(4) Self‑Supervised Initialization (MoSI) : Pseudo‑motion generation from static images and masked motion learning to endow the network with motion perception, improving performance without extra data.
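
The pseudo-motion idea in item (4) can be illustrated by sliding a crop window across a single static image: the resulting frames look like a moving shot, and the direction and speed of the slide act as a free training label. This is a simplified sketch under assumed names and a list-of-lists image format; it leaves out the masking step and should not be read as the MoSI reference implementation.

```python
def pseudo_motion_clip(image, crop, num_frames, direction, speed):
    """Build a clip by sliding a crop window over one static image.

    image: 2D list of pixel values (rows of equal length).
    direction: "x" or "y"; speed: signed pixels per frame (0 means static).
    Returns (frames, label); the label is what a network would be asked
    to predict from the frames alone.
    """
    if direction not in ("x", "y"):
        raise ValueError("direction must be 'x' or 'y'")
    height, width = len(image), len(image[0])
    travel = abs(speed) * (num_frames - 1)
    need_w = crop + (travel if direction == "x" else 0)
    need_h = crop + (travel if direction == "y" else 0)
    if need_w > width or need_h > height:
        raise ValueError("image too small for this crop, speed and clip length")

    # Start at the far edge when moving in the negative direction.
    origin = travel if speed < 0 else 0
    frames = []
    for t in range(num_frames):
        offset = origin + speed * t
        x = offset if direction == "x" else 0
        y = offset if direction == "y" else 0
        frames.append([row[x:x + crop] for row in image[y:y + crop]])

    # A zero-speed clip is the same along either axis, so share one label.
    label = ("static", 0) if speed == 0 else (direction, speed)
    return frames, label


if __name__ == "__main__":
    # Hypothetical 4x6 "image" whose pixel value encodes its position.
    image = [[r * 10 + c for c in range(6)] for r in range(4)]
    frames, label = pseudo_motion_clip(image, crop=2, num_frames=3,
                                       direction="x", speed=2)
    print(label)
    for frame in frames:
        print(frame)
```

Training a classifier to recover the label from the frames forces it to compare content across time, which is how motion sensitivity can be learned without any video data.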

From Research to Product

Leveraging the EMC² foundation, Alibaba Cloud launched the Retina Video Cloud Multimedia AI platform, offering video search, moderation, structuring, and production capabilities that process millions of video hours daily across media, entertainment, sports, and e‑commerce sectors.

References:

Huang Z et al., “Self‑supervised motion learning from static images,” CVPR2021.

Arnab A et al., “ViViT: A video vision transformer,” arXiv 2021.

Feichtenhofer C et al., “SlowFast networks for video recognition,” ICCV2019.

Tran D et al., “Video classification with channel‑separated convolutional networks,” ICCV2019.

Lin T et al., “BMN: Boundary‑matching network for temporal action proposal generation,” ICCV2019.

Feng Y et al., “Relation Modeling in Spatio‑Temporal Action Localization,” arXiv 2021.
