Key Findings from Alibaba Moku Lab at ACM MM 2021

At ACM MM 2021, Alibaba’s Moku Lab presented four cutting‑edge studies: an interactive video inpainting system driven by user scribbles, a decoupled IoU regression model for object detection, a spatio‑temporal distortion‑aware video quality assessment framework, and a multimodal emotional relationship recognition dataset and benchmark.

Youku Technology

Deep Interactive Video Inpainting: an Invisibility Cloak for Harry Potter

Alibaba’s Moku Lab proposes a new interactive video inpainting task and an end‑to‑end framework that, for the first time, relies solely on arbitrary user scribbles rather than per‑frame mask annotations. A shared spatio‑temporal memory module fuses interactive video object segmentation and video restoration. Historical frames with object masks (either user scribbles or network‑predicted masks) are fed into the memory to assist segmentation and restoration of the current frame. The system supports iterative mask refinement, which improves restoration quality on challenging sequences. Both qualitative and quantitative experiments demonstrate the method’s superiority.
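To give a feel for the shared memory idea, the minimal sketch below matches the current frame's features (query) against stored features of historical frames (keys) and their mask features (values) with dot-product attention, returning a weighted readout that could guide both segmentation and restoration. The function names, the attention formulation, and the tiny feature dimensions are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def memory_read(query, mem_keys, mem_values):
    """Scaled dot-product attention read over a spatio-temporal memory
    (an illustrative stand-in for the paper's memory module)."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
              for key in mem_keys]
    weights = softmax(scores)
    dim = len(mem_values[0])
    # Convex combination of stored mask/restoration features.
    return [sum(w * v[d] for w, v in zip(weights, mem_values))
            for d in range(dim)]

# Memory holds features of two historical frames and their mask features.
mem_keys = [[1.0, 0.0], [0.0, 1.0]]
mem_values = [[0.9, 0.1], [0.2, 0.8]]
query = [1.0, 0.0]                      # current-frame feature
readout = memory_read(query, mem_keys, mem_values)
```

Because the query resembles the first stored frame, the readout is dominated by that frame's value, which is how historical masks can steer the current frame's segmentation.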

Decoupled IoU Regression for Object Detection

Non‑maximum suppression (NMS) is widely used in object detectors, but the mismatch between NMS confidence scores and true localization IoU degrades performance. Existing methods that predict IoU still face accuracy challenges. This paper analyzes the shortcomings of current IoU prediction approaches and introduces a novel Decoupled IoU Regression (DIR) model. DIR separates the complex IoU metric into two new indicators—Purity and Integrity—and predicts each independently. Additionally, a simple yet effective feature re‑alignment technique predicts IoU in a hindsight manner, yielding a more stable mapping. Extensive experiments show that DIR can be easily integrated into existing two‑stage detectors and significantly boosts their accuracy.
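To make the decoupling concrete, the sketch below assumes Purity measures how much of the predicted box is correct (intersection over prediction area) and Integrity how much of the ground truth is covered (intersection over ground-truth area); the paper's exact definitions may differ. Under these assumed definitions the two indicators recombine exactly into IoU.

```python
def box_area(b):
    # b = (x1, y1, x2, y2)
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    # Area of overlap between two axis-aligned boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def purity_integrity(pred, gt):
    """Assumed decomposition: Purity = precision-style overlap,
    Integrity = recall-style overlap (hypothetical definitions)."""
    inter = intersection(pred, gt)
    return inter / box_area(pred), inter / box_area(gt)

def iou_from_pi(p, i):
    # |A∪B| = |A| + |B| - |A∩B|  =>  1/IoU = 1/P + 1/I - 1
    return (p * i) / (p + i - p * i)

pred = (0.0, 0.0, 4.0, 4.0)
gt = (2.0, 0.0, 6.0, 4.0)
p, i = purity_integrity(pred, gt)   # 0.5, 0.5
iou = iou_from_pi(p, i)             # 1/3, matching direct IoU
```

Each indicator alone is a simpler, bounded regression target than the coupled IoU ratio, which is one plausible reason predicting them separately can be more stable.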

Perceptual Quality Assessment of Internet Videos

With the rapid rise of online video platforms, effective quality assessment of user‑generated, professionally‑generated, and occupationally‑generated content is essential. Moku Lab built the NET‑1k dataset, containing 1,072 videos selected for maximal content and distortion diversity using multiple quality metrics, and cleaned noisy subjective scores via a probabilistic graphical model. Based on the characteristics of internet videos, the paper proposes STDAM (Spatio‑Temporal Distortion‑Aware Model), which does not require a high‑resolution reference. Pre‑training on large image datasets enables the model to handle complex content. The architecture incorporates graph convolution and attention modules to capture spatial distortions, a flow module to exploit motion information, and a bidirectional LSTM to fuse frame‑level features into a video‑level representation for temporal distortion assessment. STDAM achieves superior performance on NET‑1k and demonstrates strong generalization in cross‑dataset evaluations.
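The frame-to-video fusion step can be sketched as follows, with simple exponential smoothing in both temporal directions standing in for the bidirectional LSTM; the smoothing factor and per-frame scalar scores are illustrative assumptions, not STDAM's actual features or weights.

```python
def smooth(scores, alpha=0.7):
    """One-directional exponential smoothing: a crude stand-in for one
    LSTM direction. alpha is an illustrative decay factor."""
    out, state = [], scores[0]
    for s in scores:
        state = alpha * state + (1 - alpha) * s
        out.append(state)
    return out

def video_score(frame_scores, alpha=0.7):
    """Fuse frame-level quality into a video-level score by averaging
    forward and backward smoothed passes (Bi-LSTM stand-in)."""
    fwd = smooth(frame_scores, alpha)
    bwd = smooth(frame_scores[::-1], alpha)[::-1]
    fused = [(f + b) / 2 for f, b in zip(fwd, bwd)]
    return sum(fused) / len(fused)

# A short burst of distortion (one low-quality frame) lowers the video
# score, but temporal context keeps a single frame from dominating.
scores = [0.9, 0.9, 0.2, 0.9, 0.9]
v = video_score(scores)
```

The point of the bidirectional pass is that each frame's contribution is judged in the context of both earlier and later frames, which matters for transient distortions like a brief stutter.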

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark

The paper introduces the Pairwise Emotional Relationship Recognition (PERR) task, which aims to identify whether a pair of actors in a video segment share an intimate, hostile, or neutral relationship by leveraging multimodal cues such as background music, subtitles, facial expressions, gestures, and dialogue. To support this task, the authors collected a large‑scale multimodal annotated dataset named ERATO. They propose a synchronized multimodal‑temporal attention unit to process the diverse streams, and a multimodal fusion mechanism that can be extended to other tasks. Experiments on two datasets show that the proposed approach outperforms existing methods.
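A minimal sketch of multimodal fusion for a task like PERR: per-modality embeddings (say, music, subtitles, and visual cues) are combined with softmax attention weights derived from learned relevance scores. The gating scheme, feature sizes, and scores here are illustrative assumptions, not the paper's actual attention unit.

```python
import math

def attention_fusion(modal_feats, gate_scores):
    """Fuse per-modality feature vectors with softmax attention weights
    (illustrative, not ERATO's exact fusion mechanism)."""
    m = max(gate_scores)
    exps = [math.exp(s - m) for s in gate_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(modal_feats[0])
    fused = [sum(w * f[d] for w, f in zip(weights, modal_feats))
             for d in range(dim)]
    return fused, weights

# Hypothetical 3-modality example: music, subtitle, visual embeddings.
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
gates = [2.0, 1.0, 0.0]   # learned relevance scores (illustrative)
fused, w = attention_fusion(feats, gates)
```

The attractive property of attention-style fusion is that the model can lean on whichever cue is informative for a given clip, e.g. background music when faces are off-screen.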

Tags: computer vision, object detection, video quality assessment, video inpainting, multimodal emotion recognition
Written by Youku Technology
Discover top-tier entertainment technology here.