Champion Solution of Media AI Alibaba Entertainment Video Object Segmentation Challenge
The Youku AI team won the Media AI Alibaba Entertainment Video Object Segmentation Challenge by enhancing the STM model with a spatial‑constrained memory reader, ASPP and HRNet refinement, a ResNeSt‑101 backbone, and a multi‑stage training pipeline. The team also devised an unsupervised framework combining DetectoRS detection, HRNet mask refinement, STM‑based association, and key‑frame optimization, reaching a 95.5 test score on a large, richly annotated video dataset.
Video object segmentation is a cutting‑edge topic in video algorithms, with growing applications across many industry scenarios. The "New Content, New Interaction" Global Video Cloud Innovation Challenge, co‑hosted by Intel and Alibaba Cloud and partnered with Youku, focuses on this field and has attracted over 2,000 teams worldwide.
This article uses the Media AI Alibaba Entertainment Algorithm Challenge as an example and presents the champion solution proposed by the Youku AI Department algorithm team, offering participants technical insights and practical experience.
Beyond the usual difficulties of perspective, illumination, scale variation, and occlusion, video person‑segmentation for intelligent video production must also address:
Rich and diverse scene content: the algorithm must correctly identify the main subject under complex background interference.
Complex clothing, hand‑held objects, and accessories: the algorithm must capture detailed semantic appearance.
Rapid, intense human motion: the algorithm must handle motion blur and severe deformation to avoid mis‑segmentation.
The Media AI competition dataset targets high‑precision, instance‑level video person segmentation, providing 1,700 finely annotated video clips (800 training and 50 test clips for each of the preliminary and final rounds), covering the above challenges.
Compared with standard academic/industrial datasets such as DAVIS and YouTube‑VOS, this dataset contains the industry’s largest number of human‑target annotations (180k frames, 300k human instances) and leads in annotation accuracy and content breadth. It includes a wide variety of content—historical dramas, modern series, street shots, dance, sports, etc.—and provides precise mask ground‑truth for fine‑grained edge details. Additionally, it annotates hand‑held objects and accessories to help models learn object‑object relationships.
Champion Solution – Algorithm Details
In the preliminary round, the team reproduced and improved the STM (Space‑Time Memory Networks) model. In the final round, they built on the semi‑supervised model and added modules for object detection, saliency discrimination, and key‑frame selection to achieve high‑precision unsupervised video segmentation.
1. Supervised Video Person Segmentation
The semi‑supervised VOS task aims to segment the target object in all subsequent frames given the mask of the first frame.
1.1 Basic Framework
Introduce a Spatial Constrained Memory Reader to address STM’s lack of spatial continuity.
STM matches pixels based solely on appearance, ignoring spatial continuity between consecutive frames. This can cause errors when multiple similar‑looking objects appear. The DAVIS2020 first‑place solution incorporated the previous frame’s mask into the encoded features, but this overly strong positional prior reduced non‑local matching ability.
Figure 1. Spatial‑constrained STM
The DAVIS2020 third‑place solution used a kernelized memory reader, which avoids one‑to‑many matching but still lacks spatial continuity.
Figure 2. Kernelized‑memory STM
We propose a method that leverages the previous frame’s mask without affecting the original matching training. A Gaussian kernel generated from the previous mask provides a spatial prior that corrects the optimal matching positions in memory, preserving non‑local matching capability while ensuring spatial continuity.
Figure 3. Spatial‑constrained Memory Reader diagram
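As a rough illustration of the idea, the numpy sketch below builds a Gaussian weight map from the previous frame's mask and uses it to re‑weight the memory‑read affinity. The centroid placement, bandwidth, and flattened affinity layout are illustrative assumptions, not the exact formulation inside the model's attention.

```python
import numpy as np

def gaussian_spatial_prior(prev_mask, sigma=8.0):
    """Gaussian weight map centred on the previous frame's mask centroid."""
    H, W = prev_mask.shape
    ys, xs = np.nonzero(prev_mask)
    cy, cx = ys.mean(), xs.mean()                 # centroid of the t-1 mask
    yy, xx = np.mgrid[0:H, 0:W]
    d2 = (yy - cy) ** 2 + (xx - cx) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))         # (H, W), values in (0, 1]

def constrained_read(affinity, prior):
    """Re-weight STM-style appearance affinity with the spatial prior.

    affinity: (N_query, H*W) appearance-matching scores against memory pixels.
    prior:    (H, W) Gaussian weights over memory locations.
    """
    scores = affinity * prior.reshape(1, -1)      # suppress distant matches
    scores /= scores.sum(axis=1, keepdims=True)   # renormalise as soft attention
    return scores
```

Because the prior only rescales the appearance scores, pixels that match strongly elsewhere are damped rather than forbidden, which is what preserves the non‑local matching ability described above.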
Add ASPP & HRNet post‑refinement to improve multi‑scale segmentation detail.
ASPP captures multi‑scale information, and HRNet refines the initial STM output, enhancing object boundary details.
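For intuition on how ASPP gathers multi‑scale context, here is a dependency‑free numpy sketch of parallel dilated 3×3 convolutions stacked into one feature volume. The dilation rates follow common DeepLab defaults and are an assumption; the real module also includes 1×1 convolutions and image‑level pooling.

```python
import numpy as np

def dilated_conv3x3(x, kernel, rate):
    """Naive single-channel 3x3 dilated convolution with zero padding."""
    H, W = x.shape
    pad = rate
    xp = np.pad(x, pad)
    out = np.zeros((H, W))
    for i in range(3):
        for j in range(3):
            dy, dx = (i - 1) * rate, (j - 1) * rate   # dilated tap offsets
            out += kernel[i, j] * xp[pad + dy:pad + dy + H,
                                     pad + dx:pad + dx + W]
    return out

def aspp(x, kernels, rates=(1, 6, 12, 18)):
    """Stack parallel dilated convolutions into a multi-scale feature volume."""
    return np.stack([dilated_conv3x3(x, k, r) for k, r in zip(kernels, rates)])
```

Larger rates sample the same 3×3 taps over a wider window, so the stacked output mixes fine boundary detail with broader context at no extra parameter cost.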
1.2 Training Strategy
Two‑stage training was employed:
Stage 1: Pre‑train on MS‑COCO static images converted to video sequences.
Stage 2: Merge public datasets (DAVIS, YouTube‑VOS) with the competition training set for further training.
Key training details include cropping three consecutive frames for augmentation while ensuring the object appears in the first frame, mixing the datasets in appropriate ratios, using a poly learning‑rate schedule to damp loss fluctuations, and freezing all batch‑norm layers, since the model's high memory consumption forces a small batch size.
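The poly schedule mentioned above fits in a few lines; `power=0.9` is the common DeepLab default and an assumption here.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial learning-rate decay: shrinks smoothly to zero over training."""
    return base_lr * (1.0 - step / max_steps) ** power
```

Each iteration the optimizer's learning rate is set to `poly_lr(base_lr, step, max_steps)`, giving a gentle ramp‑down instead of abrupt step drops.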
1.3 Other Enhancements
Backbone upgraded to ResNeSt‑101.
Test strategy: multi‑scale and flip inference.
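The multi‑scale and flip inference strategy can be sketched as follows. Nearest‑neighbour index maps keep the example dependency‑free; a real pipeline would use bilinear resizing and average logits rather than probabilities.

```python
import numpy as np

def tta_segment(image, predict, scales=(0.75, 1.0, 1.25)):
    """Average predictions over rescaled and horizontally flipped inputs.

    predict: callable mapping an (h, w) image to an (h, w) probability map.
    """
    H, W = image.shape
    acc = np.zeros((H, W))
    for s in scales:
        h, w = max(1, int(H * s)), max(1, int(W * s))
        ys, xs = np.arange(h) * H // h, np.arange(w) * W // w
        scaled = image[np.ix_(ys, xs)]            # nearest-neighbour downscale
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled
            prob = predict(inp)
            if flip:
                prob = prob[:, ::-1]              # undo the flip
            back_y = np.arange(H) * h // H        # map back to original size
            back_x = np.arange(W) * w // W
            acc += prob[np.ix_(back_y, back_x)]
    return acc / (2 * len(scales))
```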
1.4 Results
The Youku team’s model achieved a score of 95.5 on the test set, an improvement of nearly 5 points over the original STM.
2. Unsupervised Video Person Segmentation
The unsupervised VOS task aims to discover foreground objects and segment them continuously without any annotation.
2.1 Algorithm Framework
The final‑round pipeline consists of four steps:
a. Frame‑wise instance segmentation
DetectoRS is used as the detector, trained on MS‑COCO without fine‑tuning on the competition data. Only the "person" class is retained, with a low confidence threshold (0.1) to keep many proposals.
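Keeping only low‑threshold person proposals is a one‑liner; the detection dict layout and COCO "person" class index used below are illustrative assumptions.

```python
def filter_person_proposals(detections, score_thr=0.1, person_label=0):
    """Keep only 'person' proposals above a deliberately low threshold,
    so that partially occluded or blurred people are not dropped early."""
    return [d for d in detections
            if d["label"] == person_label and d["score"] >= score_thr]
```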
b. Mask post‑processing
Instance masks from DetectoRS are low‑resolution and coarse. A semantic segmentation model (HRNet) refines these masks (image + mask → HRNet → refined mask).
Figure 4. DetectoRS output mask (top) and refined mask (bottom)
c. Inter‑frame data association
STM warps the mask from frame t‑1 to frame t, then matches the warped mask with DetectoRS proposals using the Hungarian algorithm. High‑confidence proposals (≥0.8) are kept for the first frame.
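The association step can be sketched with scipy's Hungarian solver over a 1 − IoU cost matrix; the IoU gate value below is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def associate(warped_masks, proposal_masks, iou_thr=0.3):
    """Match STM-warped masks from frame t-1 to detector proposals in frame t.

    Returns (warped_idx, proposal_idx) pairs; matches below iou_thr are
    discarded, leaving those tracks/proposals unassigned.
    """
    cost = np.array([[1.0 - mask_iou(w, p) for p in proposal_masks]
                     for w in warped_masks])
    rows, cols = linear_sum_assignment(cost)      # globally optimal assignment
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thr]
```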
d. Key‑frame selection and iterative optimization
Frames with good segmentation quality are selected as key frames. Their masks are used as memory for bidirectional STM prediction, which helps recover missed detections at video start and improves handling of occlusions. Each iteration yields a modest metric gain.
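One plausible way to pick well‑spaced, high‑quality key frames is a greedy selection over per‑frame quality scores. The scoring signal itself (e.g. mask confidence or temporal stability) and the spacing parameters are assumptions, as the article does not specify the exact criterion.

```python
def select_key_frames(frame_scores, top_k=5, min_gap=10):
    """Greedily pick the highest-scoring frames, at least min_gap apart,
    so memory frames cover the clip instead of clustering together."""
    order = sorted(range(len(frame_scores)), key=lambda i: -frame_scores[i])
    keys = []
    for i in order:
        if all(abs(i - k) >= min_gap for k in keys):
            keys.append(i)
        if len(keys) == top_k:
            break
    return sorted(keys)
```

The selected indices then seed the bidirectional STM pass described above, and the process can be repeated once the new masks update the quality scores.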
Video object segmentation (VOS) is a recognized technical challenge with broad real‑world demand. Participants of the "New Content, New Interaction" challenge are expected to build on these techniques to deliver smarter, more convenient, and more engaging video services.
Youku Technology