
Champion Solution of Media AI Alibaba Entertainment Video Object Segmentation Challenge

The Youku AI team won the Media AI Alibaba Entertainment Video Object Segmentation Challenge by enhancing the STM model with a spatial‑constrained memory reader, ASPP‑HRNet refinement, ResNeSt‑101 backbone, and a multi‑stage training pipeline, while also devising an unsupervised framework that combines DetectoRS detection, HRNet mask refinement, STM‑based association, and key‑frame optimization to achieve 95.5% test score on a large, richly annotated video dataset.

Youku Technology

Video object segmentation is a cutting‑edge topic in video algorithms, with growing applications across many industry scenarios. The "New Content, New Interaction" Global Video Cloud Innovation Challenge, co‑hosted by Intel and Alibaba Cloud and partnered with Youku, focuses on this field and has attracted over 2,000 teams worldwide.

This article uses the Media AI Alibaba Entertainment Algorithm Challenge as an example and presents the champion solution proposed by the Youku AI Department algorithm team, offering participants technical insights and practical experience.

Beyond the usual difficulties of perspective, illumination, scale variation, and occlusion, video person‑segmentation for intelligent video production must also address:

Rich and diverse scene content: the algorithm must correctly identify the main subject under complex background interference.

Complex clothing, hand‑held objects, and accessories: the algorithm must capture detailed semantic appearance.

Rapid, intense human motion: the algorithm must handle motion blur and severe deformation to avoid mis‑segmentation.

The Media AI competition dataset targets high‑precision, instance‑level video person segmentation, providing 1,700 finely annotated video clips (800 training and 50 test clips for each of the preliminary and final rounds), covering the above challenges.

Compared with standard academic/industrial datasets such as DAVIS and YouTube‑VOS, this dataset contains the industry’s largest number of human‑target annotations (180k frames, 300k human instances) and leads in annotation accuracy and content breadth. It includes a wide variety of content—historical dramas, modern series, street shots, dance, sports, etc.—and provides precise mask ground‑truth for fine‑grained edge details. Additionally, it annotates hand‑held objects and accessories to help models learn object‑object relationships.

Champion Solution – Algorithm Details

In the preliminary round, the team reproduced and improved the STM (Space‑Time Memory Networks) model. In the final round, they built on the semi‑supervised model and added modules for object detection, saliency discrimination, and key‑frame selection to achieve high‑precision unsupervised video segmentation.

1. Supervised Video Person Segmentation

The semi‑supervised VOS task aims to segment the target object in all subsequent frames given the mask of the first frame.

1.1 Basic Framework

To address STM's lack of spatial continuity, the team introduced a Spatial‑Constrained Memory Reader.

STM matches pixels based solely on appearance, ignoring spatial continuity between consecutive frames. This can cause errors when multiple similar‑looking objects appear. The DAVIS 2020 first‑place solution incorporated the previous frame's mask into the encoded features, but this overly strong positional prior reduced non‑local matching ability.

Figure 1. Spatial‑constrained STM

The DAVIS 2020 third‑place solution used a kernelized memory reader, which avoids one‑to‑many matching but still lacks spatial continuity.

Figure 2. Kernelized‑memory STM

We propose a method that leverages the previous frame’s mask without affecting the original matching training. A Gaussian kernel generated from the previous mask provides a spatial prior that corrects the optimal matching positions in memory, preserving non‑local matching capability while ensuring spatial continuity.

Figure 3. Spatial‑constrained Memory Reader diagram
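The idea can be sketched in a few lines: a Gaussian kernel built from the previous frame's mask re‑weights the appearance‑based memory affinities, so matches far from the object are suppressed without retraining the matcher. This is a minimal NumPy sketch; the centroid‑based kernel placement and the `sigma` value are illustrative assumptions, not the team's exact formulation.

```python
import numpy as np

def gaussian_prior(prev_mask, sigma=4.0):
    """Spatial prior from the previous frame's mask: high weight near the
    object, decaying with distance (assumed form; needs a non-empty mask)."""
    h, w = prev_mask.shape
    ys, xs = np.nonzero(prev_mask)
    cy, cx = ys.mean(), xs.mean()            # object centroid
    yy, xx = np.mgrid[0:h, 0:w]
    d2 = (yy - cy) ** 2 + (xx - cx) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))    # Gaussian kernel around the object

def constrained_read(affinity, prev_mask, sigma=4.0):
    """Modulate appearance-based memory affinities with the spatial prior.

    affinity: (HW_query, HW_memory) normalized matching scores.
    The prior acts on the memory axis, so matches far from the previous
    mask are down-weighted while non-local matching is preserved.
    """
    prior = gaussian_prior(prev_mask, sigma).reshape(-1)   # (HW_memory,)
    weighted = affinity * prior[None, :]
    return weighted / (weighted.sum(axis=1, keepdims=True) + 1e-8)
```

Because the prior only rescales existing affinities, the original matching network needs no retraining, matching the design goal described above.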

ASPP and HRNet post‑refinement were added to improve multi‑scale segmentation detail.

ASPP captures multi‑scale information, and HRNet refines the initial STM output, enhancing object boundary details.
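As a toy illustration of why ASPP's parallel atrous (dilated) branches capture multi‑scale context, here is a 1‑D sketch: each branch sees a wider span of the input with the same kernel size, and the branches are then fused. The summation fusion and rate set are simplifications of the real 2‑D module.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution (valid padding): the effective receptive
    field grows with the dilation rate at no extra parameter cost."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

def aspp_1d(x, kernel, rates=(1, 2, 4)):
    """Toy ASPP: parallel branches at several dilation rates, fused by
    summing their overlapping valid regions (simplified fusion)."""
    branches = [dilated_conv1d(x, kernel, r) for r in rates]
    n = min(len(b) for b in branches)
    return sum(b[:n] for b in branches)
```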

1.2 Training Strategy

Two‑stage training was employed:

Stage 1: Pre‑train on MS‑COCO static images converted to video sequences.

Stage 2: Merge public datasets (DAVIS, YouTube‑VOS) with the competition training set for further training.

Key training details:

Crop three consecutive frames for data augmentation, ensuring the target object appears in the first frame.

Mix the public and competition datasets in appropriate ratios.

Use a poly learning‑rate schedule to smooth loss fluctuations.

Freeze all batch‑norm layers: the model's high memory consumption forces a small batch size, under which batch‑norm statistics are unreliable.
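The poly schedule mentioned above has a standard closed form; the exponent of 0.9 below is the common default and an assumption here, not a value stated by the team.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly learning-rate decay: starts at base_lr and decays smoothly
    to zero at max_steps, which helps damp late-training loss swings."""
    return base_lr * (1 - step / max_steps) ** power
```

For example, with `base_lr=0.01` over 100 steps the rate falls gradually rather than in abrupt stages.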

1.3 Other Enhancements

Backbone upgraded to ResNeSt‑101.

Test strategy: multi‑scale and flip inference.
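Multi‑scale and flip inference averages predictions over rescaled and mirrored copies of the input. A minimal sketch using nearest‑neighbour index sampling as a stand‑in for proper interpolation, assuming the model maps an H×W array to a same‑shape score map:

```python
import numpy as np

def tta_segment(model, image, scales=(0.75, 1.0, 1.25)):
    """Test-time augmentation: run the model on rescaled and horizontally
    flipped copies, map each output back to the input grid, and average."""
    h, w = image.shape
    acc = np.zeros((h, w))
    n = 0
    for s in scales:
        sh, sw = max(1, int(h * s)), max(1, int(w * s))
        yi = np.arange(sh) * h // sh            # nearest-neighbour resize
        xi = np.arange(sw) * w // sw
        scaled = image[np.ix_(yi, xi)]
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled
            out = model(inp)
            if flip:
                out = out[:, ::-1]              # undo the flip on the output
            yb = np.arange(h) * sh // h         # resize back to (h, w)
            xb = np.arange(w) * sw // w
            acc += out[np.ix_(yb, xb)]
            n += 1
    return acc / n
```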

1.4 Results

The Youku team’s model achieved a score of 95.5 on the test set, an improvement of nearly 5 points over the original STM.

2. Unsupervised Video Person Segmentation

The unsupervised VOS task aims to discover foreground objects and segment them continuously without any annotation.

2.1 Algorithm Framework

The final‑round pipeline consists of four steps:

a. Frame‑wise instance segmentation

DetectoRS is used as the detector, trained on MS‑COCO without fine‑tuning on the competition data. Only the "person" class is retained, with a low confidence threshold (0.1) to keep many proposals.
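Step (a) reduces to a simple filter over detector outputs. The dict layout below is an assumed representation of DetectoRS detections, not its actual API:

```python
def filter_person_proposals(detections, person_label="person", thresh=0.1):
    """Keep only 'person' proposals above a deliberately low confidence
    threshold, so recall stays high for the later association steps.
    Each detection is assumed to carry 'label' and 'score' keys."""
    return [d for d in detections
            if d["label"] == person_label and d["score"] >= thresh]
```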

b. Mask post‑processing

Instance masks from DetectoRS are low‑resolution and coarse. A semantic segmentation model (HRNet) refines these masks (image + mask → HRNet → refined mask).

Figure 4. DetectoRS output mask (top) and refined mask (bottom)

c. Inter‑frame data association

STM warps the mask from frame t‑1 to frame t, then matches the warped mask with DetectoRS proposals using the Hungarian algorithm. High‑confidence proposals (≥0.8) are kept for the first frame.
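The association step can be illustrated as a maximum‑IoU one‑to‑one assignment between warped masks and proposals. For clarity this sketch brute‑forces the optimal assignment over permutations; the Hungarian algorithm computes the same optimum in polynomial time.

```python
import numpy as np
from itertools import permutations

def mask_iou(a, b):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def associate(warped_masks, proposal_masks):
    """Optimal one-to-one matching (max total IoU) between STM-warped
    masks from frame t-1 and detector proposals in frame t.
    Brute force for illustration; assumes len(warped) <= len(proposals)."""
    n, m = len(warped_masks), len(proposal_masks)
    iou = np.array([[mask_iou(w, p) for p in proposal_masks]
                    for w in warped_masks])
    best, best_perm = -1.0, None
    for perm in permutations(range(m), n):
        total = sum(iou[i, j] for i, j in enumerate(perm))
        if total > best:
            best, best_perm = total, perm
    return list(enumerate(best_perm))       # (warped index, proposal index)
```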

d. Key‑frame selection and iterative optimization

Frames with good segmentation quality are selected as key frames. Their masks are used as memory for bidirectional STM prediction, which helps recover missed detections at video start and improves handling of occlusions. Each iteration yields a modest metric gain.
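Key‑frame selection can be as simple as ranking frames by a segmentation‑quality score; the scoring heuristic itself is not specified in the article, so the sketch below assumes scores are given.

```python
def select_key_frames(frame_scores, top_k=3):
    """Pick the top-k frames by quality score (in temporal order) to serve
    as STM memory for bidirectional re-prediction over the whole clip."""
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:top_k])
```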

Video object segmentation (VOS) is a recognized technical challenge with broad real‑world demand. Participants of the "New Content, New Interaction" challenge are expected to build on these techniques to deliver smarter, more convenient, and more engaging video services.

Tags: computer vision, deep learning, semi-supervised learning, spatial memory networks, unsupervised VOS, video object segmentation
Written by

Youku Technology

Discover top-tier entertainment technology here.
