Alibaba VOS Innovations: Semi-supervised, Interactive & Unsupervised Segmentation
Video Object Segmentation (VOS) is essential for content creation, and Alibaba’s research outlines three main approaches—semi-supervised, interactive, and unsupervised—detailing their algorithms, challenges, evaluation metrics, recent breakthroughs, and future plans to improve accuracy in complex scenes.
Introduction
Video Object Segmentation (VOS) extracts foreground objects from video frames. It underpins content re‑creation such as 3D video effects, where segmentation work currently consumes over 99% of creators’ time.
For platforms like Youku, VOS greatly improves production efficiency. Interactive VOS is especially useful, because it raises accuracy with only minimal user interaction.
Research Directions
The computer‑vision community focuses on three VOS directions:
Semi‑supervised VOS
Interactive VOS
Unsupervised VOS
These correspond to the three tracks of the DAVIS 2019 Challenge.
Semi‑supervised VOS
Also known as one‑shot VOS, it takes a ground‑truth mask for the first frame and segments the object in all subsequent frames. Challenges include foreground and background with similar colors, motion, illumination changes, and occlusions.
Algorithms fall into two groups: online‑learning methods, which fine‑tune a model per object (e.g., Lucid Data Dreaming, OSVOS, PReMVOS), and offline‑learning methods, which rely on pre‑trained models without per‑object fine‑tuning (e.g., FEELVOS, Space‑Time Memory Networks). Evaluation uses the mean Jaccard index (J, region similarity) and the F‑measure (F, contour accuracy).
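As a rough illustration of the mean Jaccard metric mentioned above, the sketch below computes intersection over union for binary masks stored as nested lists of 0/1 values. The function names and the convention that two empty masks score 1.0 are assumptions made for this example, not taken from the DAVIS toolkit. The contour‑based F‑measure needs boundary extraction and is omitted here.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(pred_frames, gt_frames):
    """Average J over the frames of one sequence."""
    scores = [jaccard(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(scores) / len(scores)


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
print(jaccard(pred, gt))  # 2 / 4 = 0.5
```

Real evaluations typically average this score over all objects and sequences in a dataset; the per‑sequence mean shown here is only the innermost step.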
Interactive VOS
Interactive VOS accepts user inputs (bounding boxes, scribbles, edge points) on any frame, propagates segmentation using semi‑supervised methods, and iteratively refines results. The typical pipeline includes five steps: user input, image‑level segmentation, temporal propagation, result update, and repetition until satisfaction.
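The five‑step pipeline can be sketched as a simple loop. Everything here is hypothetical scaffolding: `get_user_input`, `segment_frame`, `propagate`, and `is_satisfied` are placeholder callables standing in for the real annotation interface and models, and `max_rounds` is an assumed safety limit.

```python
def interactive_vos(frames, get_user_input, segment_frame, propagate,
                    is_satisfied, max_rounds=8):
    """Run the interaction loop and return one mask per frame."""
    masks = [None] * len(frames)
    for _ in range(max_rounds):
        # 1. User input: an annotation (box, scribble, points) on any frame.
        frame_idx, annotation = get_user_input(frames, masks)
        # 2. Image-level segmentation of the annotated frame.
        seed = segment_frame(frames[frame_idx], annotation, masks[frame_idx])
        # 3. Temporal propagation from that frame to the rest of the video.
        propagated = propagate(frames, frame_idx, seed)
        # 4. Result update: here the new masks simply replace the old ones.
        masks = propagated
        # 5. Repeat until the user is satisfied.
        if is_satisfied(masks):
            break
    return masks


# Stub components (placeholders only) to show the control flow.
rounds = []

def get_user_input(frames, masks):
    rounds.append(len(rounds))
    return 0, "scribble"

def segment_frame(frame, annotation, previous_mask):
    return f"mask({frame})"

def propagate(frames, frame_idx, seed):
    return [seed if i == frame_idx else f"propagated({f})"
            for i, f in enumerate(frames)]

def is_satisfied(masks):
    return len(rounds) >= 2

result = interactive_vos(["f0", "f1", "f2"], get_user_input,
                         segment_frame, propagate, is_satisfied)
print(result)
```

Passing the components in as callables keeps the loop independent of any particular segmentation or propagation model, which mirrors how the text describes reusing semi‑supervised methods for the propagation step.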
Performance is measured by J&F@60s and AUC, emphasizing speed. Unlike online‑learning semi‑supervised methods, interactive VOS avoids heavy fine‑tuning, offering higher usability.
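To make the two metrics concrete, the sketch below treats an evaluation run as a curve of (elapsed seconds, J&F score) points. It reads J&F@60s off the curve by linear interpolation and computes a time‑normalized area under the curve with the trapezoidal rule. The curve values are invented for illustration, and the official DAVIS evaluation may differ in details such as time limits and normalization.

```python
def score_at(curve, t):
    """Linearly interpolate the J&F score at time t (seconds).

    `curve` is a list of (time, score) pairs with strictly increasing times.
    """
    if t <= curve[0][0]:
        return curve[0][1]
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if t0 <= t <= t1:
            return s0 + (s1 - s0) * (t - t0) / (t1 - t0)
    return curve[-1][1]  # past the last point: hold the final score


def normalized_auc(curve):
    """Trapezoidal area under the curve, divided by the time span."""
    area = sum((t1 - t0) * (s0 + s1) / 2
               for (t0, s0), (t1, s1) in zip(curve, curve[1:]))
    return area / (curve[-1][0] - curve[0][0])


# Invented example curve: quality rises quickly, then plateaus.
curve = [(0, 0.0), (40, 0.6), (80, 0.8), (120, 0.8)]
print(round(score_at(curve, 60), 3))    # about 0.7
print(round(normalized_auc(curve), 3))  # about 0.6
```

Both numbers reward methods that reach a good segmentation early, which is why slow per‑object fine‑tuning is a poor fit for this track.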
Unsupervised VOS
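Unsupervised VOS takes only the RGB video as input, with no mask or user annotation, and aims to segment the salient objects automatically. Compared with the other two settings, it adds a saliency detection module to decide which objects to segment. The task features in the DAVIS and YouTube‑VOS challenges.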
Alibaba Entertainment MoCo Lab Progress
Since March 2019 the lab has pursued semi‑supervised and interactive VOS. By May 2019 they achieved J&F@60s = 0.761 for interactive VOS and J&F = 0.763 for semi‑supervised VOS, ranking fourth in the DAVIS interactive track.
Future work targets complex scenarios (small objects, similar foreground/background, fast motion, severe occlusion) by improving online learning, space‑time networks, and region proposal/verification.
References
The 2019 DAVIS Challenge on VOS: Unsupervised Multi‑Object Segmentation. Caelles et al., arXiv:1905.00737, 2019.
Lucid Data Dreaming for Object Tracking. Khoreva et al., arXiv:1703.09554, 2017.
One‑shot video object segmentation. Caelles et al., CVPR, 2017.
PReMVOS: Proposal‑generation, refinement and merging for video object segmentation. Luiten et al., arXiv, 2018.
FEELVOS: Fast End‑to‑End Embedding Learning for Video Object Segmentation. Voigtlaender et al., CVPR 2019.
Fast User‑Guided Video Object Segmentation by Interaction‑and‑Propagation Networks. Oh et al., CVPR 2019.
Fast Online Object Tracking and Segmentation: A Unifying Approach. Wang et al., CVPR 2019.
Robust Multiple Object Mask Propagation with Efficient Object Tracking. Ren et al., CVPR Workshops 2019.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.