Alibaba VOS Innovations: Semi-supervised, Interactive & Unsupervised Segmentation

Video Object Segmentation (VOS) is essential for content creation, and Alibaba’s research outlines three main approaches—semi-supervised, interactive, and unsupervised—detailing their algorithms, challenges, evaluation metrics, recent breakthroughs, and future plans to improve accuracy in complex scenes.

Alibaba Cloud Developer

Introduction

Video Object Segmentation (VOS) aims to extract foreground objects from video frames. It is crucial for content re-creation such as 3D video effects, where segmentation currently consumes over 99% of creators' time.

Video object segmentation example

For platforms like Youku, VOS greatly improves production efficiency. Interactive VOS is especially valuable because it raises accuracy with minimal user interaction.

Research Directions

The computer‑vision community focuses on three VOS directions:

Semi‑supervised VOS

Interactive VOS

Unsupervised VOS

These correspond to the three tracks of the DAVIS 2019 Challenge.

Semi‑supervised VOS

Semi-supervised VOS, also known as one-shot VOS (as in the OSVOS method), uses a ground-truth mask on the first frame to segment the object in all subsequent frames. Challenges include similar foreground and background colors, motion, illumination changes, and occlusions.

Semi-supervised VOS example

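Algorithms fall into two groups: online-learning methods, which fine-tune the model for each target object (e.g., Lucid Data Dreaming, OSVOS, PReMVOS), and offline-learning methods, which rely on a pre-trained model without per-object fine-tuning (e.g., FEELVOS, space-time memory networks). Evaluation uses the mean Jaccard index J (region similarity, i.e., mask intersection over union) and the mean F-measure F (contour accuracy).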
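The two evaluation measures can be illustrated with a minimal, dependency-free sketch. It assumes binary masks stored as nested lists of 0/1 values, and it takes contour precision and recall as given inputs rather than computing the boundary matching that the benchmark toolkit performs. The function names are illustrative and not taken from any official evaluation code.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def jaccard(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of contour precision and recall (both supplied by the caller)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_jaccard(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Average J over all frames of a sequence."""
    scores = [jaccard(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
print(jaccard(pred, gt))  # one shared pixel out of three covered pixels
```

A J&F score is then the average of the mean J and the mean F over the evaluated objects.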

Interactive VOS

Interactive VOS accepts user inputs (bounding boxes, scribbles, edge points) on any frame, propagates segmentation using semi‑supervised methods, and iteratively refines results. The typical pipeline includes five steps: user input, image‑level segmentation, temporal propagation, result update, and repetition until satisfaction.
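The five steps above can be sketched as a simple control loop. This is only an outline under stated assumptions: the callables `get_user_input`, `segment_frame`, `propagate`, and `is_satisfied`, and the `max_rounds` limit, are hypothetical placeholders for whatever models and UI a real system uses, not part of any method described here.

```python
from typing import Callable, List, Tuple

Mask = List[List[int]]
Interaction = Tuple[int, object]  # (frame index, scribble / box / edge points)


def interactive_vos(
    num_frames: int,
    get_user_input: Callable[[List[Mask]], Interaction],
    segment_frame: Callable[[int, object], Mask],
    propagate: Callable[[int, Mask, List[Mask]], List[Mask]],
    is_satisfied: Callable[[List[Mask]], bool],
    max_rounds: int = 8,  # hypothetical safety limit on interaction rounds
) -> List[Mask]:
    """Run the interact / segment / propagate / update loop."""
    masks: List[Mask] = [[] for _ in range(num_frames)]  # no result yet
    for _ in range(max_rounds):
        # Step 1: the user annotates any frame, seeing the current result.
        frame_idx, annotation = get_user_input(masks)
        # Step 2: image-level segmentation on the annotated frame.
        frame_mask = segment_frame(frame_idx, annotation)
        # Step 3: temporal propagation to the other frames.
        propagated = propagate(frame_idx, frame_mask, masks)
        # Step 4: update the running result.
        masks = propagated
        # Step 5: repeat until the user is satisfied.
        if is_satisfied(masks):
            break
    return masks
```

Passing the models in as callables keeps the loop independent of any particular segmentation or propagation network.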

Interactive VOS pipeline

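Performance is measured by J&F@60s (the J&F score reached after 60 seconds of interaction) and AUC (the area under the J&F-versus-time curve), both of which emphasize speed. Unlike online-learning semi-supervised methods, interactive VOS avoids heavy fine-tuning, which makes it more usable in practice.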
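Both interactive metrics can be derived from a curve of J&F scores sampled over time. The sketch below assumes strictly increasing timestamps in seconds, linear interpolation between samples, and normalization of the area by the time span. The challenge's exact timing protocol and normalization are not reproduced here, so treat those choices as assumptions.

```python
from typing import Sequence


def interpolate_at(times: Sequence[float], scores: Sequence[float], t: float) -> float:
    """Linearly interpolate the J&F score at time t (seconds)."""
    if t <= times[0]:
        return scores[0]
    for i in range(1, len(times)):
        if t <= times[i]:
            t0, t1 = times[i - 1], times[i]
            s0, s1 = scores[i - 1], scores[i]
            return s0 + (s1 - s0) * (t - t0) / (t1 - t0)
    return scores[-1]  # past the last sample: hold the final score


def normalized_auc(times: Sequence[float], scores: Sequence[float]) -> float:
    """Trapezoidal area under the J&F-versus-time curve, divided by the time span."""
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (scores[i] + scores[i - 1]) * (times[i] - times[i - 1])
    return area / (times[-1] - times[0])


# Made-up sample curve: J&F after each interaction round.
times = [0.0, 40.0, 80.0]
scores = [0.5, 0.7, 0.8]
print(interpolate_at(times, scores, 60.0))  # J&F@60s for this toy curve
print(normalized_auc(times, scores))
```

A method that reaches a high score quickly does well on both numbers, which is why these metrics reward fast inference as well as accuracy.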

Unsupervised VOS

Unsupervised VOS takes only the RGB video as input and aims to segment salient objects automatically. Compared with the other two settings, it adds a saliency detection module. It features in both the DAVIS and YouTube-VOS challenges.

J&F curve example

Alibaba Entertainment MoCo Lab Progress

Since March 2019, the lab has pursued semi-supervised and interactive VOS. By May 2019 it had achieved J&F@60s = 0.761 for interactive VOS and J&F = 0.763 for semi-supervised VOS, ranking fourth in the DAVIS interactive track.

Future work targets complex scenarios (small objects, similar foreground/background, fast motion, severe occlusion) by improving online learning, space‑time networks, and region proposal/verification.

References

The 2019 DAVIS Challenge on VOS: Unsupervised Multi‑Object Segmentation. Caelles et al., arXiv:1905.00737, 2019.

Lucid datadreaming for object tracking. Khoreva et al., arXiv:1703.09554, 2017.

One‑shot video object segmentation. Caelles et al., CVPR, 2017.

PReMVOS: Proposal‑generation, refinement and merging for video object segmentation. Luiten et al., arXiv, 2018.

FEELVOS: Fast End‑to‑End Embedding Learning for Video Object Segmentation. Voigtlaender et al., CVPR 2019.

Fast User‑Guided Video Object Segmentation by Interaction‑and‑Propagation Networks. Oh et al., CVPR 2019.

Fast Online Object Tracking and Segmentation: A Unifying Approach. Wang et al., CVPR 2019.

Robust Multiple Object Mask Propagation with Efficient Object Tracking. Ren et al., CVPR Workshops 2019.
