
MiVOS: Achieving Precise Video Segmentation with Minimal User Interaction

MiVOS introduces a highly decoupled, three‑module framework—Interaction‑to‑Mask, Mask Propagation, and Difference‑aware Fusion—for interactive video object segmentation, delivering precise masks with fewer user interactions, validated on the DAVIS benchmark and supported by a new large‑scale synthetic VOS dataset (BL30K).

Kuaishou Audio & Video Technology

The authors propose a modular interactive video object segmentation algorithm (MiVOS) composed of three highly decoupled modules: Interaction‑to‑Mask, Mask Propagation, and Difference‑aware Fusion. This decoupling improves performance and generalization.

Abstract

MiVOS enables users to obtain object masks conveniently via a separately trained single‑frame interactive segmentation module. Using various user interactions (e.g., scribbles, clicks), the method is evaluated qualitatively and quantitatively on the DAVIS dataset, showing superior accuracy with fewer interaction frames compared to state‑of‑the‑art algorithms. A large‑scale synthetic VOS dataset (BL30K) is also released to foster further research.

Background

Video Object Segmentation (VOS) is fundamental for video scene understanding and editing. Interactive VOS (iVOS) allows iterative user refinement of segmentation results, which is valuable for short‑video editing, special‑effects creation, and content creation.

Problem

Existing iVOS methods tightly couple interaction understanding and temporal mask propagation, limiting interaction diversity and making training difficult. Prior attempts at decoupling fail to fully exploit user intent during propagation, hindering performance.

Method

Interaction‑to‑Mask: A Scribble‑to‑Mask (S2M) network based on DeepLabV3+ takes a six‑channel input (three RGB channels, the existing mask, and positive and negative scribble maps) and produces a mask for a single frame. It supports clicks, scribbles, and local refinement.
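To make the six‑channel input concrete, here is a minimal sketch of how such an input tensor could be assembled. The function name and channel ordering are illustrative assumptions, not the actual S2M implementation:

```python
import numpy as np

def build_s2m_input(rgb, prev_mask, pos_scribble, neg_scribble):
    """Stack the six input channels described above: 3 RGB channels,
    the existing mask, and positive/negative scribble maps, each of
    shape (H, W). Channel ordering is an assumption for illustration;
    the real S2M network defines its own layout."""
    assert rgb.shape[2] == 3
    channels = [rgb[..., c] for c in range(3)]
    channels += [prev_mask, pos_scribble, neg_scribble]
    return np.stack(channels, axis=0)  # (6, H, W), network-ready layout

# toy example: a 4x4 frame with a single positive scribble pixel
rgb = np.random.rand(4, 4, 3).astype(np.float32)
prev_mask = np.zeros((4, 4), np.float32)
pos = np.zeros((4, 4), np.float32)
pos[1, 2] = 1.0
neg = np.zeros((4, 4), np.float32)
x = build_s2m_input(rgb, prev_mask, pos, neg)
print(x.shape)  # (6, 4, 4)
```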

Mask Propagation: Inspired by STM, this module treats past frames and their masks as a memory bank; an attention‑based memory read predicts the current frame’s mask. A novel top‑k filtering strategy is integrated to improve both speed and accuracy without complex training tricks.
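The top‑k idea can be sketched as follows: in an STM‑style attention read, each query position only attends to its k most similar memory entries, suppressing noisy low‑affinity matches. The shapes and the plain dot‑product similarity here are simplifying assumptions, not the paper's exact formulation:

```python
import numpy as np

def topk_memory_read(query_key, mem_keys, mem_values, k=2):
    """Simplified attention memory read with top-k filtering.

    query_key : (C, Nq)   key features of the current frame
    mem_keys  : (C, Nm)   key features of the memory frames
    mem_values: (Cv, Nm)  value features carrying mask information
    Returns a (Cv, Nq) read-out for the current frame."""
    affinity = mem_keys.T @ query_key                # (Nm, Nq) similarities
    drop = np.argsort(affinity, axis=0)[:-k, :]      # all but the top-k entries
    np.put_along_axis(affinity, drop, -np.inf, axis=0)
    affinity -= affinity.max(axis=0, keepdims=True)  # stable softmax
    w = np.exp(affinity)                             # filtered rows become 0
    w /= w.sum(axis=0, keepdims=True)
    return mem_values @ w                            # weighted value read-out

# toy usage: three memory entries with orthogonal keys
mem_keys = np.eye(3)
mem_values = np.array([[10.0, 20.0, 30.0]])
query = np.array([[1.0], [0.0], [0.0]])
out = topk_memory_read(query, mem_keys, mem_values, k=1)
print(out)  # only the best-matching entry contributes: [[10.]]
```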

Difference‑aware Fusion: This module captures user intent by fusing the current propagated mask with the previous round’s mask under guidance from mask differences, mitigating information loss from decoupling.
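The intuition can be illustrated with a toy sketch. At the user‑interacted frame, the correction splits into a positive difference (pixels the user added) and a negative difference (pixels the user removed); the fusion then favors the new mask near the interaction. The function names and the fixed linear blend below are illustrative assumptions; the actual module learns this fusion from the difference signals:

```python
import numpy as np

def difference_signals(corrected, previous):
    """Split a user's correction at the interacted frame into a
    positive difference (pixels added) and a negative difference
    (pixels removed)."""
    pos = np.clip(corrected - previous, 0.0, 1.0)
    neg = np.clip(previous - corrected, 0.0, 1.0)
    return pos, neg

def naive_fuse(new_mask, old_mask, frame_dist, max_dist):
    """Toy stand-in for the learned fusion: trust the newly propagated
    mask near the interacted frame and fall back toward the previous
    round's mask far away from it."""
    w = 1.0 - min(frame_dist / max_dist, 1.0)
    return w * new_mask + (1.0 - w) * old_mask

prev = np.array([[1.0, 0.0], [0.0, 0.0]])
corr = np.array([[0.0, 1.0], [0.0, 0.0]])
pos, neg = difference_signals(corr, prev)
fused_near = naive_fuse(corr, prev, frame_dist=0, max_dist=10)
fused_far = naive_fuse(corr, prev, frame_dist=10, max_dist=10)
```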

BL30K Dataset

The authors contribute a synthetic VOS dataset (BL30K) containing 4.8 M frames with pixel‑level annotations, the largest publicly available VOS dataset to date.

Experiments

Extensive ablation studies on the DAVIS 2020 interactive segmentation track demonstrate that each module contributes positively to overall performance. MiVOS achieves higher segmentation quality with fewer interaction rounds than current SOTA methods, as shown in quantitative tables and qualitative visual comparisons.

Conclusion

MiVOS presents a simple, effective, and highly generalizable modular framework for interactive video segmentation, supported by a new large‑scale synthetic dataset, and demonstrates state‑of‑the‑art performance with reduced user effort.

computer vision · deep learning · video segmentation · interactive segmentation · modular network
Written by

Kuaishou Audio & Video Technology

Explore the stories behind Kuaishou's audio and video technology.
