Kuaishou's Custom Video Matting Solution: Interactive Object Segmentation for Mobile Creators
Kuaishou's audio‑video technology team presents an in-house custom video matting system that combines foreground, interactive, and video object segmentation to let creators extract arbitrary subjects without green screens, featuring adaptive cropping, multi‑stage training, and deployment across Android and iOS devices.
Kuaishou's audio‑video technology team has built a custom video matting solution that enables creators to precisely extract any target object from images or videos without requiring green‑screen setups or professional Photoshop skills. The system offers both automatic and manual modes within the Kuaishou App, supports a wide range of subjects (people, pets, food, plants, etc.), and has resulted in over ten patents as well as papers presented at CVPR 2021 and NeurIPS 2021.
Traditional post‑production tools and green‑screen techniques are inaccessible to casual creators, and existing algorithms are limited to specific backgrounds or human subjects. To address these pain points, Kuaishou developed a three‑branch algorithm suite—foreground segmentation, interactive segmentation, and video object segmentation—operating in two modes: single‑frame image mode and video‑frame mode (see Figure 1).
In single‑frame image mode, users select a template frame, after which a foreground segmentation model generates an initial mask. Users can then refine this mask through interactive segmentation, which incorporates positive and negative scribbles to correct missing or over‑segmented regions.
The foreground segmentation component can isolate any foreground object, as illustrated by the results in Figure 2. To overcome limited benchmark datasets, Kuaishou collected large‑scale, diverse data and employed semi‑automatic pseudo‑label generation, model ensembling, and extensive data augmentation to improve mask quality.
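The article mentions semi-automatic pseudo-label generation via model ensembling but gives no implementation details. As one plausible sketch (the fusion rule, thresholds, and the "ignore" convention below are assumptions, not Kuaishou's published method), several models' probability maps can be averaged and only confident pixels kept as training labels:

```python
import numpy as np

def ensemble_pseudo_label(prob_maps, keep_thresh=0.8, drop_thresh=0.2):
    """Fuse per-model foreground probability maps into a pseudo-label.

    Illustrative sketch: pixels where the averaged probability is
    confidently foreground (>= keep_thresh) become 1, confidently
    background (<= drop_thresh) become 0, and everything in between is
    marked 255 ("ignore") so uncertain regions don't pollute training.
    """
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    label = np.full(mean_prob.shape, 255, dtype=np.uint8)  # ignore by default
    label[mean_prob >= keep_thresh] = 1                    # confident foreground
    label[mean_prob <= drop_thresh] = 0                    # confident background
    return label
```

Thresholding the ensemble mean rather than any single model's output is what lets noisy individual predictions still yield clean large-scale labels.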
Interactive segmentation is driven by user scribbles (positive and negative). Each interaction triggers an adaptive local dynamic crop around the scribbled region, allowing the CNN to process only the relevant area and making each round independent, which greatly improves fault tolerance. The process of generating positive/negative scribble masks and the adaptive cropping are shown in Figures 5 and 6, while the inference pipeline is depicted in Figure 4.
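The adaptive local dynamic crop can be sketched as a bounding box around the scribbled pixels plus a proportional margin, clamped to the frame. The margin ratio, minimum size, and function shape below are illustrative assumptions, not the article's exact cropping rule:

```python
import numpy as np

def adaptive_crop(scribble_mask, img_h, img_w, margin_ratio=0.4, min_size=64):
    """Compute a crop window around the user's scribble.

    The crop covers the scribbled region plus a margin proportional to
    its extent (with a floor so tiny scribbles still get context),
    clamped to the image bounds, so the network only processes the
    area the user is currently correcting.
    """
    ys, xs = np.nonzero(scribble_mask)
    if len(ys) == 0:
        return 0, 0, img_h, img_w  # no scribble: fall back to the full frame
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    mh = max(int((y1 - y0 + 1) * margin_ratio), min_size // 2)
    mw = max(int((x1 - x0 + 1) * margin_ratio), min_size // 2)
    top = max(y0 - mh, 0)
    left = max(x0 - mw, 0)
    bottom = min(y1 + mh + 1, img_h)
    right = min(x1 + mw + 1, img_w)
    return top, left, bottom, right
```

Because each crop is derived only from the current scribble, every correction round is independent of earlier ones, which is what gives the fault tolerance described above.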
Video‑frame mode extends the approach to continuous video streams. After obtaining a template mask, a video object segmentation model propagates the mask across subsequent frames, allowing users to iteratively select new template frames and refine results. The core challenge is accurate, stable segmentation across devices, addressed by a memory‑bank architecture with spatio‑temporal attention (Figure 7) and an auxiliary decoder used only during training.
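The core memory-bank operation is an attention read: the current frame's pixels query keys stored from past frames and aggregate the associated mask features. The minimal sketch below (plain NumPy, scaled dot-product softmax; shapes and scaling are assumptions, not the production architecture) shows that readout step:

```python
import numpy as np

def memory_read(query_key, mem_keys, mem_values):
    """Attention-based read from a spatio-temporal memory bank.

    query_key:  (C, N) key features of the current frame's N pixels.
    mem_keys:   (C, M) keys of M memorised pixels from past frames.
    mem_values: (D, M) mask/value features stored with those keys.

    Returns a (D, N) readout: each query pixel aggregates memory values
    weighted by softmax similarity to all memorised pixels.
    """
    sim = mem_keys.T @ query_key             # (M, N) dot-product affinity
    sim = sim / np.sqrt(query_key.shape[0])  # scale for numerical stability
    sim = sim - sim.max(axis=0, keepdims=True)
    w = np.exp(sim)
    w = w / w.sum(axis=0, keepdims=True)     # softmax over memory entries
    return mem_values @ w                    # (D, N) weighted readout
```

Because memory entries come from multiple past frames, the same read handles both short-term motion and reappearance after occlusion.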
Training follows a two‑stage strategy: first, static images are transformed (affine, color jitter, rotation, etc.) to synthesize pseudo‑video frames (Figures 8‑9); then, the model is fine‑tuned on a small set of manually annotated video data. This leverages abundant image‑segmentation data to boost performance while reducing development time.
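The first stage can be illustrated by synthesizing a short clip from one annotated image. The sketch below uses only random translations via `np.roll` as a stand-in for the full affine/rotation/color-jitter pipeline described above; the function and its parameters are illustrative assumptions:

```python
import random
import numpy as np

def synthesize_pseudo_clip(image, mask, n_frames=3, max_shift=10, seed=None):
    """Turn one annotated image into a short pseudo-video clip.

    Each generated frame applies the same small random translation to
    the image and its mask, giving the propagation model consistent
    image-mask "motion" without any real video annotation. Real
    pipelines would add affine warps, rotation, and colour jitter.
    """
    rng = random.Random(seed)
    frames, masks = [image], [mask]
    for _ in range(n_frames - 1):
        dy = rng.randint(-max_shift, max_shift)
        dx = rng.randint(-max_shift, max_shift)
        frames.append(np.roll(image, (dy, dx), axis=(0, 1)))
        masks.append(np.roll(mask, (dy, dx), axis=(0, 1)))
    return frames, masks
```

Pre-training on such clips and then fine-tuning on a small annotated video set is what lets abundant image-segmentation data do most of the work.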
To ensure temporal stability, a lightweight frame‑to‑frame fusion module and post‑processing smoothing are applied, reducing flicker without adding significant computational overhead (detailed in the team's CVPR 2021 paper).
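One simple stand-in for such frame-to-frame fusion is an exponential moving average over per-frame mask probabilities. This is an assumption for illustration, not the paper's actual fusion module:

```python
import numpy as np

def smooth_masks(mask_probs, momentum=0.7):
    """Exponential moving average over per-frame mask probabilities.

    Each output blends the current prediction with the smoothed
    history, damping single-frame flicker at negligible cost. The
    momentum value here is illustrative.
    """
    smoothed = []
    prev = None
    for p in mask_probs:
        prev = p if prev is None else momentum * prev + (1 - momentum) * p
        smoothed.append(prev)
    return smoothed
```

The trade-off is a slight lag on fast motion, which is why a learned fusion module outperforms a fixed momentum in practice.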
The solution has been deployed on both Android and iOS platforms, with twelve model tiers tailored to different hardware capabilities to guarantee efficient performance and high visual quality across devices.
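Tier selection at runtime might look like the sketch below: probe the device, then pick the heaviest model it can run. The tier names, thresholds, and selection logic are all hypothetical, not Kuaishou's actual tiering policy:

```python
def select_model_tier(gpu_score, ram_mb, tiers):
    """Pick the heaviest model tier a device can run.

    tiers: list of (name, min_gpu_score, min_ram_mb) tuples, ordered
    from the highest-quality tier down to the lightest fallback.
    All thresholds are illustrative placeholders.
    """
    for name, min_gpu, min_ram in tiers:
        if gpu_score >= min_gpu and ram_mb >= min_ram:
            return name
    return tiers[-1][0]  # always fall back to the lightest tier
```

A usage sketch: `select_model_tier(90, 8000, [("large", 80, 6000), ("medium", 50, 4000), ("small", 0, 0)])` returns the `"large"` tier, while a low-end device falls through to `"small"`.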
Users can try the feature by opening the Kuaishou App, selecting "Start Editing," choosing "Intelligent Matting," and then opting for the manual mode to perform custom matting.
References: MiVOS – Modular Interactive Video Object Segmentation (CVPR 2021) and related NeurIPS 2021 work.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.