Artificial Intelligence 10 min read

How Deep Learning Revolutionizes Video Matting: A Two‑Stage Framework

This article introduces a pioneering two‑stage deep video matting framework that propagates sparse Trimap annotations across frames using cross‑attention, aggregates spatio‑temporal features via an ST‑FAM module, and demonstrates superior performance on synthetic and real HD video datasets.

Kuaishou Audio & Video Technology

01 Background

Image matting is a key technique in image and video processing, widely used in photo/video editing and film production. Traditional methods rely on low‑level color cues, which limits their performance in complex scenes. Recent advances in deep learning enable extraction of high‑level semantic features, dramatically improving matting quality and making deep‑learning‑based matting the mainstream approach. The surge of short‑video platforms has further driven demand for high‑quality video matting.

02 Problem

Video matting faces additional challenges compared to image matting: lack of large‑scale video matting datasets, the impracticality of providing a Trimap for every frame, and the need to maintain temporal consistency across frames to avoid flickering.

03 Solution

To address these issues, Kuaishou and the Hong Kong University of Science and Technology proposed the first deep‑learning‑based video matting framework. It operates in two stages: (1) Trimap propagation, which spreads a few manually annotated key‑frame Trimaps to the whole video using cross‑attention, and (2) spatio‑temporal feature aggregation (ST‑FAM), which produces high‑quality alpha mattes without computing optical flow.
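The two‑stage flow can be sketched at a high level. The snippet below is a minimal NumPy mock‑up, not the authors' implementation: `stage1_trimap` stands in for the cross‑attention propagation network, and `stage2_alpha` stands in for ST‑FAM, replaced here by a crude temporal mean over a small window. All function names and the 0/1/2 trimap encoding are illustrative assumptions.

```python
import numpy as np

def stage1_trimap(key_trimap, frame):
    """Stage 1 stub: the real framework aligns an annotated key frame to
    `frame` with cross-attention and decodes a 3-class trimap; here we
    simply reuse the key-frame trimap (0=background, 1=unknown, 2=foreground)."""
    return key_trimap

def stage2_alpha(frames, trimap, t):
    """Stage 2 stub: the real ST-FAM fuses aligned neighbor-frame features;
    here a crude temporal/channel mean fills the unknown region."""
    window = np.stack(frames[max(0, t - 1): t + 2])      # (k, H, W, 3)
    gray = window.mean(axis=(0, 3))                      # (H, W), values in [0, 1]
    return np.where(trimap == 2, 1.0,                    # known foreground -> 1
                    np.where(trimap == 0, 0.0, gray))    # known background -> 0

# hypothetical driver: one annotated key frame, trimaps carried to every frame
frames = [np.random.rand(8, 8, 3) for _ in range(4)]
key_trimap = np.ones((8, 8), dtype=int)                  # all unknown ...
key_trimap[2:6, 2:6] = 2                                 # ... except a foreground block
alphas = [stage2_alpha(frames, stage1_trimap(key_trimap, f), t)
          for t, f in enumerate(frames)]
```

The point of the structure is the division of labor: annotation effort is paid once per key frame in stage 1, and temporal consistency is the responsibility of stage 2.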

Trimap Propagation

Traditional Trimap propagation relies on optical flow, which struggles with fine structures and transparent regions. The proposed method instead uses a cross‑attention mechanism to align features between a reference frame (with a Trimap) and each target frame, automatically generating Trimaps for all frames and greatly reducing annotation cost.

Two weight‑sharing encoders extract semantic features from the reference frame F_r and the target frame F_t.

A cross‑attention network computes pixel‑wise similarity between the frames; foreground pixels in the target frame are matched to foreground pixels in the reference frame, producing aligned features.

A decoder reconstructs the aligned features and performs three‑class classification to output the final Trimap.

This approach works even when only the first frame’s Trimap is provided, and it scales to scenes with large foreground motion while avoiding the limitations of optical‑flow‑based methods.

Spatio‑Temporal Feature Aggregation Module (ST‑FAM)

In the second stage, the framework fuses multi‑scale spatial features and neighboring‑frame temporal information to enhance the target frame’s representation. The ST‑FAM consists of two sub‑modules: Temporal Feature Alignment (TFA) and Temporal Feature Fusion (TFF).

TFA Module

The TFA module aligns features of adjacent frames. For each pixel in the feature map at time t, it predicts a displacement vector (motion offset) and uses deformable convolution to warp the neighboring frame's features onto the current frame, achieving automatic temporal alignment.
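The core idea, warping neighbor features along predicted per‑pixel offsets, can be sketched without the deformable‑convolution machinery. The NumPy function below is a simplified stand‑in (deformable convolution additionally learns offsets for each kernel sample point); offsets are assumed to come from a small prediction network not shown here.

```python
import numpy as np

def warp_with_offsets(neighbor_feat, offsets):
    """Warp a neighboring frame's features onto the current frame using
    per-pixel displacement vectors (a simplified stand-in for the
    deformable-convolution sampling used by the TFA module).

    neighbor_feat: (H, W, C) feature map of the adjacent frame
    offsets:       (H, W, 2) predicted (dy, dx) motion offsets
    """
    H, W, C = neighbor_feat.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    # displaced sampling locations, clamped to the feature map
    sy = np.clip(ys + offsets[..., 0], 0, H - 1)
    sx = np.clip(xs + offsets[..., 1], 0, W - 1)
    y0 = np.floor(sy).astype(int); x0 = np.floor(sx).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = (sy - y0)[..., None]; wx = (sx - x0)[..., None]
    # bilinear interpolation of the four neighbouring feature vectors
    return ((1 - wy) * (1 - wx) * neighbor_feat[y0, x0]
            + (1 - wy) * wx * neighbor_feat[y0, x1]
            + wy * (1 - wx) * neighbor_feat[y1, x0]
            + wy * wx * neighbor_feat[y1, x1])
```

With zero offsets the warp is the identity, which makes the learning target intuitive: the offset network only has to explain the motion between frames.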

TFF Module

After alignment, the fused features may still contain noise. To suppress it, the TFF module applies attention: channel attention (via global average pooling) weights the useful channels, and spatial attention then enhances pixel‑wise interactions, reducing interference from irrelevant information.
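The channel‑then‑spatial gating can be illustrated in a few lines. This is a generic squeeze‑and‑excite‑style sketch, not the paper's exact layers: the projection matrix `w` stands in for the learned excitation weights, and the mean+max channel descriptor in `spatial_attention` is an assumed design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w):
    """Channel attention: squeeze spatial dims with global average pooling,
    project through a (hypothetical) learned matrix w: (C, C), and gate
    each channel with a sigmoid weight."""
    pooled = feat.mean(axis=(0, 1))           # (C,) global average pooling
    gate = sigmoid(pooled @ w)                # (C,) per-channel weights in (0, 1)
    return feat * gate                        # reweight channels

def spatial_attention(feat):
    """Spatial attention: pool across channels, derive a per-pixel weight,
    and gate every spatial location."""
    desc = feat.mean(axis=-1) + feat.max(axis=-1)   # (H, W) channel descriptor
    gate = sigmoid(desc)[..., None]                 # (H, W, 1) per-pixel weights
    return feat * gate
```

Since both gates lie in (0, 1), the module can only attenuate features, which is exactly the intended effect: aligned‑but‑noisy responses are damped rather than amplified.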

04 Experiments and Comparison

The authors evaluate the method on a large synthetic video matting dataset and on real high‑definition videos. Quantitative results show the algorithm outperforms existing methods, and qualitative examples demonstrate stable performance on real footage.

05 Related Links

Qiqi Hou and Feng Liu, “Context‑aware image matting for simultaneous foreground and alpha estimation,” ICCV 2019.

Yaoyi Li and Hongtao Lu, “Natural image matting via guided contextual attention,” AAAI 2020.

Hao Lu et al., “Indices matter: Learning to index for deep image matting,” ICCV 2019.

Ning Xu et al., “Deep image matting,” CVPR 2017.

Yunke Zhang et al., “A late fusion CNN for digital matting,” CVPR 2019.

Tags: computer vision, deep learning, video matting, spatio‑temporal fusion, trimap propagation