How DeViT Revolutionizes Video Inpainting with Deformed Vision Transformers

This article introduces DeViT, a Deformed Vision Transformer framework for video inpainting. It combines a deformed patch homography estimator, mask‑pruned patch attention, and a spatio‑temporal weight adapter, and reports state‑of‑the‑art results on benchmark datasets, pointing to its potential for advanced video editing tools.

Kuaishou Audio & Video Technology

Background

Video inpainting is essential for tasks such as object removal, damage restoration, and AR integration, yet it remains less explored than image inpainting. The ACM MM 2022 conference accepted a paper titled “DeViT: Deformed Vision Transformers in Video Inpainting,” presenting a new Transformer‑based approach.

Method

The proposed DeViT framework consists of three key components:

Deformed Patch Homography Estimator (DePtH): A deformable patch homography estimator that learns offset vectors without extra supervision, enabling precise alignment of patch features under challenging motion and deformation.

Mask‑Pruned Patch Attention (MPPA): Uses mask‑guided pruning to reduce the influence of invalid pixels, improving attention matching between deformed patches.

Spatio‑Temporal Weight Adapter (STA): Dynamically allocates attention weights between spatial and temporal information according to the video's motion type, improving performance across diverse video scenarios.
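To make the idea of mask‑guided pruning more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `mask_pruned_attention`, the per‑patch `valid_ratio` input, and the `min_valid` threshold of 0.5 are all assumptions made for illustration. The sketch simply drops key patches that are mostly covered by the inpainting mask before running ordinary scaled dot‑product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_pruned_attention(query, keys, values, valid_ratio, min_valid=0.5):
    """Attend from one query patch to key patches, skipping mostly-invalid ones.

    valid_ratio[i] is the fraction of known (unmasked) pixels in key patch i.
    Both this input and the min_valid threshold are hypothetical choices,
    not details taken from the DeViT paper.
    """
    # Prune: keep only key patches with enough valid pixels.
    kept = [i for i, r in enumerate(valid_ratio) if r >= min_valid]
    if not kept:
        raise ValueError("no valid key patches to attend to")

    # Scaled dot-product scores against the surviving keys only.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in kept]
    weights = softmax(scores)

    # Weighted sum of the surviving value vectors.
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)]
    return out, dict(zip(kept, weights))


# Usage: patch 1 is almost entirely masked, so it receives no attention weight.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [100.0], [3.0]]
valid_ratio = [1.0, 0.1, 0.9]
out, weights = mask_pruned_attention(query, keys, values, valid_ratio)
```

Pruning before the softmax, rather than down‑weighting afterwards, means a heavily masked patch cannot leak into the output even when its key happens to match the query well, as patch 1 does here.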

DeViT algorithm framework

Training Objective

The model is optimized with a pixel‑wise reconstruction loss combined with an adversarial loss to improve visual fidelity.
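As a rough illustration of how such a combined objective is typically assembled, the sketch below adds an L1 reconstruction term to a generator‑side adversarial term. The summary does not specify the exact loss forms or weighting, so the L1 choice, the hinge‑style generator term, and the `adv_weight` value of 0.01 are assumptions, and all function names are hypothetical.

```python
def l1_reconstruction_loss(pred, target):
    """Mean absolute error between predicted and ground-truth pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def generator_adversarial_loss(disc_scores_on_fake):
    """Hinge-style generator term: the negated mean discriminator score.

    The generator lowers this loss by making the discriminator score its
    outputs highly. The hinge form is an assumption, not from the source.
    """
    return -sum(disc_scores_on_fake) / len(disc_scores_on_fake)


def total_loss(pred, target, disc_scores_on_fake, adv_weight=0.01):
    """Reconstruction loss plus a weighted adversarial loss.

    adv_weight is a hypothetical balancing factor chosen for illustration.
    """
    rec = l1_reconstruction_loss(pred, target)
    adv = generator_adversarial_loss(disc_scores_on_fake)
    return rec + adv_weight * adv


# Usage with toy values: two pixels and two discriminator scores.
loss = total_loss(pred=[0.5, 0.5], target=[0.0, 1.0], disc_scores_on_fake=[1.0, 3.0])
```

Keeping the adversarial weight small relative to the reconstruction term is a common design choice: the pixel‑wise loss anchors the output to the ground truth, while the adversarial term nudges it toward sharper, more realistic texture.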

Experimental Results

DeViT was evaluated on two large public multimedia datasets across four metrics, outperforming current state‑of‑the‑art methods. Experiments categorized videos by motion type (static, slow translation, complex deformation) and demonstrated the effectiveness of DePtH, MPPA, and STA in each scenario.

Mask‑pruned attention module (MPPA)

Conclusion

DeViT provides a comprehensive framework for precise patch alignment and matching in video tasks, integrating DePtH for deformation handling, MPPA for refined attention, and STA for adaptive spatio‑temporal weighting, achieving superior quantitative and qualitative results in video inpainting.
