VideoPainter: Plug‑and‑Play Video Inpainting and Editing, State of the Art Across 8 Metrics
VideoPainter introduces a plug‑and‑play dual‑branch framework for video inpainting and editing, featuring a lightweight context encoder, an ID‑consistent resampling strategy, and the large‑scale VPData dataset with its companion VPBench benchmark, and achieves state‑of‑the‑art results across eight quantitative metrics.
Overview
VideoPainter is the first plug‑and‑play video inpainting framework that supports arbitrary‑length video repair and editing with background control. It combines a lightweight context encoder with a dual‑branch architecture to decouple background preservation from foreground generation, and introduces an ID‑resampling strategy to maintain identity consistency over long videos.
Problem Statement
Existing methods struggle with fully masked targets.
Balancing background retention and foreground synthesis is difficult.
Long‑duration videos often fail to keep an inpainted object's identity consistent across clips.
Proposed Solution
VideoPainter framework: a dual‑branch design where one branch handles background control and the other generates foreground content.
Lightweight context encoder: occupies only 6% of the backbone's parameters and injects dense background cues into any pretrained video diffusion Transformer (DiT).
ID‑resampling adapter: a LoRA‑based module that resamples ID tokens from previous clips to enforce temporal identity consistency.
VPData and VPBench: a scalable data pipeline that yields the largest video‑inpainting dataset (≈390K clips, >866.7 h) with precise mask annotations and dense textual descriptions.
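The dual‑branch split can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions: the module names, the two‑layer depth, and the way features are returned per layer are illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Lightweight background branch: a shallow transformer stack,
    an illustrative stand-in for the paper's ~6%-of-backbone encoder."""
    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, tokens: torch.Tensor) -> list:
        feats = []
        for layer in self.layers:
            tokens = layer(tokens)
            feats.append(tokens)  # one background feature map per layer
        return feats
```

During denoising, the frozen DiT backbone consumes the noisy latent (foreground branch) while these per‑layer background features are injected into its hidden states, so background preservation and foreground generation stay decoupled.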
Technical Details
The context encoder takes as input the concatenation of the noise latent, the masked video latent, and the down‑sampled mask, aligned with the pretrained DiT's latent space. Feature integration follows a two‑stage scheme: features from the first encoder layer augment the early DiT layers and those from the second augment the later layers, while a token‑selection mask filters out foreground tokens so that background cues never contaminate the generated foreground.
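The token‑selection step can be illustrated with a small helper. This is a sketch; the tensor layout and the simple additive fusion are assumptions made for clarity.

```python
import torch

def fuse_background_features(dit_hidden: torch.Tensor,
                             enc_feat: torch.Tensor,
                             fg_mask: torch.Tensor) -> torch.Tensor:
    """Add encoder features to DiT hidden states only at background tokens.

    dit_hidden, enc_feat: (B, N, D) token features
    fg_mask: (B, N) bool, True where a token lies inside the masked
             (foreground) region; those tokens receive no background cue,
             avoiding foreground/background confusion.
    """
    keep = (~fg_mask).unsqueeze(-1).to(enc_feat.dtype)  # (B, N, 1)
    return dit_hidden + keep * enc_feat
```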
For ID‑resampling, during training the DiT and encoder are frozen; a trainable LoRA adapter injects ID tokens into the KV cache of the DiT, ensuring that the same object ID persists across clip boundaries.
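A rough sketch of the resampling idea follows; the LoRA projection and the flat KV‑cache layout here are simplified assumptions, not the paper's exact interfaces.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank residual projection; zero-initialised up-projection
    means it starts as the identity and is trained from there."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))

def resample_id_tokens(kv_cache: dict, prev_tokens: torch.Tensor,
                       fg_mask: torch.Tensor, adapter: LoRAAdapter) -> dict:
    """Carry foreground (ID) tokens from the previous clip into the
    current clip's key/value streams so object identity persists
    across clip boundaries."""
    id_tokens = adapter(prev_tokens[fg_mask])  # (M, D) foreground tokens
    return {
        "k": torch.cat([id_tokens, kv_cache["k"]], dim=0),
        "v": torch.cat([id_tokens, kv_cache["v"]], dim=0),
    }
```

Because only the adapter is trainable, the frozen DiT and encoder keep their pretrained behavior while the KV streams gain identity context.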
Plug‑and‑Play Control
The framework works with any DiT backbone, supports LoRA‑style adapters, and is compatible with both text‑to‑video (T2V) and image‑to‑video (I2V) pipelines. When using an I2V backbone, an additional step generates an initial frame with an image‑inpainting model guided by the masked region's textual description.
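That extra I2V step amounts to a thin wrapper around two models. In this sketch, `image_inpainter` and `i2v_backbone` are hypothetical callables standing in for the real components:

```python
def inpaint_with_i2v(frames, masks, prompt, image_inpainter, i2v_backbone):
    """For an I2V backbone: first repair frame 0 with an image
    inpainting model guided by the masked region's description,
    then let the video model propagate the result through the clip.
    `image_inpainter` and `i2v_backbone` are hypothetical callables."""
    first_frame = image_inpainter(frames[0], masks[0], prompt)
    return i2v_backbone(first_frame, frames, masks, prompt)
```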
Experiments
Training uses VPData at 480×720 resolution with the AdamW optimizer (learning rate 1e‑4, batch size 1) on 64 NVIDIA V100 GPUs: 80,000 steps for the context encoder and 2,000 for the ID adapter.
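The reported hyper‑parameters collect into a small config; the values come from the text above, while the dictionary layout is only an illustrative assumption.

```python
# Training hyper-parameters as reported; the structure is illustrative.
TRAIN_CONFIG = {
    "resolution": (480, 720),
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 1,
    "gpus": 64,  # NVIDIA V100s, per the paper
    "steps": {
        "context_encoder": 80_000,
        "id_adapter": 2_000,
    },
}
```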
Benchmarks include DAVIS (random masks) and VPBench (segmentation‑based masks). Eight metrics are evaluated: PSNR, LPIPS, SSIM, MSE, and MAE for masked‑region preservation; CLIP similarity (overall and over the masked region) for text alignment; and FVID for video quality.
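The preservation metrics are computed over the unmasked (background) pixels only. A minimal pure‑Python sketch, assuming pixel values in [0, 1] and flat lists for brevity:

```python
import math

def masked_metrics(pred, gt, mask):
    """PSNR / MSE / MAE over background pixels only (mask == 0).
    pred, gt: flat lists of floats in [0, 1]; mask: 1 inside the hole."""
    diffs = [p - g for p, g, m in zip(pred, gt, mask) if m == 0]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    psnr = 10 * math.log10(1.0 / mse) if mse > 0 else float("inf")
    return {"mse": mse, "mae": mae, "psnr": psnr}
```

Pixels inside the hole are excluded because the model is free to generate new content there; only the background must match the source video.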
Quantitative results show VideoPainter outperforms prior methods (ProPainter, COCOCO, Cog‑Inp) on all metrics for both short and long videos. Qualitative comparisons highlight superior temporal consistency, background fidelity, and text‑prompt alignment.
Human studies with 30 participants on 50 randomly selected cases confirm significant preference for VideoPainter in background retention, text alignment, and overall video quality.
Ablation Study
Two‑layer context encoder offers the best trade‑off between performance and efficiency.
Mask‑selective feature fusion prevents foreground‑background token confusion.
Plug‑and‑play control works across different backbones with comparable performance.
ID‑resampling is crucial for long‑video identity consistency, as demonstrated by rows 7‑8 in the ablation table.
Discussion
VideoPainter’s strengths lie in its modular plug‑and‑play design, efficient context encoding, and robust ID‑consistent handling, enabling state‑of‑the‑art video repair and editing. Limitations include dependence on the quality of the underlying diffusion model and degraded performance on low‑quality masks or mismatched textual descriptions.
References
[1] VideoPainter: Any‑length Video Inpainting and Editing with Plug‑and‑Play Context Control.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.