VideoPainter: Plug‑and‑Play Video Inpainting and Editing with State‑of‑the‑Art Results Across Eight Metrics
VideoPainter introduces a plug‑and‑play dual‑branch framework built around a lightweight context encoder and an ID‑resampling adapter, trained on the large‑scale VPData/VPBench dataset. It achieves state‑of‑the‑art performance across eight video inpainting and editing metrics while supporting flexible backbone integration and long‑video identity consistency.
Overview
VideoPainter is a plug‑and‑play video inpainting and editing framework that supports arbitrary‑length videos through a dual‑branch architecture. It achieves state‑of‑the‑art results on eight evaluation metrics covering video quality, masked‑region preservation, and text‑video alignment.
Key Contributions
Dual‑branch design separates background retention from foreground generation.
Lightweight context encoder (≈6 % of backbone parameters) injects masked‑video features into a pre‑trained diffusion Transformer (DiT).
ID‑adapter (LoRA) resamples identity tokens to preserve object IDs across long clips.
VPData (≈390 K clips, >866.7 h) and VPBench, the largest video inpainting dataset to date, with precise segmentation masks and dense textual descriptions.
Extensive experiments showing superior performance on eight metrics.
Problem
Existing methods fail on fully masked targets.
Balancing background preservation with foreground generation is difficult.
Long‑video processing often loses identity consistency.
Solution
Dual‑branch VideoPainter framework: the background branch preserves existing pixels while the foreground branch generates new content.
Context encoder: processes the noise latent, masked‑video latent, and down‑sampled mask features, then merges them into the first two layers of the DiT.
ID‑adapter: a trainable LoRA module that concatenates masked tokens with KV vectors, forcing the model to resample IDs.
Plug‑and‑play control: compatible with any DiT backbone, for both text‑to‑video (T2V) and image‑to‑video (I2V) pipelines.
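The dual‑branch idea can be sketched in a few lines of PyTorch. The sketch below is illustrative, not the paper's actual implementation: shapes, layer counts, and module names (`ContextEncoderSketch`, `proj_in`, `proj_out`) are assumptions. It shows the core mechanism — a small encoder consumes the noise latent, masked‑video latent, and down‑sampled mask, and emits one residual feature per cloned backbone layer to be added into the first DiT blocks.

```python
import torch
import torch.nn as nn

class ContextEncoderSketch(nn.Module):
    """Minimal sketch of the context-encoder branch. Names and shapes
    are illustrative assumptions, not the paper's implementation."""
    def __init__(self, latent_dim=16, hidden_dim=64, num_clone_layers=2):
        super().__init__()
        # input = noise latent + masked-video latent + down-sampled mask
        self.proj_in = nn.Linear(latent_dim * 2 + 1, hidden_dim)
        # stand-in for a "clone" of the backbone's first two layers (~6% of params)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
            for _ in range(num_clone_layers)
        )
        self.proj_out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, noise_latent, masked_latent, mask):
        # token sequences: (B, N, latent_dim), (B, N, latent_dim), (B, N, 1)
        x = torch.cat([noise_latent, masked_latent, mask], dim=-1)
        x = self.proj_in(x)
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(self.proj_out(x))  # one residual feature per backbone layer
        return feats  # to be added to the hidden states of the first DiT blocks

enc = ContextEncoderSketch()
feats = enc(torch.randn(1, 8, 16), torch.randn(1, 8, 16), torch.ones(1, 8, 1))
print(len(feats), feats[0].shape)  # 2 torch.Size([1, 8, 64])
```

Because the backbone stays frozen and only this small encoder is trained, the same conditioning can be attached to different DiT checkpoints, which is what makes the control plug‑and‑play.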
Dataset Construction (VPData & VPBench)
The pipeline consists of five steps:
Collection : videos sourced from Videvo and Pexels (~450 K raw videos).
Annotation : Recognize Anything Model for open‑set labeling, Grounding DINO for bounding‑box detection, and SAM2 for high‑quality masks.
Segmentation : PySceneDetect identifies scene changes; clips are split into 10‑second segments and those shorter than 6 s are discarded.
Selection : filters based on Laion‑Aesthetic score, motion intensity measured by RAFT optical flow, and safety checked by Stable Diffusion Safety Checker.
Description : CogVLM2 and GPT‑4o generate dense video‑level and masked‑object captions on sampled keyframes.
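The selection step above boils down to a few per‑clip filters. The following sketch uses hypothetical threshold values (`min_aesthetic`, `min_flow`, the 6–10 s length window comes from the segmentation step); the actual cutoffs used for VPData are not stated here.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    duration_s: float   # clip length after scene splitting
    aesthetic: float    # Laion-Aesthetic-style score
    mean_flow: float    # mean optical-flow magnitude (e.g. from RAFT)
    nsfw: bool          # flag from a safety checker

def keep_clip(c, min_len=6.0, max_len=10.0, min_aesthetic=4.5, min_flow=0.5):
    """Illustrative selection filter: drop clips that are too short or long,
    low-aesthetic, nearly static, or flagged unsafe. Thresholds are assumed."""
    return (min_len <= c.duration_s <= max_len
            and c.aesthetic >= min_aesthetic
            and c.mean_flow >= min_flow
            and not c.nsfw)

clips = [Clip(8.0, 5.2, 1.3, False),   # passes all filters
         Clip(4.0, 6.0, 2.0, False),   # too short
         Clip(9.5, 3.0, 1.0, False)]   # aesthetic score too low
selected = [c for c in clips if keep_clip(c)]
print(len(selected))  # 1
```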
Dual‑Branch Architecture
The context encoder concatenates the noise latent, masked‑video latent, and down‑sampled mask features, aligning their dimensions for injection into the DiT. Only the first two DiT layers are cloned to form the encoder (≈6 % of backbone parameters), and their outputs are fused back into the backbone in groups. Token‑selective fusion filters out foreground tokens, so only unambiguous background features guide generation.
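Token‑selective fusion is simple to express: given a binary token mask, only background positions contribute encoder features. This sketch assumes a per‑token mask already aligned with the latent sequence; the function name is hypothetical.

```python
import torch

def select_background_tokens(encoder_feats, token_mask):
    """Sketch of token-selective fusion: tokens marked as background
    (mask == 1) pass through; masked foreground positions contribute
    zero, so the backbone receives no ambiguous guidance there."""
    return encoder_feats * token_mask.unsqueeze(-1)

feats = torch.randn(1, 6, 32)  # (batch, tokens, dim) encoder output
token_mask = torch.tensor([[1, 1, 0, 0, 1, 1]], dtype=feats.dtype)
fused = select_background_tokens(feats, token_mask)
print(bool((fused[0, 2] == 0).all()))  # foreground token zeroed -> True
```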
ID Resampling for Long Videos
To ensure smooth transitions and identity consistency, VideoPainter generates overlapping clips and blends them by weighted averaging (inspired by AVID). During training, a frozen DiT is augmented with a LoRA‑based ID‑resampling adapter: masked tokens are concatenated with the KV vectors, forcing the model to resample IDs. At inference, masked tokens from the previous clip are concatenated with the current clip's KV vectors, preserving ID continuity.
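The overlapping‑generation blend can be sketched as an AVID‑style linear cross‑fade over the frames shared by two consecutive clips. The ramp shape below (a simple linear schedule) is an assumption; the paper's exact weighting may differ.

```python
import numpy as np

def blend_overlap(prev_tail, next_head):
    """Cross-fade the overlapping frames of two consecutively generated
    clips. Arrays are (frames, H, W, C); weights ramp linearly from the
    previous clip (w=1) to the next clip (w=0) across the overlap."""
    n = prev_tail.shape[0]
    w = np.linspace(1.0, 0.0, n).reshape(n, 1, 1, 1)
    return w * prev_tail + (1.0 - w) * next_head

prev_tail = np.ones((4, 2, 2, 3))    # last 4 frames of clip k
next_head = np.zeros((4, 2, 2, 3))   # first 4 frames of clip k+1
blended = blend_overlap(prev_tail, next_head)
print(blended[0, 0, 0, 0], blended[-1, 0, 0, 0])  # 1.0 0.0
```

The first blended frame matches the previous clip exactly and the last matches the new clip, so no seam is visible at the clip boundary.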
Plug‑and‑Play Control
The framework works with any stylized DiT backbone or LoRA and supports both T2V and I2V. For I2V, an initial frame is generated by any image‑inpainting model guided by the masked region’s text description, then used as the first masked video frame.
Experiments
Implementation Details
VideoPainter builds on the pre‑trained CogVideoX‑5B‑I2V diffusion Transformer. Training uses VPData at 480×720 resolution with batch size 1 and the AdamW optimizer: 80 k steps for the context encoder and 2 k steps for the ID‑adapter, on 64 NVIDIA V100 GPUs.
Benchmarks
Evaluation uses DAVIS (random masks) and VPBench (segmentation‑based masks) for video inpainting, plus the VPBench editing tasks (add, remove, replace, change). VPBench contains 100 six‑second videos for standard inpainting and 16 longer videos (>30 s) for long‑video inpainting.
Evaluation Metrics
Masked‑region preservation: PSNR, LPIPS, SSIM, MSE, MAE.
Text alignment: CLIP similarity (overall and masked region).
Video quality: Fréchet Video Distance (FVD).
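Masked‑region preservation metrics restrict the error to background pixels. A minimal sketch of masked PSNR, assuming float images in [0, peak] and a binary mask where 1 marks the background to be preserved (the function name is hypothetical):

```python
import numpy as np

def masked_psnr(pred, target, mask, peak=1.0):
    """PSNR computed only over background pixels (mask == 1), which is
    what background-preservation metrics measure."""
    diff2 = (pred - target) ** 2
    mse = diff2[mask == 1].mean()
    return 10.0 * np.log10(peak ** 2 / mse)

pred = np.full((4, 4), 0.5)
target = np.full((4, 4), 0.6)
mask = np.ones((4, 4), dtype=np.int64)   # all pixels count as background here
print(round(masked_psnr(pred, target, mask), 2))  # 20.0
```

The same masking idea applies to MSE, MAE, SSIM, and LPIPS; CLIP similarity is instead computed on crops of the masked region for local text alignment.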
Quantitative Comparison
VideoPainter outperforms ProPainter, COCOCO, and the strong baseline Cog‑Inp on all eight metrics across both VPBench and DAVIS. The dual‑branch design decouples background preservation from foreground generation, eliminating the trade‑off that harms single‑branch methods.
Qualitative Comparison
VideoPainter shows superior consistency, quality, and text alignment. Competing methods either fail on fully masked targets (ProPainter) or produce inconsistent IDs and artifacts (COCOCO, Cog‑Inp).
Video Editing
VideoPainter edits videos by generating new target descriptions with a vision‑language model and applying the same inpainting pipeline. It surpasses UniEdit, DiTCtrl, and even the end‑to‑end ReVideo on both standard and long‑video editing benchmarks.
Human Evaluation
30 participants rated 50 randomly selected cases on background retention, text alignment, and video quality. VideoPainter achieved significantly higher preference rates across all criteria.
Ablation Study
Removing the dual‑branch design, reducing the encoder to a single layer, disabling token‑selective fusion, or omitting ID‑resampling each degrades performance. The two‑layer encoder offers the best trade‑off between accuracy and efficiency; token‑selective fusion prevents foreground‑background confusion; ID‑resampling is essential for long‑video consistency.
Plug‑and‑Play Capability
VideoPainter integrates a community‑developed Gromit‑style LoRA despite a domain gap, confirming its flexibility to adapt to different base models for specific inpainting tasks.
Discussion
VideoPainter is the first plug‑and‑play dual‑branch video inpainting framework, featuring a lightweight context encoder, ID‑resampling for long‑video consistency, and a scalable dataset pipeline that produced VPData and VPBench. Experiments confirm its state‑of‑the‑art performance on eight metrics and its potential for downstream video editing applications.
Limitations
Generation quality depends on the underlying diffusion model; complex physics and motion may be poorly modeled.
Performance degrades on low‑quality masks or misaligned textual descriptions.
References
[1] VideoPainter: Any‑length Video Inpainting and Editing with Plug‑and‑Play Context Control
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing in‑depth technical content, engineering practice, and hands‑on insight.