Fast and Precise: FloED Sets a New State of the Art for Diffusion‑Based Video Restoration

FloED introduces a dual‑branch, flow‑guided diffusion framework that dramatically improves spatio‑temporal consistency and computational efficiency for video restoration, outperforming existing text‑guided diffusion methods on both object removal and background repair benchmarks.

AIWalker

Highlights

Novel video restoration model: a dedicated dual‑branch architecture integrates flow adapters to enhance spatio‑temporal consistency.

Efficient denoising: a training‑free, optical‑flow‑guided latent‑space interpolation offsets the extra cost introduced by flow computation.

State‑of‑the‑art performance: extensive quantitative and qualitative experiments show FloED surpasses other diffusion‑based methods in both quality and speed.

Problem Statement

Current diffusion‑based video restoration approaches struggle with three issues: (1) insufficient spatio‑temporal consistency, leading to texture and lighting artifacts; (2) high computational cost due to multi‑step denoising plus optical‑flow estimation; and (3) poor adaptability to both background‑repair (BR) and object‑removal (OR) tasks while maintaining text‑alignment.

Proposed Solution

Dual‑branch architecture: a main restoration branch (Stable Diffusion Inpainting backbone) and a time‑independent flow branch that first restores corrupted optical flow and then injects motion information via multi‑scale flow adapters.

Flow adapters: cross‑attention modules inspired by IP‑Adapter that fuse reconstructed flow features into the UNet’s up‑sampling blocks, providing motion guidance without interfering with textual cross‑attention.

Anchor‑frame strategy: an extra high‑quality image‑inpainting model repairs a selected anchor frame, which is concatenated with noisy video frames to supply texture guidance during denoising.

Training‑free latent‑space interpolation: optical‑flow‑guided warping interpolates latent features in early denoising steps, dramatically cutting the number of full denoising passes.

Flow‑attention cache: attention keys/values from the first flow computation are cached and reused in later steps, eliminating redundant flow‑related calculations.
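The caching idea exploits the fact that the flow branch's output is fixed across denoising steps, so the adapters' key/value projections never change. A minimal sketch (class and method names are illustrative, not from the paper's code):

```python
import numpy as np

class FlowAttentionCache:
    """Sketch of a flow-attention cache: the K/V projections of the flow
    features are computed once (first denoising step) and reused, since
    the flow features do not change across steps."""

    def __init__(self):
        self.kv = None  # cached (K, V) pair

    def get_kv(self, flow_feats, w_k, w_v):
        if self.kv is None:                       # first step: compute and cache
            self.kv = (flow_feats @ w_k, flow_feats @ w_v)
        return self.kv                            # later steps: reuse as-is

rng = np.random.default_rng(0)
flow = rng.standard_normal((16, 8))               # 16 flow tokens, dim 8
w_k = rng.standard_normal((8, 8))
w_v = rng.standard_normal((8, 8))

cache = FlowAttentionCache()
k1, v1 = cache.get_kv(flow, w_k, w_v)             # computes projections
k2, v2 = cache.get_kv(flow, w_k, w_v)             # hits the cache
assert k1 is k2 and v1 is v2                      # no recomputation
```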

Method Details

Given a text prompt, an original video sequence, and a binary mask sequence, corrupted frames are obtained via a Hadamard product. The model aims to generate temporally consistent, text‑aligned restorations that blend seamlessly with surrounding context.
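The masking step is just an element‑wise (Hadamard) product between each frame and its binary mask. A small sketch, assuming the convention that mask value 0 marks the region to restore:

```python
import numpy as np

# Corrupting frames with a binary mask via a Hadamard (element-wise) product.
# Assumed convention: mask == 0 marks the region to be restored.
frames = np.random.rand(8, 240, 432, 3)            # (T, H, W, C) video clip
mask = np.ones((8, 240, 432, 1))
mask[:, 60:120, 100:200] = 0                       # region to inpaint

corrupted = frames * mask                          # Hadamard product (broadcast over C)
assert corrupted[:, 60:120, 100:200].max() == 0.0  # masked region is zeroed
```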

Network Overview

FloED uses a pretrained Stable Diffusion Inpainting backbone as the main branch and incorporates an AnimateDiff‑v3 motion module. Training proceeds in two stages: (1) fine‑tune the motion module for video‑repair temporal modeling; (2) add the flow branch, multi‑scale flow adapters, anchor‑frame strategy, and the training‑free acceleration technique.

Flow Branch

The flow branch mirrors the main UNet's architecture but removes the temporal inputs from its ResNet blocks, keeping the flow features time‑independent.

Repaired flow is injected into the main UNet via multi‑scale adapters, providing global motion guidance.

Flow Adapter

Each adapter consists of a cross‑attention layer that receives flow features as additional keys/values.

Placed between textual cross‑attention and the motion module, it dynamically adjusts latent features based on flow priors.
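In IP‑Adapter style, the latent tokens query the flow features as extra keys/values, and the result is added residually so the textual cross‑attention path is untouched. A minimal single‑head sketch (all weights and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def flow_adapter(latent, flow_feats, wq, wk, wv, scale=1.0):
    """Cross-attention sketch: latent tokens attend to flow features
    (as K/V), and the output is added residually so the textual
    cross-attention path is left unchanged. Weights are illustrative."""
    q = latent @ wq                                # queries from latents
    k = flow_feats @ wk                            # keys from flow features
    v = flow_feats @ wv                            # values from flow features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return latent + scale * (attn @ v)             # residual injection

rng = np.random.default_rng(0)
lat = rng.standard_normal((64, 32))                # 64 latent tokens, dim 32
flo = rng.standard_normal((16, 32))                # 16 flow tokens
wq = wk = wv = np.eye(32)

out = flow_adapter(lat, flo, wq, wk, wv)
assert out.shape == lat.shape                      # shape is preserved
```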

Anchor‑Frame Strategy

An image‑inpainting diffusion model first restores a selected anchor frame. The repaired anchor is concatenated with noisy video frames, offering high‑quality texture cues during denoising; the anchor is discarded after the process.
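Mechanically, the repaired anchor latent rides along with the noisy video latents so attention layers can borrow its clean texture, then its slot is dropped. A minimal sketch of the input assembly (shapes and the frame‑axis concatenation are assumptions for illustration):

```python
import numpy as np

# Anchor-frame strategy sketch: a repaired anchor latent is concatenated
# with the noisy video latents along the frame axis so the model can
# draw on its clean texture; the anchor slot is discarded afterward.
noisy = np.random.rand(8, 4, 30, 54)               # (T, C, h, w) noisy latents
anchor = np.random.rand(1, 4, 30, 54)              # repaired anchor latent

denoiser_in = np.concatenate([anchor, noisy], axis=0)  # T + 1 frames
assert denoiser_in.shape[0] == 9

restored = denoiser_in[1:]                          # drop the anchor slot
assert restored.shape == noisy.shape
```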

Efficient Inference

Latent‑space interpolation is applied only in the early denoising steps (first five steps), using flow‑guided warping.

Even‑odd frame processing: even frames undergo full denoising, odd frames are generated by cheap warping from the even ones.

Flow‑attention cache stores K/V pairs from the flow adapters after the first step, reusing them thereafter.
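The even/odd scheme above can be sketched as follows. The toy `warp` here is a rigid horizontal shift standing in for true flow‑guided backward warping, and all names are illustrative:

```python
import numpy as np

def warp(latent, flow_dx):
    # Stand-in for flow-guided warping: a rigid horizontal shift.
    # A real implementation would backward-warp with per-pixel flow.
    return np.roll(latent, flow_dx, axis=-1)

def early_step(latents, flows_dx, step, accel_steps=5):
    """Even/odd sketch: during the first `accel_steps` denoising steps,
    only even frames would receive a full denoiser pass; odd frames are
    produced by cheap warping from their even neighbors."""
    if step >= accel_steps:
        return latents                     # later steps: denoise every frame
    out = latents.copy()
    for t in range(1, len(latents), 2):    # odd frames: warp, don't denoise
        out[t] = warp(out[t - 1], flows_dx[t])
    return out

lat = np.random.rand(8, 4, 30, 54)         # (T, C, h, w) latents
flows = [1] * 8                            # toy per-frame shifts
out = early_step(lat, flows, step=0)
assert np.allclose(out[1], np.roll(out[0], 1, axis=-1))
```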

Experiments

Implementation Details

Dataset: Open‑Sora‑Plan provides 421,396 high‑quality 4K video clips (100 frames each). A custom benchmark of 100 videos (50 OR, 50 BR) is built from Pexels and Pixabay.

Training: Stage 1 – 5 epochs, batch 8 on eight NVIDIA A800 GPUs; Stage 2 – 30 epochs, batch 128 with gradient accumulation, λ = 0.1.

Inference: DDIM sampler, 25 denoising steps, acceleration step S = 5.

Comparative Experiments

Baselines: VideoComposer, CoCoCo, DiffuEraser (all open‑source text‑guided diffusion methods).

Qualitative: FloED fills masks with coherent content, avoiding the visual artifacts and hallucinations seen in baselines.

Quantitative: Metrics on BR – PSNR, VFID, SSIM, temporal consistency (TC) via CLIP‑image cosine similarity; on OR – text‑alignment (TA) via CLIP score. FloED outperforms all baselines across every metric (see Table 1).
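The temporal‑consistency (TC) metric reduces to averaging the cosine similarity between CLIP image embeddings of consecutive frames. A sketch, with random vectors standing in for real CLIP features:

```python
import numpy as np

def temporal_consistency(embs):
    """TC sketch: mean cosine similarity between image embeddings of
    consecutive frames. `embs` (T, D) stands in for CLIP features."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = (embs[:-1] * embs[1:]).sum(axis=1)   # cos sim of adjacent pairs
    return float(sims.mean())

# A perfectly static clip (identical embeddings) scores TC = 1.0.
e = np.tile(np.random.rand(1, 512), (10, 1))
assert abs(temporal_consistency(e) - 1.0) < 1e-6
```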

User Study: 15 annotators evaluated 100 videos on temporal coherence, text alignment, and contextual compatibility. FloED achieved selection rates of 62.27 % (BR) and 56.40 % (OR), the highest among competitors.

Ablation Studies

Flow Completion: Restored flow yields temporally consistent repairs (see Fig. 5, cases B vs C).

Flow Adapter: Multi‑scale adapters significantly improve scene compatibility and temporal coherence (D vs E, Table 2).

Flow Warping Placement: Moving warping to pure latent space (Formula 4) avoids error accumulation observed in baseline F.

Efficiency Ablations: Applying latent‑space interpolation only in the first five steps yields a 13.4 % speed‑up at 432 × 240 resolution without noticeable quality loss (Table 3). Combining flow cache and interpolation outperforms a version without any flow module (Table 4).

Discussion

The work focuses on text‑guided video restoration and compares primarily with diffusion‑based methods. The latent‑space interpolation technique can be transferred to other models such as CoCoCo. A limitation is that pre‑repairing corrupted flow may restrict cross‑scene generalization.

Conclusion

FloED is a flow‑guided, dual‑branch diffusion framework that delivers superior spatio‑temporal consistency and computational efficiency for video restoration. By integrating a time‑independent flow branch, multi‑scale flow adapters, an anchor‑frame strategy, and training‑free latent‑space interpolation with attention caching, FloED achieves state‑of‑the‑art results on both background‑repair and object‑removal tasks.

References

[1] Coherent Video Inpainting Using Optical Flow‑Guided Efficient Diffusion

Tags: efficiency, diffusion models, video restoration, optical flow, FloED
Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing in‑depth technical analysis and engineering practice.
