Hierarchical Masked 3D Diffusion Model for Video Outpainting
This paper introduces a hierarchical masked 3D diffusion model (M3DDM) that leverages mask modeling and global-frame cross‑attention to achieve temporally consistent video outpainting, proposes a hybrid coarse‑to‑fine inference pipeline to mitigate error accumulation in long videos, and demonstrates state‑of‑the‑art results on benchmark datasets.
Abstract
Video outpainting expands video borders while preserving temporal consistency, a challenge beyond image outpainting. We present a novel diffusion‑based method, the Hierarchical Masked 3D Diffusion Model (M3DDM), which uses mask‑modeling training and global‑video cross‑attention to ensure consistent frame generation and reduce jitter. A hybrid coarse‑to‑fine inference pipeline further alleviates error accumulation in long videos, achieving state‑of‑the‑art performance.
1. Background
In e‑commerce scenarios, advertisers often provide videos whose aspect ratios do not match app display areas. Simple stretching degrades visual quality, so video outpainting is employed to extend video borders and adapt to required dimensions. Challenges include GPU memory limits that require segment‑wise inference while maintaining temporal consistency, and error accumulation in long‑duration videos.
2. Solution
To address these challenges we propose:
Building a 3D video diffusion model by adapting the pretrained 2D Stable Diffusion parameters (see the weight‑inflation sketch after this list).
Introducing a guided‑frame strategy with a novel masking scheme for training.
Incorporating globally sampled frames into the cross‑attention layers to provide holistic video context.
Designing a hybrid coarse‑to‑fine inference pipeline that first generates sparse key frames, then interpolates intermediate frames, and finally refines remaining regions with bidirectional guidance.
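The summary does not spell out how the 2D Stable Diffusion weights are adapted to 3D. A common recipe, shown below purely as an assumption rather than the authors' confirmed scheme, is to "inflate" each pretrained 2D convolution into a 3D convolution whose kernel is centered on the current frame, so the network initially behaves exactly like the per-frame 2D model:

```python
import torch
from torch import nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, temporal_k: int = 1) -> nn.Conv3d:
    """Inflate a pretrained 2D conv into a 3D conv (hypothetical helper)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(temporal_k, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(temporal_k // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Put the 2D kernel at the temporal center and zero the rest, so the
        # inflated layer initially reproduces the frame-wise 2D behavior.
        conv3d.weight.zero_()
        conv3d.weight[:, :, temporal_k // 2] = conv2d.weight
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate one Stable-Diffusion-sized layer.
conv3d = inflate_conv2d_to_3d(nn.Conv2d(320, 320, kernel_size=3, padding=1))
```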
2.1 Training: Masked 3D Diffusion Model
The training pipeline follows standard diffusion modeling: a 3D U‑Net learns to denoise video clips corrupted with Gaussian noise, conditioned on binary masks indicating regions to be filled and on global frames encoded via a lightweight encoder. The loss function follows the conventional diffusion objective.
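As a concrete illustration, here is a minimal sketch of one such training step. It assumes latent-space ε-prediction with the conventional objective L = E[‖ε − ε_θ(z_t, t, m, c)‖²]; the toy network, the channel-concatenation layout, and the linear noise schedule are simplifying assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F
from torch import nn

class Toy3DUNet(nn.Module):
    """Stand-in for the 3D U-Net (shapes only; the real model also runs
    cross-attention over the global-frame embeddings)."""
    def __init__(self, c=4):
        super().__init__()
        self.net = nn.Conv3d(2 * c + 1, c, kernel_size=3, padding=1)
    def forward(self, x, t, global_ctx):
        return self.net(x)

def training_step(unet, latents, mask, global_ctx, num_timesteps=1000):
    """One masked-diffusion step on video latents.
    latents:    (B, C, T, H, W) clean VAE-encoded frames
    mask:       (B, 1, T, H, W) 1 = region to outpaint, 0 = known context
    global_ctx: (B, K, D) embeddings of globally sampled frames
    """
    b = latents.size(0)
    t = torch.randint(0, num_timesteps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    # Toy linear schedule for q(z_t | z_0); a standard scheduler would be used in practice.
    abar = (1.0 - t.float() / num_timesteps).view(b, 1, 1, 1, 1)
    z_t = abar.sqrt() * latents + (1.0 - abar).sqrt() * noise
    # Masked context conditioning: known regions pass through, unknown are zeroed.
    ctx = latents * (1.0 - mask)
    pred = unet(torch.cat([z_t, ctx, mask], dim=1), t, global_ctx)
    return F.mse_loss(pred, noise)  # conventional epsilon-prediction objective

unet = Toy3DUNet()
loss = training_step(unet, torch.randn(2, 4, 8, 32, 32),
                     torch.rand(2, 1, 8, 32, 32).round(), torch.randn(2, 4, 768))
```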
Masking strategies include full‑direction, single‑direction, dual‑direction, random‑single‑direction, and full masking, sampled with probabilities 0.2, 0.1, 0.35, 0.1, and 0.25, respectively; mask ratios are drawn uniformly from [0.15, 0.75]. Three training modes are used: (1) all frames masked, (2) the first or the first and last frames unmasked, and (3) each frame independently unmasked with probability 0.5, chosen with probabilities 0.3, 0.35, and 0.35, respectively.
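A small sketch of how these training-time choices could be sampled; the probabilities and ranges are the ones listed above, while the names and the helper itself are hypothetical:

```python
import random

# Edge-masking strategies and their probabilities, as listed above.
MASK_STRATEGIES = [
    ("full_direction", 0.20),
    ("single_direction", 0.10),
    ("dual_direction", 0.35),
    ("random_single_direction", 0.10),
    ("full_mask", 0.25),
]

# Frame-visibility modes for training, as listed above.
FRAME_MODES = [
    ("all_frames_masked", 0.30),
    ("first_or_first_and_last_unmasked", 0.35),
    ("each_frame_unmasked_with_p_0.5", 0.35),
]

def sample_mask_config(rng=random):
    """Draw one mask configuration for a training clip."""
    strategy = rng.choices([s for s, _ in MASK_STRATEGIES],
                           weights=[p for _, p in MASK_STRATEGIES])[0]
    mode = rng.choices([m for m, _ in FRAME_MODES],
                       weights=[p for _, p in FRAME_MODES])[0]
    ratio = rng.uniform(0.15, 0.75)  # fraction of the border region to mask
    return strategy, mode, ratio

print(sample_mask_config())
```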
2.2 Inference: Hybrid Coarse‑to‑Fine Pipeline
For long videos, repeated segment inference can cause error propagation. Our hybrid pipeline first generates sparse key frames, then fills intermediate frames via interpolation, and finally applies dense bidirectional guidance to refine remaining gaps. This reduces the number of iterations needed for key‑frame generation and mitigates temporal drift.
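The following sketch shows the shape of that schedule. Here `outpaint_fn` stands for one run of the diffusion model on a short clip with optional guide frames; the function name, the stride, and the dictionary-based bookkeeping are illustrative assumptions, not the paper's code:

```python
def outpaint_long_video(frames, outpaint_fn, keyframe_stride=15):
    """Hybrid coarse-to-fine schedule:
    1) coarse pass over sparse key frames, 2) interpolation between
    outpainted key frames, 3) (omitted) dense bidirectional refinement.
    """
    n = len(frames)
    key_idx = sorted(set(range(0, n, keyframe_stride)) | {n - 1})
    # 1) Coarse: one clip of sparsely sampled frames spans the whole video.
    keyframes = outpaint_fn([frames[i] for i in key_idx], guides={})
    results = dict(zip(key_idx, keyframes))
    # 2) Interpolate each segment, guided by its outpainted endpoints.
    for a, b in zip(key_idx, key_idx[1:]):
        seg = frames[a : b + 1]
        filled = outpaint_fn(seg, guides={0: results[a], len(seg) - 1: results[b]})
        for off, frame in enumerate(filled):
            results.setdefault(a + off, frame)
    # 3) A fine pass would re-run remaining regions with both temporal
    #    neighbors as guide frames; omitted here for brevity.
    return [results[i] for i in range(n)]
```

Because every interpolated frame is generated with already-outpainted key frames on both sides, errors no longer compound strictly from one segment to the next, which is the source of the reduced temporal drift.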
3. Experimental Analysis
Quantitative results on the DAVIS and YouTube‑VOS datasets show that M3DDM outperforms the method of Dehan et al. [7] and a simple diffusion baseline across five metrics at 256‑pixel resolution. Qualitative comparisons demonstrate superior temporal consistency and smoother video generation.
4. Deployment
The algorithm has been deployed in Alimama's Creative Center, enabling advertisers to automatically adjust video dimensions for various ad placements, thereby increasing coverage and traffic.
5. Conclusion
We introduced a mask‑modeling‑driven 3D diffusion framework for video outpainting, enhanced with global‑frame prompting and a hybrid coarse‑to‑fine inference strategy. Experiments confirm its effectiveness, and the system is now live in a commercial product with open‑source code.
References
[1] Rombach et al., 2022. High‑resolution image synthesis with latent diffusion models.
[2] Sohl‑Dickstein et al., 2015. Deep unsupervised learning using nonequilibrium thermodynamics.
[3] Ho et al., 2020. Denoising diffusion probabilistic models.
[4] Nichol & Dhariwal, 2021. Improved denoising diffusion probabilistic models.
[5] Ronneberger et al., 2015. U‑Net: Convolutional networks for biomedical image segmentation.
[6] Ho & Salimans, 2022. Classifier‑free diffusion guidance.
[7] Dehan et al., 2022. Complete and temporally consistent video outpainting.