Hierarchical Masked 3D Diffusion Model for Video Outpainting

The Hierarchical Masked 3D Diffusion Model (M3DDM) introduces a masking‑based training strategy and cross‑attention with global video clips to achieve temporally consistent video outpainting, while a hybrid coarse‑to‑fine inference pipeline mitigates error accumulation, delivering state‑of‑the‑art results and deployment in Alibaba’s creative center.

Alimama Tech

Jan 24, 2024

This paper introduces a diffusion-based video outpainting method called the Hierarchical Masked 3D Diffusion Model (M3DDM). Video outpainting extends the boundaries of a video while keeping the result temporally consistent, which makes it harder than image outpainting. M3DDM is trained with a masking-based strategy, so that guide frames can link the results of multiple video segments, keeping them temporally consistent and reducing inter-frame jitter. It also feeds frames from the global video clip into the cross-attention layers as prompts. In addition, a hybrid coarse-to-fine inference pipeline addresses error accumulation when outpainting long videos. The method achieves state-of-the-art results on video outpainting tasks.
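To make the masking idea concrete, here is a minimal sketch of how an outpainting mask for one clip could be built. It is an illustration under assumptions, not the authors' code: the function name `make_outpaint_mask`, its parameters, the centered-crop layout, and the convention that 1 means "known pixel" are all hypothetical. The only idea taken from the text is that border regions are masked for generation while guide frames stay fully visible.

```python
import numpy as np


def make_outpaint_mask(num_frames, height, width, ratio,
                       horizontal=True, guide_frames=()):
    """Return a (T, H, W) uint8 mask: 1 = known pixel, 0 = pixel to generate.

    ratio is the fraction of the width (or height) to outpaint, split
    evenly between the two opposite borders. Frames listed in
    guide_frames are left fully known. All names here are hypothetical.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    mask = np.zeros((num_frames, height, width), dtype=np.uint8)
    if horizontal:
        border = int(round(width * ratio / 2))
        mask[:, :, border:width - border] = 1   # keep the central columns
    else:
        border = int(round(height * ratio / 2))
        mask[:, border:height - border, :] = 1  # keep the central rows
    for t in guide_frames:
        mask[t] = 1                             # guide frames are fully visible
    return mask


# Example: 4 frames of 8x16, outpaint half the width, first and last frame as guides.
mask = make_outpaint_mask(4, 8, 16, ratio=0.5, guide_frames=(0, 3))
```

During training, a mask like this would be sampled with varying ratios, directions, and guide-frame choices, so that a single model learns to handle each of those conditions at inference time.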

The algorithm has been deployed in Alibaba's creative center, and the paper was published at ACM MM 2023. The code is open source.

Paper Title: Hierarchical Masked 3D Diffusion Model for Video Outpainting

Paper Download: https://arxiv.org/abs/2309.02119

Project Page: https://fanfanda.github.io/M3DDM/

Code Repository: https://github.com/alimama-creative/M3DDM-Video-Outpainting

The method addresses two main challenges in video outpainting: ensuring temporal consistency across video segments and mitigating error accumulation in long videos. The solution involves building a 3D video diffusion model based on Stable Diffusion's parameter prior, using guided frames to connect video segments, incorporating global frames as prompts in cross-attention layers, and proposing a hybrid coarse-to-fine inference pipeline.
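The coarse-to-fine idea can be sketched as a generation schedule. The code below is a simplified illustration under assumptions, not the paper's pipeline: the function `coarse_to_fine_schedule`, the `strides` parameter, and the per-frame granularity are hypothetical (the actual model processes multi-frame clips). It shows only the ordering: sparse keyframes are generated first without guide frames, and each finer level fills in frames using the nearest frames from coarser levels as guides.

```python
from bisect import bisect_left


def coarse_to_fine_schedule(num_frames, strides):
    """Return a list of (level, frame_index, guide_frames) in generation order.

    strides lists the frame spacing per level, coarse to fine, and must end
    with 1 so that every frame is eventually generated. Guide frames for a
    new frame are its nearest neighbours among frames generated at coarser
    levels. All names here are hypothetical.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not strides or strides[-1] != 1:
        raise ValueError("strides must end with 1")

    plan = []
    done = []  # sorted indices of frames generated at coarser levels
    for level, stride in enumerate(strides):
        candidates = set(range(0, num_frames, stride))
        if level == 0:
            candidates.add(num_frames - 1)  # anchor the end of the video
        new = sorted(candidates - set(done))
        for i in new:
            pos = bisect_left(done, i)
            guides = []
            if pos > 0:
                guides.append(done[pos - 1])  # nearest earlier frame
            if pos < len(done):
                guides.append(done[pos])      # nearest later frame
            plan.append((level, i, tuple(guides)))
        done = sorted(done + new)
    return plan


# Example: 9 frames, keyframes every 4 frames, then every 2, then all.
for level, frame, guides in coarse_to_fine_schedule(9, [4, 2, 1]):
    print(level, frame, guides)
```

Because every frame at a finer level is bounded by frames that were generated at a coarser level, errors do not propagate frame by frame from the start of the video to its end, which is the failure mode the hierarchical pipeline is meant to avoid.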

Experimental results show significant improvements over existing methods, such as the method of Dehan et al. and the Simple Diffusion Model (SDM) baseline, on the DAVIS and YouTube-VOS datasets. In Alibaba's creative center, advertisers use the algorithm to resize videos for different ad placements.


