
How Meta’s Movie Gen Pushes Text‑to‑Video Generation to New Heights

Meta’s 92‑page Movie Gen paper introduces a family of media foundation models covering text‑to‑image, text‑to‑video, personalized video, precise video editing, and audio generation. This article walks through its two‑model design, training pipeline, temporal auto‑encoder, scaling strategy, evaluation benchmark, and ablation studies.

Baobao Algorithm Notes

Oct 17, 2024


On October 4, 2024, Meta published a 92‑page research report on Movie Gen, a set of media foundation models for high‑quality image and video synthesis. The models support multiple aspect ratios, synchronized audio, text‑to‑video, personalized video generation from user images, precise video editing, video‑to‑audio, and text‑to‑audio. This analysis condenses the paper’s technical contributions.

Movie Gen Video Architecture

The core system consists of two models: Movie Gen Video and Movie Gen Audio. Movie Gen Video is first pre‑trained on joint text‑image and text‑video data, then post‑trained (fine‑tuned) to acquire personalized video and editing capabilities. Movie Gen Audio is trained on video‑to‑audio and text‑to‑audio data.

Training proceeds in multiple stages. A low‑resolution (256 px) text‑to‑image stage first warms the model up. Training then continues jointly on text‑image and text‑video data, first at 256 px and then at 768 px, and ends with supervised fine‑tuning (SFT) on high‑quality text‑video data. Starting at low resolution keeps early sequences short, which speeds up convergence and lowers memory use.

Temporal Auto‑Encoder (TAE)

TAE is initialized from a pretrained image VAE and inflated along the temporal dimension by inserting 1‑D temporal convolutions and attention after each 2‑D spatial block. Strided (stride‑2) convolutions handle temporal downsampling, so the encoder maps a video of shape T'×3×H'×W' to a latent of shape T×C×H×W that is 8× smaller along time, height, and width. This sharply shortens the sequence the generative transformer has to process. The decoder reverses the compression, using nearest‑neighbor interpolation for upsampling, to recover the original resolution.
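
To make the compression concrete, the sketch below computes the latent shape for a clip under the 8× factor stated above. The channel count of 16, the example clip size, and the function name are illustrative assumptions, and frame counts that are not divisible by 8 are handled with simple ceiling division rather than whatever padding the real model uses.

```python
import math

def tae_latent_shape(frames, height, width, factor=8, latent_channels=16):
    """Return (T, C, H, W) of the latent for a (frames, 3, height, width) clip.

    factor=8 comes from the text; latent_channels=16 is an assumed value.
    """
    t = math.ceil(frames / factor)
    h = math.ceil(height / factor)
    w = math.ceil(width / factor)
    return (t, latent_channels, h, w)

# Hypothetical clip: 256 frames at 768 x 768 pixels.
pixel_positions = 256 * 768 * 768
t, c, h, w = tae_latent_shape(256, 768, 768)
latent_positions = t * h * w

print((t, c, h, w))                         # (32, 16, 96, 96)
print(pixel_positions // latent_positions)  # 512, i.e. 8 * 8 * 8
```

The number of spatio‑temporal positions drops by 8 × 8 × 8 = 512, which is why the backbone can afford attention over long clips.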

Training the TAE with the standard VAE objective produced speckle ("spot") artifacts in decoded videos, which the authors trace to latent values with unusually high norm. Meta introduced an Outlier Penalty Loss (OPL) that penalizes latent values lying more than r standard deviations from the mean (r = 3) and added it to the VAE loss with a weight of 1e5, which removes the speckles.
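
A minimal sketch of an outlier penalty of this kind follows. It assumes the penalty is a hinge on how far each latent value sits from the mean, measured against r standard deviations, and it flattens the latent into a plain list for readability. The function names are hypothetical; only the scale factor 3 and the weight 1e5 come from the text.

```python
import statistics

def outlier_penalty_loss(latent_values, r=3.0):
    """Mean hinge penalty on values more than r standard deviations from the mean."""
    mean = statistics.fmean(latent_values)
    std = statistics.pstdev(latent_values)
    penalties = [max(abs(v - mean) - r * std, 0.0) for v in latent_values]
    return sum(penalties) / len(penalties)

def total_loss(vae_loss, latent_values, opl_weight=1e5, r=3.0):
    """Add the weighted outlier penalty to an already computed VAE loss."""
    return vae_loss + opl_weight * outlier_penalty_loss(latent_values, r)

well_behaved = [-1.0, 0.0, 1.0] * 10
with_outlier = [0.0] * 99 + [100.0]

print(outlier_penalty_loss(well_behaved))      # 0.0, nothing exceeds 3 sigma
print(outlier_penalty_loss(with_outlier) > 0)  # True, the single spike is penalized
```

Values inside the r‑sigma band contribute nothing, so the term only acts on the rare extreme latents that produce the artifacts.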

Because high‑resolution long videos cannot be encoded end‑to‑end due to memory limits, the authors split videos into temporal chunks, process each chunk independently, and blend the outputs with weighted averaging to avoid boundary artifacts. The encoder tiles the video in chunks of 32 raw frames (4 latent frames) without overlap; the decoder tiles with an overlap of 16 raw frames (2 latent frames) between neighboring chunks.
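
The chunk‑and‑blend idea can be sketched as below. Frames are plain floats for simplicity, `fn` stands in for the encoder or decoder, and the linear ramp weights are an assumption: the text only says the outputs are combined by weighted averaging. The default chunk size and overlap are illustrative parameters.

```python
def chunk_starts(num_frames, chunk, overlap):
    """Start indices of temporal chunks that together cover [0, num_frames)."""
    stride = chunk - overlap
    starts = list(range(0, max(num_frames - chunk, 0) + 1, stride))
    if starts[-1] + chunk < num_frames:   # add a final chunk so the tail is covered
        starts.append(num_frames - chunk)
    return starts

def blend_weights(chunk, overlap):
    """Linear ramps over the two overlapping ends, 1.0 in the middle."""
    w = [1.0] * chunk
    for i in range(overlap):
        ramp = (i + 1) / (overlap + 1)
        w[i] = min(w[i], ramp)
        w[chunk - 1 - i] = min(w[chunk - 1 - i], ramp)
    return w

def process_in_chunks(frames, fn, chunk=32, overlap=16):
    """Apply fn to each chunk independently and blend by weighted average."""
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk")
    n = len(frames)
    if n <= chunk:
        return fn(frames)
    acc, norm = [0.0] * n, [0.0] * n
    weights = blend_weights(chunk, overlap)
    for s in chunk_starts(n, chunk, overlap):
        out = fn(frames[s:s + chunk])
        for i, (value, wt) in enumerate(zip(out, weights)):
            acc[s + i] += wt * value
            norm[s + i] += wt
    return [a / z for a, z in zip(acc, norm)]

video = [float(i) for i in range(70)]
restored = process_in_chunks(video, lambda c: list(c))  # identity "model"
print(chunk_starts(70, 32, 16))                          # [0, 16, 32, 38]
```

With an identity function the blended output matches the input, which is a quick sanity check that the weights are normalized correctly.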

Training Infrastructure

6144 H100 GPUs (each 700 W, 80 GB HBM3) on Meta’s Grand Teton AI server platform.

Eight GPUs per server connected via NVSwitch; cross‑server 400 Gbps RoCE RDMA NICs.

Global‑scale training scheduler orchestrates the workload.

Parallelism combines Tensor Parallelism (TP), Sequence Parallelism (SP), Context Parallelism (CP), and Fully‑Sharded Data Parallelism (FSDP) to distribute computation and memory across the GPU fabric.
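
As a rough illustration of how such a combination partitions a cluster, the sketch below factors a pool of GPUs into tensor‑parallel, context‑parallel, and data‑parallel (FSDP) groups. The degrees `tp=4` and `cp=2` are hypothetical values chosen for the example, not the paper's configuration, and the sketch assumes sequence parallelism reuses the tensor‑parallel group, as is common in Megatron‑style setups.

```python
def data_parallel_degree(world_size, tp, cp):
    """Number of FSDP replicas left once TP and CP groups are carved out."""
    model_parallel = tp * cp
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by tp * cp")
    return world_size // model_parallel

def rank_to_coords(rank, tp, cp):
    """Map a global rank to (dp, cp, tp) coordinates, with TP varying fastest."""
    return (rank // (tp * cp), (rank // tp) % cp, rank % tp)

# 6144 GPUs from the text; tp and cp are assumed values.
print(data_parallel_degree(6144, tp=4, cp=2))  # 768
print(rank_to_coords(13, tp=4, cp=2))          # (1, 1, 1)
```

Letting the tensor‑parallel index vary fastest keeps each TP group inside one 8‑GPU server, so its heavy traffic stays on NVSwitch rather than crossing the network.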

Pre‑training Data and Filtering

The pre‑training corpus contains on the order of 100 million video‑text pairs and 1 billion image‑text pairs. Source videos range from 4 s to 2 min and are cut into clips of 4 to 16 s. A three‑stage filtering pipeline removes low‑quality, low‑motion, and duplicate content:

Visual filtering (dropping videos whose width or height is below 720 px, balancing aspect ratios, removing videos with excessive on‑screen text via OCR, scene‑boundary detection to extract clips, a visual quality model, and trimming unstable camera starts).

Motion filtering (static‑video detector, VMAF motion score, scene‑change detection to discard excessive camera shake).

Content filtering (duplicate detection in embedding space, concept‑aware resampling, clustering of semantic embeddings).

Subsequent multi‑stage data splits provide progressively stricter visual, motion, and content thresholds for low‑resolution and high‑resolution training phases.
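
The three filtering stages can be sketched as a chain of predicates. Only the 720 px minimum and the 4 to 16 s duration window come from the text; the `Clip` fields, the motion thresholds, and the use of a cluster id as a stand‑in for embedding‑space duplicate detection are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    width: int
    height: int
    duration_s: float
    motion_score: float  # hypothetical score in [0, 1], higher means more motion
    cluster_id: int      # hypothetical id standing in for near-duplicate detection

def passes_visual(clip, min_side=720):
    """Visual stage: reject clips whose shorter side is below min_side pixels."""
    return min(clip.width, clip.height) >= min_side

def passes_motion(clip, min_motion=0.2, max_motion=0.9):
    """Motion stage: reject near-static and excessively shaky clips."""
    return min_motion <= clip.motion_score <= max_motion

def filter_clips(clips, min_s=4.0, max_s=16.0):
    """Run duration, visual, motion, and duplicate checks in order."""
    kept, seen = [], set()
    for clip in clips:
        if not (min_s <= clip.duration_s <= max_s):
            continue
        if not (passes_visual(clip) and passes_motion(clip)):
            continue
        if clip.cluster_id in seen:  # content stage: drop duplicates
            continue
        seen.add(clip.cluster_id)
        kept.append(clip)
    return kept
```

Stricter thresholds for later training phases would amount to calling the same functions with tighter arguments.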

Supervised Fine‑Tuning (SFT)

SFT uses a curated set of high‑quality videos with manually written captions. The pipeline includes candidate video collection, concept balancing via k‑NN retrieval, manual selection of cinematic clips, and detailed caption refinement (including camera‑motion tags). Training uses a smaller batch size on 64 nodes (512 H100 GPUs) with a cosine learning‑rate schedule.
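
For reference, a cosine learning‑rate schedule has the following standard form. The peak rate, step count, and absence of warm‑up below are placeholder assumptions; the text does not state the values used.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1000 steps, peak learning rate 1e-4.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, 1000, 1e-4))
```

The rate falls slowly at first, fastest in the middle, and flattens out near the end, which tends to suit short fine‑tuning runs.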

Inference and Sampling

During inference, prompts are rewritten by an LLM so that they match the distribution of the training captions; for efficiency, a 70B teacher rewriter is distilled into an 8B student. A linear‑quadratic time schedule with only 50 steps approximates a 250‑step linear schedule at comparable quality, cutting the number of solver steps fivefold (the paper reports speed‑ups of up to ~20×). A plain Euler ODE solver outperformed the adaptive Dopri5 solver in the authors' experiments.
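
The sketch below shows one plausible way to build a 50‑step linear‑quadratic schedule that emulates a 250‑step linear one, together with a fixed‑step Euler sampler. It is not the paper's exact formula: the quadratic tail here is simply chosen to continue the linear step size and land on t = 1, and the constant velocity field is a toy stand‑in for the trained model.

```python
def linear_quadratic_schedule(num_steps=50, emulated_steps=250):
    """Timestamps in [0, 1]: first half linear, second half quadratic."""
    n_lin = num_steps // 2
    n_quad = num_steps - n_lin
    step = 1.0 / emulated_steps          # step size of the emulated linear schedule
    ts = [i * step for i in range(n_lin + 1)]
    t_switch = ts[-1]
    # Quadratic tail t(j) = t_switch + step*j + b*j^2, with b set so t(n_quad) = 1.
    b = (1.0 - t_switch - step * n_quad) / n_quad ** 2
    ts += [t_switch + step * j + b * j * j for j in range(1, n_quad + 1)]
    ts[-1] = 1.0                         # guard against rounding error
    return ts

def euler_sample(x0, velocity, ts):
    """Integrate dx/dt = velocity(x, t) along the timestamps with Euler steps."""
    x = x0
    for t0, t1 in zip(ts, ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)
    return x

ts = linear_quadratic_schedule()
print(len(ts) - 1)                                 # 50 solver steps
print(euler_sample(1.0, lambda x, t: 2.0, ts))     # close to 3.0 for this toy field
```

The small early steps match the fine linear schedule where it matters, and the growing later steps cover the rest of the interval with far fewer model evaluations.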

Evaluation Benchmark

Meta introduced Movie Gen Video Bench, a benchmark of 1,003 prompts covering human activities, animals, landscapes, physics, and unusual concepts, each annotated with motion intensity (high/medium/low). Human evaluation assesses text alignment (subject and motion), visual quality (frame consistency, motion completeness, naturalness, overall quality), and realism/aesthetics. Automated metrics (FVD, IS) were found to correlate poorly with human judgment, so extensive human studies with detailed guidelines were used instead.

In human pairwise comparisons, Movie Gen Video is preferred on overall quality over prior models (Runway Gen‑3, LumaLabs, Kling 1.5, OpenAI Sora), with margins that vary by model and by evaluation dimension.
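
Pairwise human preferences of this kind are often summarized as a net win rate, the share of wins minus the share of losses. The helper below is a generic sketch of that statistic with made‑up votes; it is an assumption about how such results can be aggregated, not a reproduction of the paper's evaluation code or numbers.

```python
from collections import Counter

def net_win_rate(votes):
    """Percentage of wins minus percentage of losses.

    votes: iterable of 'win', 'tie', or 'loss' from the model's point of view.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes")
    return 100.0 * (counts["win"] - counts["loss"]) / total

# Made-up example: 6 wins, 2 ties, 2 losses out of 10 comparisons.
print(net_win_rate(["win"] * 6 + ["tie"] * 2 + ["loss"] * 2))  # 40.0
```

A value near zero means raters saw the two models as roughly equal, while the sign shows which one they preferred.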

Ablation Studies

Flow‑matching training objective surpasses diffusion‑based objectives in both overall quality and text alignment.

Video‑level captions generated by LLaMa3‑Video improve motion alignment by 10.7 % compared to frame‑wise caption rewriting.

LLaMa3‑based transformer architecture outperforms Diffusion Transformer baselines.

2.5‑D (spatial 2‑D + temporal 1‑D) attention in TAE offers a good trade‑off between quality and memory versus full 3‑D attention.

Outlier Penalty Loss reduces speckle artifacts and improves PSNR/SSIM/FID on reconstructed video clips.
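
For readers unfamiliar with the flow‑matching objective in the first ablation, the sketch below builds one training pair under the commonly used straight‑line (optimal‑transport) path between noise and data. The value of `sigma_min`, the function names, and the use of plain lists instead of tensors are assumptions for illustration.

```python
def flow_matching_pair(x0, x1, t, sigma_min=1e-5):
    """Return (x_t, target_velocity) for noise x0, data x1, and time t in [0, 1]."""
    scale = 1.0 - sigma_min
    x_t = [t * d + (1.0 - scale * t) * n for n, d in zip(x0, x1)]
    v = [d - scale * n for n, d in zip(x0, x1)]
    return x_t, v

def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

noise, data = [1.0, -2.0], [0.5, 0.25]
x_t, v = flow_matching_pair(noise, data, t=0.5, sigma_min=0.0)
print(x_t, v)                # [0.75, -0.875] [-0.5, 2.25]
print(mse([0.0, 0.0], v))    # loss of a model that predicts zero velocity
```

The model is trained to regress `v` from `x_t` and `t`; at sampling time the learned velocity is integrated from noise to data with an ODE solver.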

Image Generation Extension

By swapping the video TAE with an image auto‑encoder and fine‑tuning on a 1k‑image dataset, Movie Gen also achieves state‑of‑the‑art text‑to‑image results (1024 px resolution) evaluated via human pairwise comparisons for text fidelity and visual quality.

A follow‑up article will cover the remaining capabilities: personalized video generation from user images, precise instruction‑based video editing, video‑to‑audio, and text‑to‑audio.

For additional resources see the official Meta blog and research videos (links omitted for brevity).

