NOVA: Redefining Autoregressive Visual Modeling Without Vector Quantization
NOVA is an efficient autoregressive video generation framework that eliminates vector quantization. It combines causal frame‑by‑frame prediction in time with set‑by‑set prediction (bidirectional attention) within each frame, delivers strong results on GenEval and VBench, and generalizes zero‑shot across text‑to‑image and text‑to‑video tasks.
Paper title: Autoregressive Video Generation without Vector Quantization (ICLR 2025)
Paper URL: http://arxiv.org/pdf/2412.14169
Project page: http://github.com/baaivision/NOVA
Model Overview
Traditional autoregressive visual models rely on vector quantization to convert images or video frames into discrete tokens, which leads to high token counts and large computational overhead for high‑resolution or long videos. NOVA treats visual tokens as continuous vectors and applies two complementary prediction strategies:
Temporal dimension: causal frame‑by‑frame prediction.
Spatial dimension: set‑by‑set prediction within each frame using bi‑directional attention.
This decoupling retains the in‑context flexibility of GPT‑style causal models while enabling efficient parallel decoding inside frames.
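To make the decoupling concrete, here is a minimal sketch of the decoding loop, assuming a stub model: the outer loop is causal over frames, and the inner loop fills random token sets within each frame in parallel. TinyNOVA, predict_set, and the half-the-remaining-tokens schedule are illustrative stand-ins, not the released implementation.

```python
import torch

class TinyNOVA:
    """Stub standing in for the real model; predict_set just returns noise."""
    dim = 8
    def predict_set(self, context, idx):
        return torch.randn(len(idx), self.dim)

def generate_video(model, text_emb, num_frames=4, tokens_per_frame=16):
    frames = []
    for _ in range(num_frames):                        # temporal: causal, frame by frame
        frame = torch.zeros(tokens_per_frame, model.dim)
        unfilled = list(range(tokens_per_frame))
        while unfilled:                                # spatial: fill random sets in parallel
            k = max(1, len(unfilled) // 2)             # e.g. half of the remaining tokens
            idx = [unfilled.pop(torch.randint(len(unfilled), (1,)).item())
                   for _ in range(k)]
            ctx = (text_emb, frames, frame)            # prompt + past frames + partial frame
            frame[idx] = model.predict_set(ctx, idx)   # bidirectional attention inside frame
        frames.append(frame)
    return torch.stack(frames)

video = generate_video(TinyNOVA(), text_emb=torch.randn(1, 8))
print(video.shape)  # torch.Size([4, 16, 8])
```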
Temporal Autoregressive Modeling
Frames are modeled as a causal sequence: at each time step, the model attends to the text prompt, a motion score, and all previously generated frames, while tokens within the current frame attend to each other bidirectionally (block‑wise causal masking). Text is encoded with Phi‑2, motion scores are derived from optical flow (computed with OpenCV), and a 3D VAE (temporal stride 4, spatial stride 8) compresses frames into a continuous latent space.
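The block‑wise causal mask can be written down directly. This is a reconstruction from the description above, not code from the paper: each token attends to every token in earlier frames and to every token in its own frame, but never to a future frame.

```python
import torch

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # allowed[i, j] is True when token j's frame is not later than token i's frame
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

mask = blockwise_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
# tokens 0-1 (frame 0) see only frame 0; tokens 2-3 see frames 0-1; and so on.
```

The boolean mask can be passed to attention as‑is (True = attend) or converted to an additive −inf bias.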
Spatial Set‑by‑Set Autoregression
Inspired by MaskGIT and MAR, NOVA predicts sets of tokens within each frame in a random order. Indicator features derived from neighboring frames guide the spatial autoregressive process, and a Scaling‑and‑Shift layer learns frame‑wise motion adjustments, improving temporal consistency.
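One plausible reading of the Scaling‑and‑Shift layer, sketched as an AdaLN‑style affine modulation; this formulation is my assumption, and the paper's exact design may differ. Parameters predicted from a neighboring frame's indicator feature scale and shift the current frame's normalized features, so cross‑frame motion becomes a learned affine change.

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Hypothetical AdaLN-style modulation from neighboring-frame features."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)  # predicts (scale, shift)

    def forward(self, x: torch.Tensor, indicator: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(indicator).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

layer = ScaleShift(dim=64)
x = torch.randn(1, 16, 64)    # current-frame token features
ind = torch.randn(1, 1, 64)   # indicator feature from a neighboring frame
print(layer(x, ind).shape)    # torch.Size([1, 16, 64])
```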
Training Objective
During training, NOVA adopts the diffusion loss from MAR: each token's continuous representation is corrupted with Gaussian noise, and a small diffusion MLP, conditioned on the transformer's output for that token, is trained to predict the injected noise. Training uses a standard 1000‑step noise schedule; sampling uses 100 denoising steps.
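A compact sketch of that objective, assuming a plain linear‑beta DDPM schedule for brevity (the paper uses IDDPM, and DenoiseMLP with its crude timestep embedding is illustrative): the MLP sees the noised token x_t, the transformer's conditioning vector z, and the timestep, and learns to predict the injected noise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                # linear schedule (simplified)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class DenoiseMLP(nn.Module):
    def __init__(self, dim=16, cond=32, hidden=1280):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )
    def forward(self, x_t, z, t):
        t_emb = t.float().unsqueeze(-1) / T           # crude timestep embedding
        return self.net(torch.cat([x_t, z, t_emb], dim=-1))

def diffusion_loss(mlp, x0, z):
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # forward noising of the token
    return F.mse_loss(mlp(x_t, z, t), eps)            # predict the noise

mlp = DenoiseMLP()
loss = diffusion_loss(mlp, x0=torch.randn(8, 16), z=torch.randn(8, 32))
print(loss.item())
```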
Dataset and Training Details
Image‑text pairs: 16 M collected from DataComp, COYO, Unsplash, and JourneyDB, later expanded to ~600 M high‑aesthetic images (aesthetic score ≥ 5) from LAION, DataComp, and COYO.
Video‑text pairs: 19 M from a subset of Panda‑70M plus internal sources, and 1 M high‑resolution pairs from Pexels (max text length 256).
Architecture:
Temporal encoder: 16‑layer transformer, 768‑dim, 0.3 B parameters.
Spatial encoder: 16‑layer transformer, 1024‑dim, 0.6 B parameters.
Decoder: 16‑layer transformer, 1536‑dim, 1.4 B parameters.
Denoising MLP: 3 layers, 1280‑dim.
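For reference, the same configuration collected into one structure; field names are mine, and parameter counts are as reported above, not recomputed.

```python
from dataclasses import dataclass

@dataclass
class NOVAConfig:
    temporal_layers: int = 16   # temporal encoder, ~0.3 B parameters
    temporal_dim: int = 768
    spatial_layers: int = 16    # spatial encoder, ~0.6 B parameters
    spatial_dim: int = 1024
    decoder_layers: int = 16    # decoder, ~1.4 B parameters
    decoder_dim: int = 1536
    mlp_layers: int = 3         # denoising MLP
    mlp_dim: int = 1280
    vae_temporal_stride: int = 4
    vae_spatial_stride: int = 8

print(NOVAConfig())
```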
Masking follows MAR's strategy; diffusion uses an IDDPM scheduler with a 1000‑step noise schedule during training and 100 steps at inference. The model is first pretrained on text‑to‑image, and those weights then initialize training of the text‑to‑video model.
Evaluation
Benchmarks:
Text‑to‑image: T2I‑CompBench, GenEval, DPG‑Bench.
Text‑to‑video: VBench (16‑dim evaluation).
For each prompt, five samples of 33 frames at 768 × 480 resolution are generated with classifier‑free guidance (scale 7.0) and 128 autoregressive steps.
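The guidance step itself is the standard classifier‑free combination; a minimal sketch with illustrative names, applied here to the diffusion head's noise predictions:

```python
import torch

def cfg_prediction(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
                   scale: float = 7.0) -> torch.Tensor:
    # guided = uncond + scale * (cond - uncond); scale 7.0 matches the evaluation setup
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_c = torch.randn(2, 4, 16)   # noise prediction with the text prompt
eps_u = torch.randn(2, 4, 16)   # noise prediction with a null prompt
print(cfg_prediction(eps_c, eps_u).shape)  # torch.Size([2, 4, 16])
```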
Results:
On GenEval, NOVA achieves 0.75, surpassing PixArt‑α, Stable Diffusion v1/v2, SDXL, DALL‑E 2/3, SD3, LlamaGen, and Emu3.
On VBench, NOVA (0.6 B parameters) scores 80.12, surpassing much larger models such as CogVideo (9 B) and approaching Emu3 (8 B) at 80.96, while offering significantly lower inference latency.
Qualitative examples show that NOVA preserves color fidelity and spatial relationships, produces realistic motion, and supports zero‑shot video generation from a reference image, with or without a text prompt.
References
Video Generation Models as World Simulators
Kling AI
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Emu3: Next-Token Prediction is All You Need
Autoregressive Image Generation without Vector Quantization (MAR)
MAGVIT: Masked Generative Video Transformer
Open-Sora Plan: Open-Source Large Video Generation Model
MaskGIT: Masked Generative Image Transformer