NOVA: Redefining Autoregressive Visual Modeling Without Vector Quantization

NOVA introduces a highly efficient autoregressive video generation framework that eliminates vector quantization, combines frame‑by‑frame causal prediction with set‑by‑set spatial attention, and achieves quality on VBench and GenEval competitive with much larger state‑of‑the‑art models while offering strong zero‑shot generalization across text‑to‑image and text‑to‑video tasks.

NOVA: Autoregressive Video Generation Without Vector Quantization

Paper title: Autoregressive Video Generation without Vector Quantization (ICLR 2025)

Paper URL: http://arxiv.org/pdf/2412.14169

Project page: http://github.com/baaivision/NOVA

Model Overview

Traditional autoregressive visual models rely on vector quantization to convert images or video frames into discrete tokens, which leads to high token counts and large computational overhead for high‑resolution or long videos. NOVA treats visual tokens as continuous vectors and applies two complementary prediction strategies:

Temporal dimension: causal frame‑by‑frame prediction.

Spatial dimension: set‑by‑set prediction within each frame using bi‑directional attention.

This decoupling retains the in‑context flexibility of GPT‑style causal models while enabling efficient parallel decoding inside frames.
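
To make the two axes concrete before the detailed sections below, here is a minimal sketch of how they compose at inference time. All function names are hypothetical and do not reflect NOVA's released API; the point is only the nesting: a causal outer loop over frames and a parallel, set‑by‑set inner loop within each frame.

```python
# Minimal sketch of NOVA-style decoupled generation (hypothetical API).
# Outer loop: causal, frame by frame. Inner loop: set-by-set within a frame,
# with bidirectional attention over tokens already filled in that frame.

def generate_video(model, text_emb, num_frames, sets_per_frame):
    frames = []                                   # latent frames generated so far
    for t in range(num_frames):                   # temporal axis: strictly causal
        frame = model.init_masked_frame()         # all token slots start masked
        for s in range(sets_per_frame):           # spatial axis: parallel token sets
            # Condition on the text, all previous frames, and the partially
            # filled current frame (bidirectional within the frame).
            ctx = model.encode(text_emb, frames, frame)
            token_set = model.sample_next_set(ctx, frame)  # continuous tokens from the diffusion head
            frame = model.fill(frame, token_set)
        frames.append(frame)
    return model.vae_decode(frames)               # 3D VAE maps latents back to pixels
```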

Temporal Autoregressive Modeling

Frames are modeled as a causal sequence. For each time step the model attends to the text prompt, motion flow, and all previously generated frames, while tokens within the current frame can attend to each other (block‑wise causal masking). Text is encoded with Phi‑2, motion scores are derived from optical flow (OpenCV), and a 3D VAE (temporal stride 4, spatial stride 8) compresses frames into a latent space.
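
The block‑wise causal mask described above can be sketched as follows. This is a simplified construction under my own assumptions about token layout (a conditioning prefix of text and motion tokens followed by frames of equal token count); NOVA's actual implementation may differ.

```python
import torch

def blockwise_causal_mask(num_prefix, num_frames, tokens_per_frame):
    """Boolean attention mask (True = may attend).

    - Prefix tokens (text prompt, motion score) are visible to every position.
    - Tokens of frame t see the prefix, all frames < t, and each other
      (bidirectional within the frame), but never future frames.
    """
    n = num_prefix + num_frames * tokens_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :num_prefix] = True                    # everyone sees the prefix
    for t in range(num_frames):
        start = num_prefix + t * tokens_per_frame
        end = start + tokens_per_frame
        mask[start:end, num_prefix:end] = True     # past frames + own frame
    return mask

# Toy example: 2 prefix tokens, 3 frames, 4 tokens per frame.
m = blockwise_causal_mask(num_prefix=2, num_frames=3, tokens_per_frame=4)
```

For scale, with a temporal stride of 4 and a spatial stride of 8, a 33‑frame 768 × 480 clip maps to roughly 9 latent frames of 96 × 60 positions each, though the exact tokenization on top of the VAE latents (e.g., any patching) is not spelled out here.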

Figure: NOVA framework and inference process

Spatial Set‑by‑Set Autoregression

Inspired by MaskGIT and MAR, NOVA predicts token sets in a random order within a frame. Indicator features derived from neighboring frames guide the spatial AR process. A Scaling‑and‑Shift layer learns frame‑wise motion adjustments, improving temporal consistency.
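
One way to picture the Scaling‑and‑Shift idea is as a small modulation layer that reparameterizes the current frame's indicator features relative to the previous frame, so the spatial model only has to capture the change between frames. This is a schematic reading of the description above, not NOVA's actual module; names and shapes are mine.

```python
import torch
import torch.nn as nn

class ScalingAndShift(nn.Module):
    """Schematic frame-wise modulation: predict a per-channel scale and shift
    from the temporal context, then apply them to the previous frame's
    indicator features before spatial set-by-set prediction."""

    def __init__(self, dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, prev_indicator, temporal_context):
        scale, shift = self.to_scale_shift(temporal_context).chunk(2, dim=-1)
        return prev_indicator * (1 + scale) + shift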

Figure: Spatial generalized autoregressive attention vs. per‑token prediction

Training Objective

During training, NOVA adopts the diffusion loss from MAR: each token’s continuous representation is recovered from Gaussian noise by a small diffusion MLP conditioned on the decoder output. Training uses the standard 1000‑step noise schedule; sampling uses 100 denoising steps.
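
Written out, the per‑token diffusion loss from MAR takes the following form (notation follows the MAR paper; z is the condition vector produced by NOVA’s decoder for a token and x is that token’s ground‑truth continuous value):

```latex
\mathcal{L}(z, x) \;=\; \mathbb{E}_{\varepsilon, t}\Big[\, \big\lVert \varepsilon - \varepsilon_\theta(x_t \mid t, z) \big\rVert^2 \,\Big],
\qquad x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon
```

Here ε ~ N(0, I), t is a sampled noise step, and ε_θ is the small denoising MLP.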

Figure: Diffusion loss equation

Dataset and Training Details

Image‑text pairs: 16 M collected from DataComp, COYO, Unsplash, and JourneyDB, later expanded to ~600 M high‑aesthetic images (aesthetic score ≥ 5) from LAION, DataComp, and COYO.

Video‑text pairs: 19 M from a subset of Panda‑70M plus internal sources, and an additional 1 M high‑resolution pairs from Pexels (maximum text length 256).

Architecture:

Temporal encoder: 16‑layer transformer, 768‑dim, 0.3 B parameters.

Spatial encoder: 16‑layer transformer, 1024‑dim, 0.6 B parameters.

Decoder: 16‑layer transformer, 1536‑dim, 1.4 B parameters.

Denoising MLP: 3 layers, 1280‑dim.

Masking follows MAR’s strategy; diffusion uses the IDDPM formulation with a 1000‑step noise schedule during training and 100 steps at inference. The model is first pretrained on text‑to‑image, and those weights are then used to initialize text‑to‑video training.
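
For quick reference, the reported configuration can be collected into a small config sketch. The values are transcribed from the list above; the key names are illustrative and are not taken from the released code.

```python
# NOVA configuration as reported in the article (illustrative key names).
NOVA_CONFIG = {
    "temporal_encoder": {"layers": 16, "dim": 768,  "params": "0.3B"},
    "spatial_encoder":  {"layers": 16, "dim": 1024, "params": "0.6B"},
    "decoder":          {"layers": 16, "dim": 1536, "params": "1.4B"},
    "denoising_mlp":    {"layers": 3,  "dim": 1280},
    "diffusion":        {"schedule": "IDDPM", "train_steps": 1000, "inference_steps": 100},
    "vae":              {"temporal_stride": 4, "spatial_stride": 8},
}
```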

Evaluation

Benchmarks:

Text‑to‑image: T2I‑CompBench, GenEval, DPG‑Bench.

Text‑to‑video: VBench (16‑dim evaluation).

For each prompt, five videos of 33 frames at 768 × 480 resolution are generated using classifier‑free guidance (scale 7.0) and 128 autoregressive steps.
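
The guidance scale of 7.0 is applied in the usual classifier‑free guidance way, combining conditional and unconditional denoiser outputs; a minimal sketch of the standard formulation (not NOVA‑specific code):

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale=7.0):
    """Standard CFG: push the conditional prediction away from the
    unconditional one by `scale`. eps_* are denoiser outputs for the same
    noisy token with and without the text condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```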

Results:

On GenEval, NOVA achieves 0.75, surpassing PixArt‑α, Stable Diffusion v1/v2, SDXL, DALL‑E 2/3, SD3, LlamaGen, and Emu3.

On VBench, NOVA (0.6 B parameters) scores 80.12, clearly exceeding CogVideo (9 B) and approaching the much larger Emu3 (8 B, 80.96), while offering significantly lower inference latency.

Figure: Text‑to‑image benchmark comparison
Figure: Text‑to‑video benchmark comparison

Qualitative examples demonstrate NOVA’s ability to preserve color fidelity, spatial relationships, and realistic motion, including zero‑shot video generation from a reference image with or without textual prompts.

Figure: Text‑to‑image generation results
Figure: Text‑to‑video generation results
Figure: Zero‑shot generalization with and without text

References

Video generation models as world simulators

Kling AI

Stable video diffusion: Scaling latent video diffusion models to large datasets

Emu3: Next‑Token Prediction is All You Need

Autoregressive image generation without vector quantization

MAGVIT: Masked generative video transformer

Open‑Sora Plan: Open‑source large video generation model

MaskGIT: Masked generative image transformer

Tags: text‑to‑video, NOVA, benchmark results, diffusion loss, autoregressive video generation, non‑quantized modeling
Written by AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.