How NOVA Generates High‑Quality Video Autoregressively Without Vector Quantization

This article provides an in‑depth analysis of the NOVA model, a non‑quantized autoregressive video generation framework that combines frame‑by‑frame temporal prediction with set‑by‑set spatial prediction, uses diffusion loss for token estimation, and achieves state‑of‑the‑art results on multiple video and image benchmarks.

AI Frontier Lectures

Introduction

The NOVA paper (ICLR 2025) proposes a highly efficient autoregressive video generation method that eliminates the need for vector quantization, reducing training cost while improving generation quality and zero‑shot generalization for both text‑to‑image and text‑to‑video tasks.

NOVA Model Overview

NOVA treats video generation as two coupled problems: (1) temporal frame‑by‑frame prediction and (2) spatial set‑by‑set prediction within each frame. It retains GPT‑style causal modeling along the temporal axis and employs bidirectional attention for spatial modeling, enabling flexible in‑context capabilities.

Autoregressive Modeling in Video Generation

Traditional autoregressive video models use raster‑scan token ordering, which can be inefficient for long sequences. NOVA instead predicts entire frames sequentially (temporal) while predicting token sets within each frame in a random order (spatial), decoupling the two processes for better scalability.

Temporal Autoregressive Modeling

Frames are generated causally: each frame attends to the text prompt, the video flow, and all previous frames, while tokens within the current frame attend to one another bidirectionally. This is implemented with block‑wise causal attention masks.
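The block‑wise causal mask described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: prompt and video‑flow context tokens are omitted, and `blockwise_causal_mask` is a name chosen here for clarity.

```python
import numpy as np

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Build a block-wise causal attention mask (True = attention allowed).

    Tokens may attend to every token of earlier frames and to all tokens
    of their own frame (bidirectional within a frame); tokens of future
    frames are masked out.
    """
    n = num_frames * tokens_per_frame
    frame_idx = np.arange(n) // tokens_per_frame  # frame id of each token
    # token i may attend to token j iff frame(j) <= frame(i)
    return frame_idx[None, :] <= frame_idx[:, None]
```

Note that within a frame the mask is fully symmetric, which is what distinguishes this scheme from the strict raster‑scan causality of classic autoregressive models.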

[Figure: NOVA framework and inference process]

Spatial Set‑by‑Set Modeling

Inspired by MaskGIT and MAR, NOVA predicts token sets in a random order using a bidirectional Transformer decoder. Indicator features and a scaling‑shift layer help maintain consistency across frames, preventing collapse of image structure as frame count grows.
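The set‑by‑set ordering can be illustrated with a small sketch. The cosine reveal schedule below follows the MaskGIT convention the paper builds on, but the exact set sizes and the function name `set_prediction_order` are illustrative assumptions, not NOVA's published schedule.

```python
import numpy as np

def set_prediction_order(num_tokens: int, num_steps: int, rng=None):
    """Split a frame's token indices into randomly ordered, disjoint sets.

    Each step reveals a growing fraction of tokens (cosine schedule);
    together the sets cover the whole frame, and the model predicts one
    set per step conditioned on all previously revealed tokens.
    """
    rng = np.random.default_rng(rng)
    order = rng.permutation(num_tokens)
    # cosine schedule: cumulative fraction of tokens revealed by step t
    t = np.arange(1, num_steps + 1) / num_steps
    revealed = np.round(num_tokens * (1 - np.cos(t * np.pi / 2))).astype(int)
    revealed[-1] = num_tokens  # guarantee full coverage at the last step
    sets, start = [], 0
    for r in revealed:
        sets.append(order[start:r])
        start = r
    return sets
```

Early sets are small (the model commits to few tokens when little context exists) and later sets are large, which is why a handful of steps suffices per frame.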

[Figure: Spatial generalized autoregressive attention]

Diffusion Loss for Token Prediction

During training, NOVA adopts the diffusion loss from MAR to estimate per‑token probabilities in continuous space. Ground‑truth tokens are perturbed with Gaussian noise according to a noise schedule, and a small MLP predicts denoised tokens, enabling classifier‑free guidance at inference.
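The training objective can be sketched as follows. This is a minimal reconstruction under stated assumptions: `denoise_fn` stands in for NOVA's small denoising MLP, the linear beta schedule is a simplification (the paper follows IDDPM), and the noise‑prediction MSE form mirrors standard diffusion‑loss practice rather than quoting the paper's exact formulation.

```python
import numpy as np

def diffusion_loss(tokens, denoise_fn, num_timesteps=1000, rng=None):
    """Per-token diffusion loss on continuous latents (no vector quantization).

    A random timestep is drawn, the ground-truth tokens are perturbed with
    Gaussian noise per the schedule, and the denoiser is trained to predict
    the added noise.
    """
    rng = np.random.default_rng(rng)
    betas = np.linspace(1e-4, 0.02, num_timesteps)   # simplified linear schedule
    alpha_bar = np.cumprod(1.0 - betas)

    t = rng.integers(num_timesteps)                  # sample a timestep
    eps = rng.standard_normal(tokens.shape)          # Gaussian noise
    noisy = np.sqrt(alpha_bar[t]) * tokens + np.sqrt(1 - alpha_bar[t]) * eps
    pred = denoise_fn(noisy, t)                      # denoiser predicts the noise
    return np.mean((pred - eps) ** 2)                # MSE on the noise
```

Because the target lives in continuous space, no codebook is needed, and dropping the text condition during training enables classifier‑free guidance at inference.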

[Figure: Diffusion loss formulation]

Training Data

The authors first collected 16 M image‑text pairs from DataComp, COYO, Unsplash, and JourneyDB, then expanded to ~600 M pairs by selecting high‑aesthetic images from LAION, DataComp, and COYO. For video‑text data, they used 19 M pairs from the Panda‑70M subset, additional internal pairs, and 1 M high‑resolution pairs from Pexels.

Architecture & Training Details

Spatial AR layers and denoising MLP blocks follow the MAR design. The model comprises a temporal encoder (16 layers, 768‑dim), a spatial encoder (16 layers, 1024‑dim), and a decoder (16 layers, 1536‑dim). A 3‑layer denoising MLP (1280‑dim) is added. The VAE backbone is taken from Open‑Sora‑Plan, providing 4× temporal and 8× spatial compression. Masking and diffusion schedulers follow MAR and IDDPM, respectively, with 1000‑step noise schedules and 100 inference steps. Training proceeds by first pre‑training a text‑to‑image model, then fine‑tuning on text‑to‑video.
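The VAE compression factors determine the latent grid the Transformer actually operates on. The helper below is an illustrative sketch; ceiling division is an assumption about how non‑divisible boundaries are handled, not a detail from the paper.

```python
def latent_shape(frames: int, height: int, width: int,
                 t_stride: int = 4, s_stride: int = 8):
    """Latent grid size under the Open-Sora-Plan VAE's compression:
    4x temporal and 8x spatial downsampling, per the paper."""
    ceil_div = lambda a, b: -(-a // b)  # ceiling division
    return (ceil_div(frames, t_stride),
            ceil_div(height, s_stride),
            ceil_div(width, s_stride))
```

For example, a 16‑frame 512×512 clip maps to a 4×64×64 latent grid, which is the sequence the temporal and spatial layers consume.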

Experimental Results

NOVA achieves VBench 80.1 for video generation and GenEval 0.75 for text‑to‑image, surpassing many state‑of‑the‑art models while using a fraction of the parameters (0.6 B vs. 8‑9 B). Qualitative examples demonstrate strong visual fidelity, accurate color binding, and consistent object motion across varied prompts.

[Figure: Text‑to‑image evaluation]
[Figure: Text‑to‑video evaluation]

Conclusion

By removing vector quantization and combining autoregressive temporal modeling with set‑by‑set spatial prediction, NOVA delivers a compact yet powerful video generation system that bridges the gap between autoregressive and diffusion models, offering lower latency and competitive quality across benchmarks.

Tags: Video Generation, AI Research, NOVA, Autoregressive Model, Diffusion Loss, Non‑Quantized
Written by AI Frontier Lectures, a leading AI knowledge platform.