
VAR: Scalable Image Generation via Next‑Scale Prediction Wins NeurIPS 2024 Best Paper

The Visual AutoRegressive (VAR) framework introduces a multi‑scale “next‑scale prediction” paradigm that dramatically improves image generation efficiency and quality, surpasses diffusion models on ImageNet, validates scaling laws in vision, and earned the Best Paper award at NeurIPS 2024.


The Visual AutoRegressive (VAR) model, co‑developed by Peking University and ByteDance, won the Best Paper award at NeurIPS 2024 by introducing a groundbreaking "next‑scale prediction" approach that replaces traditional pixel‑by‑pixel generation with multi‑scale token generation.

From "pixel‑by‑pixel" to "scale‑by‑scale"

VAR encodes an image into a hierarchy of tokens using a multi‑scale VQVAE. Generation starts from the lowest‑resolution global token and progressively refines higher‑resolution details, enabling parallel prediction at each scale and preserving spatial locality.

Implementation details

Encode the original image into multi‑resolution tokens via a multi‑scale VQVAE.

At each resolution, a parallel autoregressive model predicts tokens using context from the lower‑resolution layer.

This design avoids flattening the image into a 1‑D sequence, thus maintaining spatial structure while enabling massive parallelism.
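The coarse‑to‑fine loop above can be sketched in a few lines. This is a schematic stand‑in, not the paper's implementation: the `upsample` and `predict_scale` helpers below are hypothetical placeholders for the multi‑scale VQVAE's residual upsampling and the trained transformer, and the scale list is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(tokens, size):
    """Nearest-neighbor upsample of a token map to the next resolution
    (illustrative stand-in for the VQVAE's cross-scale conditioning)."""
    reps = size // tokens.shape[0]
    return np.kron(tokens, np.ones((reps, reps), dtype=tokens.dtype))

def predict_scale(context, size, vocab=16):
    """Stand-in for the transformer: every token at this scale is produced
    in one parallel step, conditioned on the lower-resolution context."""
    return (context + rng.integers(0, vocab, (size, size))) % vocab

scales = [1, 2, 4, 8]                  # token-map resolutions, coarse to fine
tokens = np.zeros((1, 1), dtype=int)   # the 1x1 global token
for size in scales[1:]:
    context = upsample(tokens, size)   # condition on the lower scale
    tokens = predict_scale(context, size)

print(tokens.shape)  # (8, 8): four sequential steps, one per scale
```

The key structural point the sketch preserves: sequential dependence exists only *between* scales, so each 2‑D token map is generated in a single parallel step without ever being flattened into a 1‑D sequence.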

Efficiency gains

Traditional autoregressive models require O(N) sequential steps to generate N tokens one at a time, while VAR needs only O(log N) sequential steps because each scale is generated in a single parallel pass and resolutions grow geometrically. In practice, VAR is about 20× faster than diffusion models and approaches real‑time GAN speeds.
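The step‑count arithmetic behind that claim is easy to verify. The numbers below are illustrative (a 16×16 token map with scales doubling from 1×1), not measurements from the paper:

```python
import math

def ar_steps(n_tokens: int) -> int:
    # Classic raster-order AR: one sequential step per token.
    return n_tokens

def var_steps(base: int, finest: int) -> int:
    # VAR-style generation: one parallel step per scale; with resolutions
    # doubling from base to finest, sequential steps grow like log2(N).
    return int(math.log2(finest // base)) + 1

n = 16 * 16                 # a 16x16 token map, 256 tokens
print(ar_steps(n))          # 256 sequential steps
print(var_steps(1, 16))     # 5 sequential steps: 1, 2, 4, 8, 16
```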

Scaling law in visual generation

Experiments show a clear power‑law relationship between model size, compute, and performance (linear in log‑log space), mirroring the scaling laws observed in large language models. Larger VAR models consistently achieve lower FID and higher IS, confirming that the law holds in vision.

Performance on ImageNet‑256

VAR achieves an FID of 1.73 (vs. 2.27 for DiT‑XL/2 and 15.78 for VQGAN) and an IS of 350.2, while requiring only 10 inference steps—about 20× fewer than diffusion models.

VAR also demonstrates zero‑shot generalization to tasks such as in‑painting, out‑painting, and conditional editing without additional training.

Future directions

Planned extensions include coupling VAR with large language models for text‑to‑image generation, expanding the multi‑scale paradigm to video generation, and applying the technology to game development, film VFX, and educational visualizations.

Project repository: https://github.com/FoundationVision/VAR

Tags: image generation, scaling laws, NeurIPS 2024, visual AI, autoregressive models
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
