VAR: Scalable Image Generation via Next‑Scale Prediction Wins NeurIPS 2024 Best Paper
The VAR model, a Visual AutoRegressive framework that introduces a novel multi‑scale “next‑scale prediction” paradigm, dramatically improves image generation efficiency and quality, surpasses diffusion models, validates scaling laws in vision, and earned the Best Paper award at NeurIPS 2024.
The Visual AutoRegressive (VAR) model, co‑developed by Peking University and ByteDance, won the Best Paper award at NeurIPS 2024 by introducing a groundbreaking "next‑scale prediction" approach that replaces traditional pixel‑by‑pixel generation with multi‑scale token generation.
From "pixel‑by‑pixel" to "scale‑by‑scale"
VAR encodes an image into a hierarchy of tokens using a multi‑scale VQVAE. Generation starts from the lowest‑resolution global token and progressively refines higher‑resolution details, enabling parallel prediction at each scale and preserving spatial locality.
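To make the hierarchy concrete, here is a small sketch of the token counts such a multi-scale tokenizer produces. The scale schedule below is an illustrative assumption for a 256×256 image whose finest latent map is 16×16, not a quote of the official configuration:

```python
# Illustrative sketch (not the official implementation): side lengths of the
# token maps from coarsest to finest. The exact schedule is an assumption.
scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]

tokens_per_scale = [s * s for s in scales]   # tokens predicted at each scale
total_tokens = sum(tokens_per_scale)         # total tokens across all scales

print(tokens_per_scale)  # [1, 4, 9, 16, 25, 36, 64, 100, 169, 256]
print(total_tokens)      # 680
```

Each scale is generated in a single parallel step, so the whole image costs as many sequential steps as there are scales, not as many as there are tokens.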
Implementation details
Encode the original image into multi‑resolution tokens via a multi‑scale VQVAE.
At each resolution, the autoregressive transformer predicts all tokens of that scale in parallel, conditioned on the tokens of every coarser scale generated so far.
This design avoids flattening the image into a 1‑D sequence, thus maintaining spatial structure while enabling massive parallelism.
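The steps above can be sketched as a coarse-to-fine generation loop. The `predict_scale` function below is a deterministic stand-in for a real transformer forward pass, and the scale schedule is assumed; this is a structural sketch, not VAR's actual code:

```python
import random

VOCAB_SIZE = 4096
SCALES = [1, 2, 4, 8, 16]  # assumed side lengths of the token maps

def predict_scale(context_tokens, side):
    """Stand-in for a transformer forward pass: returns side*side token ids
    sampled in parallel, conditioned on the flattened coarser-scale context."""
    rng = random.Random(len(context_tokens))  # deterministic dummy sampler
    return [rng.randrange(VOCAB_SIZE) for _ in range(side * side)]

def generate():
    context = []      # tokens from all coarser scales, in coarse-to-fine order
    token_maps = []
    for side in SCALES:
        new_tokens = predict_scale(context, side)  # one parallel step per scale
        token_maps.append(new_tokens)
        context.extend(new_tokens)                 # condition later scales on it
    return token_maps  # a real system would decode these with the VQVAE decoder

maps = generate()
print([len(m) for m in maps])  # [1, 4, 16, 64, 256]
```

Note that the loop never flattens the image into a raster-order 1-D sequence; within a scale, all positions are produced together.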
Efficiency gains
A conventional autoregressive model runs one sequential step per token, so generating an N-token image costs O(N) steps. VAR instead needs only one parallel step per scale, roughly O(log N) steps in total. In practice, VAR is about 20× faster than diffusion models and approaches real-time GAN speeds.
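A quick back-of-envelope comparison of sequential step counts, assuming a 16×16 final token map and a schedule that roughly doubles the side length per scale (an illustrative assumption, not the paper's exact schedule):

```python
# Token-by-token AR: one sequential step per token.
side = 16
ar_steps = side * side          # 256 sequential steps

# Scale-by-scale: one parallel step per scale; doubling the side length each
# time yields O(log N) scales.
scales = []
s = 1
while s < side:
    scales.append(s)
    s *= 2
scales.append(side)             # [1, 2, 4, 8, 16]
var_steps = len(scales)         # 5 parallel steps

print(ar_steps, var_steps)      # 256 5
```

The gap widens with resolution: doubling the image side quadruples `ar_steps` but adds only one scale.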
Scaling law in visual generation
Experiments show a clear power-law relationship between model size, compute, and performance, appearing as straight lines on log-log plots and mirroring the scaling laws observed in large language models. Larger VAR models consistently achieve lower FID and higher Inception Score (IS), supporting the law's validity in vision.
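To see why a power law reads as a straight line on log-log axes, here is a toy calculation with synthetic numbers (not VAR's measured data): for a loss of the form L = a·C^(−b), the slope between any two points in log-log space recovers −b.

```python
import math

# Synthetic power law: L = a * C**(-b). The constants are arbitrary.
a, b = 10.0, 0.25
compute = [1e18, 1e19, 1e20, 1e21]
loss = [a * c ** (-b) for c in compute]

# Slope between consecutive points in log-log space recovers -b exactly.
slopes = [
    (math.log(loss[i + 1]) - math.log(loss[i]))
    / (math.log(compute[i + 1]) - math.log(compute[i]))
    for i in range(len(loss) - 1)
]
print([round(s, 3) for s in slopes])  # [-0.25, -0.25, -0.25]
```

A constant log-log slope across scales is the signature the scaling-law experiments look for.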
Performance on ImageNet‑256
VAR achieves an FID of 1.73 (vs. 2.27 for DiT-XL/2 and 15.78 for VQGAN) and an Inception Score of 350.2, while requiring only 10 inference steps, roughly 20× fewer than typical diffusion samplers.
VAR also demonstrates zero‑shot generalization to tasks such as in‑painting, out‑painting, and conditional editing without additional training.
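One way such zero-shot editing can work with a token-based generator is "teacher forcing": tokens from the known regions of the encoded source image are kept fixed, and only the masked positions are sampled. The sketch below uses hypothetical helper names and a dummy sampler; it illustrates the masking idea, not the official VAR API:

```python
import random

VOCAB_SIZE = 4096

def sample_token(position, context):
    """Stand-in for the model sampling one token id (deterministic dummy)."""
    rng = random.Random(position * 7919 + len(context))
    return rng.randrange(VOCAB_SIZE)

def inpaint(source_tokens, mask):
    """mask[i] is True where the image should be regenerated."""
    out = []
    for i, regenerate in enumerate(mask):
        if regenerate:
            out.append(sample_token(i, out))   # model fills the masked token
        else:
            out.append(source_tokens[i])       # known token is forced through
    return out

src = list(range(16))              # pretend 4x4 token map from the encoder
mask = [False] * 8 + [True] * 8    # regenerate the bottom half
result = inpaint(src, mask)
print(result[:8])  # [0, 1, 2, 3, 4, 5, 6, 7]  -- unmasked tokens preserved
```

Because the generator already models tokens conditioned on arbitrary context, no additional training is needed for this kind of editing.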
Future directions
Planned extensions include coupling VAR with large language models for text‑to‑image generation, expanding the multi‑scale paradigm to video generation, and applying the technology to game development, film VFX, and educational visualizations.
Project repository: https://github.com/FoundationVision/VAR
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.