PKU Introduces Next Patch Prediction for Image Generation, Cutting Training Cost to ~0.6×
The paper proposes a Next Patch Prediction (NPP) paradigm that groups image tokens into high‑density patches, enabling autoregressive models to predict patches instead of individual tokens, which reduces training cost to about 0.6× and improves ImageNet FID scores by up to 1.0 across models ranging from 100 M to 1.4 B parameters.
Overview
Autoregressive models generate sequences by predicting the next token. In visual generation, discrete tokenizers such as VQVAE and VQGAN convert images into token streams that can be modeled autoregressively, as in DALL‑E and Parti. The paper revisits next‑token prediction for images and proposes Next Patch Prediction (NPP), which groups consecutive image tokens into high‑density patches and trains the model to predict the next patch instead of a single token.
Method
1. Tokenization and embedding – An input image is encoded by a VQGAN tokenizer (vocabulary size 16,384) into a sequence of discrete token indices. Each index is mapped to an embedding vector.
2. Patch formation – A fixed integer k (tokens per patch) determines how many consecutive embeddings are averaged to produce one patch embedding. The resulting patch sequence length is reduced by a factor of k. No additional parameters are introduced; the averaging operation is parameter‑free.
3. Patch cross‑entropy loss – Because a patch does not have a single ground‑truth token, the loss uses the k ground‑truth token indices belonging to the *next* patch. For a predicted patch distribution p and the set of true token indices {t_1,…,t_k}, the loss is
   CE_{patch} = -\frac{1}{k}\sum_{i=1}^{k}\log p(t_i),
   which allows standard cross‑entropy training without extra heads.
4. Multi‑scale coarse‑to‑fine schedule – Training starts with a large patch size PS_{large} (e.g., 8×8 tokens), which yields a very short sequence. After a predefined number of steps, a scheduling factor reduces the patch size to PS_{small} (e.g., 2×2 tokens). Each stage uses its own learning‑rate warm‑up (1 epoch) followed by linear decay. Because the early, large‑patch stages run on much shorter sequences than token‑level training, the overall compute cost drops to roughly 0.6× of the baseline.
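Steps 2 and 3 can be illustrated with a minimal pure‑Python sketch. It follows the averaging rule and the loss formula as stated in this summary, not the official repository; the function names `group_into_patches` and `patch_cross_entropy` and the toy two‑dimensional embeddings are assumptions made for illustration.

```python
import math

def group_into_patches(token_embeddings, k):
    """Average every k consecutive token embeddings into one patch embedding.

    Parameter-free: the output has len(token_embeddings) // k entries.
    """
    if len(token_embeddings) % k != 0:
        raise ValueError("sequence length must be divisible by k")
    patches = []
    for start in range(0, len(token_embeddings), k):
        group = token_embeddings[start:start + k]
        dim = len(group[0])
        patches.append([sum(vec[d] for vec in group) / k for d in range(dim)])
    return patches

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def patch_cross_entropy(logits, target_tokens):
    """CE_patch = -(1/k) * sum_i log p(t_i) over the k tokens of the next patch."""
    log_p = log_softmax(logits)
    return -sum(log_p[t] for t in target_tokens) / len(target_tokens)

# Toy data (hypothetical): 4 token embeddings of dimension 2, grouped with k = 2.
embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
patches = group_into_patches(embeddings, k=2)

# A uniform prediction over a 4-entry vocabulary, scored against the
# two ground-truth tokens of the next patch.
uniform_loss = patch_cross_entropy([0.0, 0.0, 0.0, 0.0], [1, 3])
```

With a uniform prediction every target token has probability 1/4, so the loss equals log 4 regardless of which tokens the next patch contains. A real implementation would operate on batched tensors and a 16,384‑entry vocabulary.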
Experimental Setup
Backbone: LlamaGen autoregressive transformer, with the architecture unchanged from the original LlamaGen paper.
Models: parameter counts of 100 M, 400 M, 1.0 B and 1.4 B.
Dataset: ImageNet‑1K class‑conditional generation, resolution 256×256.
Training: 300 epochs, batch size 256, AdamW optimizer, weight decay 0.01 (as in LlamaGen), gradient clipping 1.0, dropout 0.1 for both the transformer backbone and class‑token embeddings.
Learning‑rate schedule: per‑patch‑size segment, warm‑up for the first epoch of the segment, then linear decay to zero over the remaining epochs of that segment.
Baselines: state‑of‑the‑art GANs, diffusion models, mask‑prediction models, and the original LlamaGen autoregressive model.
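The per‑segment learning‑rate schedule described above (warm‑up during the first epoch of a patch‑size segment, then linear decay to zero) can be sketched as follows. This is an illustration under stated assumptions: the function name `segment_lr`, the peak learning rate of 1e-4, and the toy step counts are hypothetical and are not taken from the paper or its code.

```python
def segment_lr(step, segment_start, segment_steps, warmup_steps, peak_lr):
    """Learning rate at a global step inside one patch-size segment."""
    local = step - segment_start
    if not 0 <= local < segment_steps:
        raise ValueError("step lies outside this segment")
    if local < warmup_steps:
        # Linear warm-up: reaches peak_lr on the last warm-up step.
        return peak_lr * (local + 1) / warmup_steps
    # Linear decay: starts at peak_lr and would hit zero at the segment end.
    decay_steps = segment_steps - warmup_steps
    return peak_lr * (segment_steps - local) / decay_steps

# Toy numbers (assumed): 5 steps per epoch, a 4-epoch segment, 1 warm-up epoch.
STEPS_PER_EPOCH = 5
SEGMENT_STEPS = 4 * STEPS_PER_EPOCH
PEAK_LR = 1e-4  # hypothetical value, not stated in the article

schedule = [
    segment_lr(s, 0, SEGMENT_STEPS, STEPS_PER_EPOCH, PEAK_LR)
    for s in range(SEGMENT_STEPS)
]

# The next segment restarts its own warm-up from its first step.
next_segment_first_lr = segment_lr(
    SEGMENT_STEPS, SEGMENT_STEPS, SEGMENT_STEPS, STEPS_PER_EPOCH, PEAK_LR
)
```

Restarting the warm‑up whenever the patch size changes gives each stage a gentle start on its new sequence length, which is one plausible reading of "per‑patch‑size segment".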
Results
Across all model scales, NPP improves ImageNet FID while reducing training compute.
LlamaGen‑B (≈100 M) and LlamaGen‑L (≈400 M) achieve an FID reduction of up to 1.0 point compared with the baseline, with total training cost ≈ 0.6×.
LlamaGen‑XL (≈1.0 B) and LlamaGen‑XXL (≈1.4 B) trained at 256×256 with NPP surpass baseline models that were trained at the higher 384×384 resolution.
The cost‑vs‑quality trade‑off is illustrated in the paper’s Table 1: as k increases, sequence length shrinks, compute drops proportionally, and FID degrades only marginally until a threshold where patch granularity becomes too coarse.
Qualitative figures show the progressive patch grouping (large patches early, finer patches later) and the resulting image samples.
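The relationship between k, sequence length, and compute noted above can be made concrete with a rough estimate. The sketch assumes that per‑step cost scales linearly with sequence length (it ignores the quadratic attention term) and uses an illustrative two‑stage schedule; the stage fractions, the 256‑token sequence, and the helper names are hypothetical and are not the paper's settings.

```python
def patch_sequence_length(num_tokens, tokens_per_patch):
    """Sequence length after grouping tokens_per_patch tokens into one patch."""
    if num_tokens % tokens_per_patch != 0:
        raise ValueError("num_tokens must be divisible by tokens_per_patch")
    return num_tokens // tokens_per_patch

def relative_training_cost(stages):
    """Cost relative to token-level training.

    stages: list of (fraction_of_training_steps, tokens_per_patch) pairs.
    Assumes cost per step is proportional to sequence length.
    """
    total_fraction = sum(fraction for fraction, _ in stages)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    return sum(fraction / k for fraction, k in stages)

# Illustrative only: a 256-token image grouped into patches of 4 tokens.
short_sequence = patch_sequence_length(256, 4)

# Hypothetical schedule: half the steps with 4-token patches,
# half at token level (1 token per patch).
cost = relative_training_cost([(0.5, 4), (0.5, 1)])
```

Under these assumptions the schedule costs 0.5/4 + 0.5 = 0.625 of token‑level training, which shows how a coarse early stage can bring the total near the reported 0.6× without claiming to reproduce the paper's actual schedule.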
Conclusion
Next Patch Prediction groups discrete image tokens into dense patches, enabling a shorter autoregressive input sequence. The approach preserves the original transformer architecture, adds no trainable parameters, and requires only a simple averaging operation and a patch cross‑entropy loss. Empirically, it cuts training computation to about 0.6× and improves ImageNet FID by up to 1.0 point, making it a drop‑in enhancement for existing autoregressive visual generators.
Paper: https://arxiv.org/pdf/2412.15321
Code: https://github.com/PKU-YuanGroup/Next-Patch-Prediction
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
