Plug‑and‑Play reAR Boosts Visual AR to SOTA Quality with Only 177M Parameters
The paper introduces reAR, a plug‑and‑play regularization framework that aligns generator and tokenizer representations in visual autoregressive models. It substantially improves image quality, matching large diffusion models with far fewer parameters, and the approach is validated through extensive experiments, ablations, and scalability analysis.
Highlights
Identifies generator‑tokenizer inconsistency as the main bottleneck of visual autoregressive (AR) generation.
Proposes reAR, a plug‑and‑play training regularization that injects visual inductive bias from the tokenizer and mitigates exposure bias.
Demonstrates that reAR with only 177 M parameters achieves a gFID of 1.42, comparable to a 675 M‑parameter state‑of‑the‑art diffusion model.
Problem Statement
Visual AR models lag behind diffusion models in image generation quality. The authors pinpoint the core issue as a mismatch between the generator and the tokenizer: the tokenizer cannot reliably decode the token sequences produced by the generator, leading to exposure bias and embedding unawareness.
Proposed Solution: reAR
reAR is a plug‑and‑play regularization framework consisting of two complementary strategies:
Noisy Context Regularization: During training, uniform noise is added to the input token sequence, reducing reliance on a clean context and improving robustness to exposure bias (see the sketch after this list).
Codebook Embedding Regularization: Aligns the generator’s hidden states with the tokenizer’s embedding space by training the transformer to recover the visual embedding of the current token in shallow layers and to predict the embedding of the next token in deeper layers.
The framework requires no changes to the tokenizer, generation order, inference pipeline, or external models.
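The noisy‑context idea fits in a few lines. Below is a minimal sketch, assuming discrete VQ token indices and a PyTorch‑style training loop; the function names, the uniform replacement rule, and the schedule shape are illustrative reconstructions of the description above, not the paper’s code.

```python
import torch

def corrupt_context(tokens: torch.Tensor, codebook_size: int, p: float) -> torch.Tensor:
    """Replace each input token with a uniformly sampled codebook index with
    probability p, so the generator learns to predict well from imperfect
    contexts (mitigating exposure bias). Loss targets stay the clean tokens
    (an assumption consistent with the exposure-bias motivation)."""
    noise = torch.randint_like(tokens, high=codebook_size)      # uniform random codebook indices
    mask = torch.rand(tokens.shape, device=tokens.device) < p   # positions to corrupt
    return torch.where(mask, noise, tokens)

def noise_level(step: int, total_steps: int, max_p: float = 0.5) -> float:
    """Linear schedule for the noise level; 0.5 is the best-performing endpoint
    reported in the ablations below, but the exact ramp direction/endpoints
    are an assumption here."""
    return max_p * min(step / total_steps, 1.0)

# Hypothetical use inside a training step:
# noisy = corrupt_context(gt_tokens, codebook_size=1024, p=noise_level(step, total_steps))
```

The corrupted sequence replaces the clean context fed to the generator during training only; inference is unchanged, which is what keeps the method plug‑and‑play.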
Technical Details
Decoder‑only Transformer is used for next‑token prediction.
Regularization targets are added at specific layers (e.g., EN@0 for embedding alignment, DE@15 for decoding alignment); a sketch follows this list.
Linear annealing schedule controls the noise level during training.
Key equations are given in the figures of the original paper and are not reproduced here.
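To make the layer‑wise targets concrete, here is a rough sketch of the codebook‑embedding regularization under a few assumptions: a frozen tokenizer codebook, a 2‑layer MLP projection head (per the implementation details later), MSE as the distance, and a simple current/next‑token alignment. The paper’s exact loss form may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjHead(nn.Module):
    """2-layer MLP mapping transformer hidden states into the tokenizer's
    codebook embedding space (hidden width 2048, per the setup below)."""
    def __init__(self, d_model: int, d_code: int, d_hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_code))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def embedding_reg_losses(hiddens, gt_tokens, codebook, enc_head, dec_head,
                         enc_layer: int = 0, dec_layer: int = 15):
    """hiddens: dict {layer_idx: (B, T, d_model)} of per-layer hidden states
    aligned with the T input tokens; gt_tokens: (B, T) ground-truth VQ indices;
    codebook: (V, d_code) frozen tokenizer embeddings.
    EN@enc_layer recovers the embedding of the *current* token,
    DE@dec_layer predicts the embedding of the *next* token."""
    emb = codebook[gt_tokens].detach()                                        # (B, T, d_code)
    loss_enc = F.mse_loss(enc_head(hiddens[enc_layer]), emb)                  # current-token recovery
    loss_dec = F.mse_loss(dec_head(hiddens[dec_layer][:, :-1]), emb[:, 1:])   # next-token prediction
    return loss_enc, loss_dec
```

In training these terms would be added to the standard next‑token cross‑entropy loss; the ablations below report that the exact regularization weight matters little.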
Experimental Setup
Datasets and evaluation: ImageNet‑1K (256×256) with ADM protocol, generating 50k images per model. Metrics: FID (lower is better) and IS (higher is better). Baselines include diffusion models, mask‑generation models, VAR, MAR, and standard raster AR.
Model configuration: MaskGIT VQGAN (rFID = 1.97) as tokenizer, DiT‑style AR backbone. Three model sizes – reAR‑S (177 M), reAR‑B (201 M), reAR‑L (400 M) – with 20/24/24 transformer layers and hidden sizes 768/768/1024. Experiments also combine reAR with TiTok and AliTok tokenizers.
Training: 8 × A800 GPUs, 400 epochs, batch size 2048, AdamW optimizer, gradient clipping (norm = 1), learning‑rate warm‑up to 1e‑3 followed by cosine decay, class‑label dropout 0.1 (a configuration sketch follows this list).
Implementation: Linear schedule for the noise level, 2‑layer MLP (hidden = 2048) for embedding projection, regularization applied at layer 0 (encoding) and layer 15 (decoding) for reAR‑S/B/L.
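As a quick reference, a hedged sketch of the optimizer and learning‑rate schedule implied by the recipe above (AdamW, warm‑up to a peak of 1e‑3, cosine decay, gradient clipping at norm 1); the warm‑up length, betas, and weight decay are not stated in this summary and are placeholders.

```python
import math
import torch

def build_optimizer(model: torch.nn.Module, peak_lr: float = 1e-3):
    # AdamW as in the training setup; betas and weight_decay are assumed values.
    return torch.optim.AdamW(model.parameters(), lr=peak_lr,
                             betas=(0.9, 0.95), weight_decay=0.05)

def lr_at_step(step: int, warmup_steps: int, total_steps: int,
               peak_lr: float = 1e-3) -> float:
    """Warm-up to peak_lr, then cosine decay to zero (schedule shape only;
    warmup_steps is a placeholder, not a value from the paper)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Per step, gradients are clipped before the optimizer update:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```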
Results
Generation quality: reAR‑S achieves FID = 2.00 and IS = 295.7, surpassing LlamaGen‑XL (FID = 2.34, IS = 253.9) with only 14% of the parameters.
Parameter efficiency: With 177 M parameters, reAR matches the gFID = 1.42 of a 675 M‑parameter diffusion model (REPA).
Generalization: Improves performance on non‑standard tokenizers (TiTok: FID 4.45 → 4.01; AliTok: FID 1.50 → 1.42).
Scaling behavior: FID decreases steadily as model size and training epochs increase, indicating good scalability.
Sampling speed: The KV cache enables reAR to sample faster than diffusion models and MAR; reAR‑B‑AliTok outperforms parallel‑decoding methods (MaskBit, TiTok, VAR) in both speed and FID (a sampling sketch follows this list).
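For context on the speed claim, a minimal sketch of KV‑cached raster‑order sampling; the `past_key_values`/`use_cache` interface follows common transformer‑library conventions and is an assumption, not the paper’s API.

```python
import torch

@torch.no_grad()
def sample_image_tokens(generator, class_token: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Autoregressive sampling with a KV cache: each step feeds only the newest
    token and reuses cached keys/values of earlier positions, keeping per-step
    cost roughly constant instead of re-encoding the whole prefix."""
    tokens, past_kv, x = [], None, class_token          # x: (B, 1) conditioning token
    for _ in range(seq_len):
        logits, past_kv = generator(x, past_key_values=past_kv, use_cache=True)
        probs = torch.softmax(logits[:, -1], dim=-1)
        x = torch.multinomial(probs, num_samples=1)     # next VQ token, (B, 1)
        tokens.append(x)
    return torch.cat(tokens, dim=1)                     # (B, seq_len) indices for the tokenizer decoder
```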
Ablation Studies
Key components were examined:
Regularization layer selection: Encoding regularization works best at the first layer; decoding regularization peaks at layer 15. Applying regularization to layers farther from these points degrades quality.
Regularization weight: Varying the weight has negligible impact, likely due to AdamW’s scale‑invariance.
Noise schedule: A fixed noise level (0.2) yields FID 2.12; randomizing the noise in [0, 0.5] improves it to 2.05; linear annealing to 0.5 achieves the best FID = 2.00.
Joint effect: Combining noisy‑context and codebook‑embedding regularization reduces FID from 2.12/2.18 (single components) to 2.00.
Conclusion
The study identifies generator‑tokenizer mismatch as the primary bottleneck of visual AR generation and proposes reAR, a simple yet effective regularization that dramatically improves quality, generalization, and efficiency without altering core model components. The authors hope this work encourages future research toward unified multimodal models.
References
[1] REAR: Rethinking Visual Autoregressive Models via Generator‑Tokenizer Consistency Regularization