SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL
SimpleAR demonstrates that a vanilla autoregressive model with only 0.5 B parameters can generate high‑fidelity 1024×1024 images. Through pretraining, supervised fine‑tuning, and reinforcement learning, it achieves competitive GenEval (0.59) and DPG‑Bench (79.66) scores while cutting inference time to about 14 seconds with vLLM and KV‑cache optimizations.
SimpleAR: Minimal Autoregressive Visual Generation Framework
SimpleAR is a compact autoregressive (AR) visual generation model (0.5 B parameters) that integrates large‑scale pretraining, supervised fine‑tuning (SFT), and reinforcement learning (RL) to generate 1024×1024 images with high fidelity.
Why Autoregressive Models?
AR models predict each visual token sequentially, giving precise fine‑grained control and natural multimodal alignment. Historically they have lagged behind diffusion models because (1) discrete visual tokenizers limit quality and (2) visual token sequences are much longer than text, making long‑range dependency modeling harder.
Three‑Stage Training Strategy
Pretraining on diverse image‑text corpora (CC3M, CC12M, OpenImages, SAM1B, MegaImage ≈ 43 M samples) to learn generic visual‑language patterns.
SFT on curated high‑resolution datasets (JourneyDB, synthetic 1 M, internal 10 M) to improve image fidelity and instruction alignment.
RL using Group Relative Policy Optimization (GRPO) to further align multimodal outputs and reduce bias.
Pretraining and SFT Details
Both stages use a language‑modeling loss that predicts the next token conditioned on all previous visual tokens and the current text token. The visual tokenizer is Cosmos‑Tokenizer (codebook size 64 K, down‑sampling ratio 16). The transformer follows the Qwen decoder architecture and is initialized with large‑language‑model weights.
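The language‑modeling objective described above can be sketched in a few lines. This is a minimal numpy illustration (not the authors' code): it computes the average next‑token cross‑entropy over a sequence of ground‑truth token ids, where the targets would be the text tokens followed by the flattened visual tokens.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of predicting each token from its prefix.

    logits:  (seq_len, vocab) unnormalized scores, one row per position
    targets: (seq_len,) ground-truth token ids (text tokens followed by
             the flattened visual tokens, as in the training setup above)
    """
    # log-softmax computed stably by subtracting the per-row max
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # negative log-likelihood of the observed tokens, averaged over positions
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With uniform (all-zero) logits over a vocabulary of size V, the loss is exactly log V, which is a handy sanity check when wiring up a tokenizer with a 64 K codebook.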
Reinforcement Learning with GRPO
After SFT, GRPO trains a policy model against a frozen reference model. For each text prompt the policy samples a group of candidate token sequences; the objective maximizes a reward composed of a CLIP‑ViT‑H‑14 score (or that of its human‑preference fine‑tuned variant, HPSv2) minus a KL penalty:

R = clip_score − β · KL(policy ‖ reference)

In the ablation, CLIP‑ViT‑H‑14 yields a larger GenEval gain (+0.6 points for the 0.5 B model) than HPSv2.
Inference and Acceleration Techniques
Inference proceeds token‑by‑token with top‑k sampling (k = 64,000, i.e., the full codebook, so no candidates are pruned). Classifier‑free guidance (CFG) is applied to boost quality. To mitigate AR latency, SimpleAR adopts:
KV Cache : stores key‑value pairs from previous attention layers, reducing per‑step computation.
vLLM Serving : uses paged attention and optimized memory management for high‑throughput, low‑latency decoding.
Speculative Jacobi Decoding (SJD) : candidate tokens are drafted in parallel via Jacobi iterations and verified by the model itself, roughly halving the number of decoding steps.
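A single CFG decoding step combines a conditional and an unconditional forward pass before sampling. The sketch below illustrates that combination plus top‑k filtering (the guidance scale and function names are illustrative assumptions, not taken from the paper); with k equal to the full codebook, as above, the filter is a no‑op.

```python
import numpy as np

def cfg_topk_sample(cond_logits, uncond_logits, scale=3.0, k=64000, rng=None):
    """One decoding step: classifier-free guidance + top-k sampling.

    cond_logits / uncond_logits: (vocab,) logits from the prompt-conditioned
    and unconditional forward passes. scale and k are illustrative defaults.
    """
    rng = rng or np.random.default_rng()
    # CFG: push the distribution away from the unconditional prediction
    logits = uncond_logits + scale * (cond_logits - uncond_logits)
    # keep only the k highest-scoring tokens (k >= vocab disables pruning)
    if k < len(logits):
        cutoff = np.partition(logits, -k)[-k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

In a real decoder this step runs inside a loop where the two forward passes reuse cached key‑value pairs, which is where the KV‑cache savings reported below come from.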
Experimental Setup
All experiments run on 32 × NVIDIA A100 GPUs. Learning rates: 1e‑4 (pretraining), 2e‑5 (SFT), 1e‑5 (RL). Batch sizes: 256 (pretraining/SFT), 28 (RL). No warm‑up or LR decay. AdamW optimizer is used throughout.
Quantitative Results
SimpleAR (0.5 B) achieves GenEval 0.59 and DPG‑Bench 79.66, outperforming other sub‑1 B models and matching larger diffusion baselines that require separate text encoders. Scaling to 1.5 B improves GenEval by +0.04 and DPG‑Bench by +1.85, demonstrating predictable scaling similar to LLMs.
RL Ablation Study
Two reward modules were compared: CLIP‑ViT‑H‑14 and its fine‑tuned variant HPSv2. Both improve performance, but CLIP yields a larger GenEval gain (+0.6). Qualitative samples show better rendering of quantities and spatial relations when CLIP is used.
Inference Speedup
On an A100 node with CFG enabled, KV‑Cache reduces inference time by 34 %. Adding vLLM further cuts the generation time for a 1024×1024 image to 13.55 s. Speculative Jacobi Decoding halves the number of decoding steps, offering a modest DPG boost but not reducing wall‑clock latency because it cannot be combined with KV‑Cache.
Qualitative Samples and Failure Cases
High‑fidelity examples demonstrate strong instruction alignment on complex prompts. Failure cases reveal limitations in generating intricate poses, objects, or text, and occasional violations of physical laws due to limited data and model size.
References
Aligning Text-to-Image Models using Human Feedback
Diffusion Model Alignment Using Direct Preference Optimization
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Paper: http://arxiv.org/pdf/2504.11455
Code: http://github.com/wdrink/SimpleAR
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.