SimpleAR: High‑Quality 1024×1024 Images with Just 0.5B Parameters via Pretraining, SFT, and RL

SimpleAR demonstrates that a vanilla autoregressive model with only 0.5 B parameters can generate high‑fidelity 1024×1024 images. Trained in three stages (pretraining, supervised fine‑tuning, and reinforcement learning), it achieves competitive GenEval (0.59) and DPG‑Bench (79.66) scores while reducing inference time to about 14 seconds per image with vLLM and KV‑cache optimizations.


SimpleAR: Minimal Autoregressive Visual Generation Framework

SimpleAR is a compact autoregressive (AR) visual generation model (0.5 B parameters) that integrates large‑scale pretraining, supervised fine‑tuning (SFT), and reinforcement learning (RL) to generate 1024×1024 images with high fidelity.

Why Autoregressive Models?

AR models predict each visual token sequentially, giving precise fine‑grained control and natural multimodal alignment. Historically, they have lagged behind diffusion models because (1) discrete visual tokenizers limit reconstruction quality and (2) visual token sequences are much longer than text, making long‑range dependencies harder to model.
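As a point of reference, this sequential prediction is the standard autoregressive factorization over the visual token sequence (notation ours, not the paper's):

```latex
% Each visual token x_t is predicted from all previous tokens x_{<t}
% and the text condition c; generation runs left to right over the token grid.
p_\theta(x_{1:T} \mid c) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t},\, c)
```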

Three‑Stage Training Strategy

Pretraining on diverse image‑text corpora (CC3M, CC12M, OpenImages, SAM1B, and MegaImage; ≈43 M samples in total) to learn generic vision‑language patterns.

SFT on curated high‑resolution datasets (JourneyDB, a 1 M synthetic set, and a 10 M internal set) to improve image fidelity and instruction alignment.

RL using Group Relative Policy Optimization (GRPO) to further align multimodal outputs and reduce bias.

Pretraining and SFT Details

Both stages use a language‑modeling loss that predicts the next visual token conditioned on the text prompt and all previously generated visual tokens. The visual tokenizer is Cosmos‑Tokenizer (codebook size 64 K, down‑sampling ratio 16), so a 1024×1024 image maps to a 64×64 grid, i.e., 4,096 visual tokens. The transformer follows the Qwen decoder architecture and is initialized with large‑language‑model weights.
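A minimal sketch of this objective, assuming prompt tokens and visual tokens are concatenated into one sequence and the loss is applied only at visual positions (the HF‑style model interface, tensor names, and the masking choice are our assumptions, not the paper's):

```python
import torch
import torch.nn.functional as F

def lm_loss(model, text_ids, image_ids):
    """Next-token prediction loss over [text ; image] token sequences.

    text_ids:  (B, T_text) prompt tokens
    image_ids: (B, T_img)  visual tokens from the tokenizer
                           (64x64 = 4096 tokens for a 1024x1024 image)
    """
    seq = torch.cat([text_ids, image_ids], dim=1)   # (B, T)
    logits = model(seq[:, :-1]).logits              # predict token t from tokens < t
    targets = seq[:, 1:].clone()
    targets[:, : text_ids.shape[1] - 1] = -100      # supervise only visual tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```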

Reinforcement Learning with GRPO

After SFT, GRPO creates a trainable policy model and a frozen reference model. For each text prompt the policy samples a group of candidate token sequences; the objective maximizes a reward composed of a CLIP‑ViT‑H‑14 (or its fine‑tuned variant HPSv2) score minus a KL penalty against the reference model:

$$R = \text{clip\_score} - \beta \cdot \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$$

CLIP‑ViT‑H‑14 yields a larger GenEval gain (+0.6 points for the 0.5 B model) than HPSv2.
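A compact sketch of the group‑relative update, assuming G samples per prompt, a scalar CLIP reward, and one gradient step per sampled group (the `policy.sample` / `log_prob` interface, the clipping constant, and β are illustrative, not the paper's exact settings):

```python
import torch

def grpo_step(policy, ref, reward_fn, prompt, G=8, beta=0.01, eps=0.2):
    # 1) Sample a group of G candidate token sequences for one prompt.
    seqs = [policy.sample(prompt) for _ in range(G)]

    # 2) Score each sample, e.g., CLIP-ViT-H-14 similarity to the prompt.
    rewards = torch.tensor([reward_fn(prompt, s) for s in seqs])

    # 3) Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    loss = 0.0
    for s, a in zip(seqs, adv):
        logp = policy.log_prob(prompt, s)            # sum of token log-probs
        logp_ref = ref.log_prob(prompt, s).detach()  # frozen reference model
        # With one update per group, the "old" policy equals the current one,
        # so the importance ratio is 1 in value but carries the PPO gradient.
        ratio = torch.exp(logp - logp.detach())
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        policy_term = -torch.min(ratio * a, clipped * a)
        kl_term = beta * (logp - logp_ref)           # MC estimate of KL(policy || ref)
        loss = loss + (policy_term + kl_term) / G
    return loss
```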

Inference and Acceleration Techniques

Inference proceeds token by token with top‑k sampling, where k = 64,000 spans the full codebook (i.e., no tokens are filtered out). Classifier‑free guidance (CFG) is applied to boost quality. To mitigate AR latency, SimpleAR adopts the following techniques; a minimal decoding sketch follows the list:

KV Cache: caches the attention keys and values of previously generated tokens, so each decoding step only computes attention for the new token.

vLLM Serving: uses paged attention and optimized memory management for high‑throughput, low‑latency decoding.

Speculative Jacobi Decoding (SJD): drafts several future tokens in parallel and verifies them probabilistically, roughly halving the number of decoding steps.
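A sketch of a single CFG decoding step under these settings, assuming an HF‑style model and two prompt streams, conditional and unconditional (the function name and the guidance scale are illustrative; the paper's exact values may differ):

```python
import torch

@torch.no_grad()
def cfg_decode_step(model, cond_ids, uncond_ids, guidance_scale=6.0, top_k=64_000):
    """One CFG step: contrast conditional/unconditional logits, then sample."""
    logits_c = model(cond_ids).logits[:, -1]    # conditioned on the prompt
    logits_u = model(uncond_ids).logits[:, -1]  # conditioned on an empty prompt
    logits = logits_u + guidance_scale * (logits_c - logits_u)

    # Top-k filtering; with top_k equal to the full 64K codebook this is
    # plain sampling from the guided distribution.
    k = min(top_k, logits.size(-1))
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    next_tok = topk_idx.gather(-1, torch.multinomial(probs, 1))
    return next_tok  # append to both the conditional and unconditional sequences
```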

Experimental Setup

All experiments run on 32 × NVIDIA A100 GPUs. Learning rates: 1e‑4 (pretraining), 2e‑5 (SFT), 1e‑5 (RL). Batch sizes: 256 (pretraining/SFT), 28 (RL). No warm‑up or LR decay. AdamW optimizer is used throughout.
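For reference, these stage‑wise settings could be captured in a small training config along the following lines (a sketch; the dataclass layout and field names are ours, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    lr: float
    batch_size: int
    optimizer: str = "adamw"
    warmup_steps: int = 0    # no warm-up
    lr_decay: bool = False   # constant learning rate

STAGES = {
    "pretrain": StageConfig(lr=1e-4, batch_size=256),
    "sft":      StageConfig(lr=2e-5, batch_size=256),
    "rl":       StageConfig(lr=1e-5, batch_size=28),
}
```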

Quantitative Results

SimpleAR (0.5 B) achieves GenEval 0.59 and DPG‑Bench 79.66, outperforming other sub‑1 B models and matching larger diffusion baselines that require separate text encoders. Scaling to 1.5 B improves GenEval by +0.04 and DPG‑Bench by +1.85, demonstrating predictable scaling similar to LLMs.

Figure: GenEval and DPG benchmark comparison

RL Ablation Study

Two reward modules were compared: CLIP‑ViT‑H‑14 and its fine‑tuned variant HPSv2. Both improve performance, but CLIP yields a larger GenEval gain (+0.6). Qualitative samples show better rendering of quantities and spatial relations when CLIP is used.

Figure: GRPO before/after generation results

Inference Speedup

On an A100 node with CFG enabled, KV‑Cache reduces inference time by 34 %. Adding vLLM further cuts the generation time for a 1024×1024 image to 13.55 s. Speculative Jacobi Decoding halves the number of decoding steps, offering a modest DPG boost but not reducing wall‑clock latency because it cannot be combined with KV‑Cache.

Figure: Inference speed comparison with KV Cache and vLLM

Qualitative Samples and Failure Cases

High‑fidelity examples demonstrate strong instruction alignment on complex prompts. Failure cases reveal limitations in generating intricate poses, objects, or text, and occasional violations of physical laws due to limited data and model size.

Figure: SimpleAR DPG prompt results
Figure: SimpleAR GenEval prompt results
Figure: SimpleAR failure cases

References

Aligning Text-to-Image Models using Human Feedback

Diffusion Model Alignment Using Direct Preference Optimization

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Paper: http://arxiv.org/pdf/2504.11455

Code: http://github.com/wdrink/SimpleAR
