VARGPT: A Unified Autoregressive Architecture for Multimodal Understanding and Generation
VARGPT is a multimodal large language model that unifies visual understanding and autoregressive image generation in a single architecture. It extends LLaVA with next-token prediction for understanding and next-scale prediction for generation, is trained through a three-stage, data-curated pipeline, and achieves superior performance on numerous vision-language benchmarks.
Overview
Recent advances in multimodal artificial intelligence have dramatically improved both visual understanding and generation. Multimodal large language models (MLLMs) inherit the generality of large language models (LLMs) for understanding, while denoising diffusion probabilistic models (DDPMs) have pushed image synthesis forward. Inspired by the scaling laws of autoregressive LLMs, researchers have explored next-token or next-scale prediction for visual generation (e.g., Emu3, VAR, LlamaGen, HART, Infinity). Building on these successes, the community is now designing unified architectures that can handle both tasks.
Introduction of VARGPT
We propose VARGPT, a multimodal LLM that unifies visual understanding and generation in a single autoregressive framework. VARGPT follows a next-token prediction paradigm for understanding and a next-scale prediction paradigm for visual autoregressive generation. By extending the LLaVA architecture, VARGPT efficiently supports scale-autoregressive visual generation while seamlessly handling mixed-modality inputs and outputs within one model.
Model Architecture
The VARGPT framework (see Figure 4) consists of the following components (a minimal structural sketch follows the list):
A large language model backbone with causal attention.
A visual encoder (CLIP ViT-L/14) and a two-layer linear projector for visual understanding.
A visual decoder (30‑layer Transformer) and dual generation projectors for image synthesis.
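To give a concrete, if simplified, picture of how these pieces fit together, the composition could be sketched roughly as follows. The module classes, attribute names, and dimensions here are assumptions for illustration, not the released implementation.

```python
import torch.nn as nn

class VARGPTSketch(nn.Module):
    """Minimal structural sketch of the components listed above.

    The submodules passed in (llm, clip_vit, visual_decoder) and the hidden
    sizes are placeholders for illustration, not the released code.
    """
    def __init__(self, llm, clip_vit, visual_decoder, vis_dim=1024, hidden=4096):
        super().__init__()
        self.llm = llm                        # causal-attention LLM backbone
        self.vision_encoder = clip_vit        # CLIP ViT encoder for understanding
        # two-layer projector mapping vision features into the LLM embedding space
        self.understand_proj = nn.Sequential(
            nn.Linear(vis_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.visual_decoder = visual_decoder  # 30-layer Transformer for generation
        # dual projectors bridging LLM hidden states and the visual decoder
        self.gen_proj_in = nn.Linear(hidden, hidden)
        self.gen_proj_out = nn.Linear(hidden, hidden)
```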
For visual understanding, images are encoded, projected, and aligned with text embeddings before being fed to the LLM for next‑token prediction. For visual generation, a multi‑scale image tokenizer provides visual tokens; VARGPT predicts the next scale of these tokens using a block‑causal attention mechanism, then decodes them with a VAE to produce images.
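To make the block-causal attention pattern concrete, here is a minimal sketch of an additive attention mask over multi-scale token groups: tokens attend to everything within their own scale and all earlier (coarser) scales, but not to later ones. The scale sizes are illustrative and do not reflect VARGPT's actual token schedule.

```python
import torch

def block_causal_mask(scale_sizes):
    """Additive attention mask for multi-scale visual tokens.

    scale_sizes is a hypothetical list of token counts per scale,
    e.g. [1, 4, 9] for 1x1, 2x2, and 3x3 token grids.
    """
    total = sum(scale_sizes)
    mask = torch.full((total, total), float("-inf"))
    start = 0
    for size in scale_sizes:
        end = start + size
        # every token in this scale sees all tokens up to the end of its scale
        mask[start:end, :end] = 0.0
        start = end
    return mask

# Example: three scales of 1, 4, and 9 tokens
print(block_causal_mask([1, 4, 9]).shape)  # torch.Size([14, 14])
```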
Special tokens such as <image_gen>, <image_gen_start>, and <image_gen_end> demarcate image‑generation regions in the output sequence.
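A simplified illustration of how these special tokens could delimit a generation region inside a mixed output sequence; the token names follow the article, while the surrounding sequence, the number of placeholders, and the slicing logic are purely illustrative.

```python
# Hypothetical mixed-modality output: text tokens interleaved with a
# generation region delimited by the special tokens named above.
output_tokens = [
    "Here", "is", "the", "image", "you", "asked", "for", ":",
    "<image_gen_start>",
    # <image_gen> placeholders stand in for multi-scale visual tokens,
    # which the VAE decoder later turns into pixels
    "<image_gen>", "<image_gen>", "<image_gen>",
    "<image_gen_end>",
]

# Slice out the visual-token span for the image decoder
start = output_tokens.index("<image_gen_start>") + 1
end = output_tokens.index("<image_gen_end>")
visual_tokens = output_tokens[start:end]
print(visual_tokens)  # ['<image_gen>', '<image_gen>', '<image_gen>']
```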
Training Procedure
Stage 1: Pre‑training
Using ImageNet images, we construct 1.28 M single‑turn dialogue samples to pre‑train the two generation projectors while freezing all other parameters (see Figure 5).
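In training-loop terms, Stage 1 amounts to freezing every parameter except the two generation projectors, roughly as sketched below. The attribute names follow the hypothetical composition sketch above, not the released code.

```python
# Stage 1 sketch: train only the two generation projectors, freeze the rest.
def set_stage1_trainable(model):
    for p in model.parameters():
        p.requires_grad = False
    for module in (model.gen_proj_in, model.gen_proj_out):
        for p in module.parameters():
            p.requires_grad = True
```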
Stage 2: Supervised Fine‑tuning for Visual Understanding
We unfreeze the language model and the visual‑encoding projector, then train on a curated multimodal instruction dataset (including LLaVA‑1.5, LLaVA‑OneVision, and ImageNet‑Instruct‑130K). This stage aligns visual understanding with instruction following.
Stage 3: Supervised Fine‑tuning for Visual Generation
We unfreeze the visual decoder and both generation projectors, keeping other components frozen, and fine‑tune on image‑generation instruction pairs derived from ImageNet‑Instruct‑130K and a larger ImageNet‑Instruct‑1270K dataset.
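Taken together, the three stages can be summarized as a trainable-module schedule. The module names again mirror the hypothetical sketch above rather than the released implementation.

```python
# Which components are trainable at each stage (True = unfrozen),
# following the stage descriptions above; names are illustrative.
STAGE_TRAINABLE = {
    "stage1_pretrain":          {"llm": False, "understand_proj": False,
                                 "gen_proj_in": True,  "gen_proj_out": True,
                                 "visual_decoder": False},
    "stage2_sft_understanding": {"llm": True,  "understand_proj": True,
                                 "gen_proj_in": False, "gen_proj_out": False,
                                 "visual_decoder": False},
    "stage3_sft_generation":    {"llm": False, "understand_proj": False,
                                 "gen_proj_in": True,  "gen_proj_out": True,
                                 "visual_decoder": True},
}

def apply_stage(model, stage):
    for name, trainable in STAGE_TRAINABLE[stage].items():
        for p in getattr(model, name).parameters():
            p.requires_grad = trainable
```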
Unified Instruction‑Following Data
The three training phases use distinct data mixes. The image-generation instruction set (ImageNet-Instruct-130K) is built by prompting a large language model (DeepSeek-V3) with captions from ImageNet-1K-VL-Enriched to produce instruction-answer pairs. Samples are filtered and combined with existing multimodal instruction data to form the mixed-instruction corpus (see Figures 7-8).
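As a rough, hypothetical illustration, a single image-generation instruction sample might be assembled from an enriched caption along these lines; the field names, prompt wording, and file naming are assumptions, not the paper's actual templates.

```python
# Hypothetical construction of one ImageNet-Instruct-style sample from an
# enriched caption; the actual LLM prompts are not reproduced here.
def build_generation_sample(image_id, caption):
    user_turn = f"Please generate an image of the following scene: {caption}"
    assistant_turn = "<image_gen_start><image_gen><image_gen_end>"
    return {
        "id": image_id,
        "conversations": [
            {"from": "human", "value": user_turn},
            {"from": "gpt", "value": assistant_turn},
        ],
        "image": f"{image_id}.JPEG",  # target image the model learns to produce
    }

sample = build_generation_sample("n01440764_10026",
                                 "a tench swimming near the riverbed")
```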
Experiments and Results
All images are resized to 256×256. VARGPT uses the LLaVA-1.5-7B-hf backbone, a VAR-d30 visual decoder (~2 B parameters), and a multi-scale VQ-VAE for tokenization. Generation uses top-k = 900, top-p = 0.95, and a classifier-free guidance (CFG) scale of 1.5.
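For reference, combining classifier-free guidance with top-k/top-p filtering at a single sampling step typically looks like the following generic sketch; this is not VARGPT's exact sampler, and the guidance parametrization varies across implementations.

```python
import torch

def cfg_topk_topp_sample(cond_logits, uncond_logits, cfg=1.5, top_k=900, top_p=0.95):
    """Generic CFG + top-k/top-p sampling over a vocabulary of visual tokens."""
    # classifier-free guidance: push conditional logits away from unconditional
    logits = uncond_logits + cfg * (cond_logits - uncond_logits)

    # top-k: keep only the k largest logits
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = torch.softmax(topk_vals, dim=-1)

    # top-p (nucleus): keep the smallest prefix whose cumulative mass covers top_p
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p   # always keeps at least one token
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, 1)
    return topk_idx[sorted_idx[choice]]

# Example with a dummy 4096-entry visual codebook
token = cfg_topk_topp_sample(torch.randn(4096), torch.randn(4096))
```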
We evaluate on 11 multimodal benchmarks (MMBench‑dev, SEED‑bench, MMMU, POPE, MME, GQA, TextVQA, VQAv2, SciQA‑img, OKVQA, VizWizQA) and on a custom 50 k‑sample instruction‑to‑image set using CLIP score and FID.
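For the CLIP-score metric, a common recipe is the cosine similarity between CLIP image and text embeddings. The sketch below uses a stock OpenAI CLIP checkpoint via Hugging Face transformers, which may differ from the CLIP variant used in the paper's evaluation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP-score computation; the checkpoint choice is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, prompt):
    """Cosine similarity between a PIL image and a text prompt in CLIP space."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()  # value in [-1, 1]
```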
Compared with LLaVA-1.5, MiniGPT-4, InstructBLIP, Qwen-VL, and unified models such as Chameleon, SEED-LLaMA, Show-o, and VILA-U, VARGPT consistently outperforms them on visual-understanding metrics and achieves competitive generation quality despite training on far less image-generation data (1.28 M samples vs. 15-30 M for other models).
Qualitative examples (Figure 9) demonstrate VARGPT’s ability to output text and images within the same dialogue, confirming its unified capability.
Conclusion
VARGPT introduces a unified autoregressive framework that jointly tackles visual understanding and generation. By combining next‑token and next‑scale prediction, employing a three‑stage training pipeline, and leveraging carefully curated multimodal instruction data, VARGPT achieves state‑of‑the‑art performance on a wide range of vision‑language tasks, highlighting its potential to drive future research in unified multimodal AI.