Manzano: A Small 3B Multimodal Model That Unifies Image Understanding and Generation with SOTA Performance

Manzano introduces a hybrid vision tokenizer and a three‑stage training recipe that let a 3‑billion‑parameter multimodal LLM achieve state‑of‑the‑art results on both image‑understanding benchmarks and text‑to‑image generation, while scaling smoothly to larger sizes and minimizing task conflict.


Key Highlights

Hybrid tokenizer design: a shared visual encoder plus two lightweight adapters produce both continuous embeddings for understanding and discrete tokens for generation, reducing the conflict between the two tasks.

Unified and scalable training recipe: a three-stage strategy (pre-training, continued pre-training, supervised fine-tuning) that jointly learns from mixed image-text data and supports easy scaling.

Strong competitiveness and scalability: a 3B model reaches SOTA on many benchmarks; scaling experiments show consistent gains as parameters increase, especially on text-rich tasks and image-structure fidelity.

Problem Statement

Existing open-source multimodal LLMs suffer from a performance trade-off between image understanding (which favors continuous representations) and image generation (which favors discrete tokens). This representation clash degrades performance, especially on text-dense tasks, and current solutions such as dual tokenizers or mixture-of-experts designs are parameter-inefficient, architecturally complex, or hard to scale.

Proposed Solution: Manzano

Manzano is a unified multimodal framework built around two core innovations:

Hybrid image tokenizer: a shared vision transformer feeds two adapters, a continuous one for understanding and a discrete one (via FSQ) for generation, producing tokens that live in a common semantic space.

Carefully designed training strategy: a three-phase regimen (pre-training on large-scale text-only, interleaved image-text, image-to-text (IT), and text-to-image (TI) data; continued pre-training on higher-quality data; supervised fine-tuning with curated instruction data) that jointly optimizes understanding and generation.

Technical Components

Hybrid tokenizer: uses a ViT backbone, a spatial-to-channel (STC) compression layer, and separate MLP heads for the continuous and discrete adapters. The discrete branch employs finite-scalar quantization (FSQ) with a 64K codebook (see the sketch after this list).

Unified LLM decoder: a standard autoregressive language model that predicts both text and image tokens from a joint vocabulary.

Image decoder: a diffusion-based decoder (DiT-Air) that converts predicted image tokens into high-fidelity pixels.
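To make the tokenizer's two branches concrete, here is a minimal PyTorch sketch of a hybrid tokenizer: a shared ViT-style trunk, an STC-style token merge, a continuous MLP adapter, and a discrete adapter quantized with FSQ. The layer sizes, the merge factor, and the FSQ level decomposition (8·8·8·8·4·4 = 65,536 ≈ 64K codes) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HybridTokenizer(nn.Module):
    """Toy hybrid tokenizer: a shared ViT-style trunk feeds a continuous adapter
    (understanding) and a discrete FSQ adapter (generation). All sizes are assumptions."""
    def __init__(self, vit_dim=1024, llm_dim=2048, fsq_levels=(8, 8, 8, 8, 4, 4)):
        super().__init__()
        layer = nn.TransformerEncoderLayer(vit_dim, nhead=16, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=2)      # stand-in for the shared ViT
        self.stc = nn.Linear(4 * vit_dim, vit_dim)                  # spatial-to-channel compression
        self.cont_adapter = nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                          nn.Linear(llm_dim, llm_dim))
        self.disc_adapter = nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                          nn.Linear(llm_dim, len(fsq_levels)))
        self.register_buffer("levels", torch.tensor(fsq_levels))    # 8*8*8*8*4*4 = 65,536 codes

    def fsq(self, z):
        # finite scalar quantization: bound each dim with tanh, snap to its integer grid,
        # then fold the per-dim digits into a single code id in [0, 65_536)
        digits = torch.round((torch.tanh(z) + 1) / 2 * (self.levels - 1))
        strides = torch.cumprod(torch.cat([torch.ones(1), self.levels[:-1].float()]), dim=0)
        return (digits * strides).sum(dim=-1).long()

    def forward(self, patches):                                     # patches: (B, N, vit_dim)
        feats = self.vit(patches)
        b, n, d = feats.shape
        feats = self.stc(feats.reshape(b, n // 4, 4 * d))           # merge groups of 4 tokens
        return self.cont_adapter(feats), self.fsq(self.disc_adapter(feats))

tok = HybridTokenizer()
cont_emb, disc_ids = tok(torch.randn(1, 256, 1024))
print(cont_emb.shape, disc_ids.shape, int(disc_ids.max()))          # (1, 64, 2048), (1, 64), < 65_536
```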

Architecture Overview

The hybrid tokenizer encodes an input image, producing continuous embeddings for image-to-text (understanding) tasks and discrete token IDs for text-to-image (generation) tasks. These are fed to the unified LLM, which predicts the next token in an autoregressive fashion. For generation, the predicted image token sequence is passed to the diffusion decoder to render the final image.
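A rough sketch of how this generation path could look with a joint text-plus-image vocabulary and autoregressive decoding is shown below. The vocabulary sizes, the stand-in `toy_llm`, and the greedy decoding are illustrative assumptions rather than the paper's implementation.

```python
import torch

# Assumed sizes: a text vocabulary plus the 64K image codes form one joint vocabulary;
# image code k is mapped to id TEXT_VOCAB + k.
TEXT_VOCAB, IMAGE_VOCAB = 32_000, 65_536

def generate_image_tokens(llm, prompt_ids, num_image_tokens=16):
    """Greedy T2I decoding sketch: `llm` maps a token sequence to next-token logits
    over the joint vocabulary; sampling is restricted to the image sub-vocabulary."""
    seq = prompt_ids
    for _ in range(num_image_tokens):
        logits = llm(seq)[:, -1, :].clone()
        logits[:, :TEXT_VOCAB] = float("-inf")        # only image tokens may be emitted here
        seq = torch.cat([seq, logits.argmax(-1, keepdim=True)], dim=1)
    return seq[:, -num_image_tokens:] - TEXT_VOCAB    # code ids handed to the diffusion decoder

# Stand-in LLM that returns random logits, just to show the call pattern.
toy_llm = lambda ids: torch.randn(ids.size(0), ids.size(1), TEXT_VOCAB + IMAGE_VOCAB)
codes = generate_image_tokens(toy_llm, torch.zeros(1, 8, dtype=torch.long))
print(codes.shape, int(codes.min()) >= 0)             # (1, 16), True
```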

Training Procedure

Data Mix

Training data are mixed across three stages:

Pre-training: large-scale pure text, interleaved image-text, image-to-text (IT), and text-to-image (TI) corpora.

Continued pre-training: 24M high-quality, capability-oriented samples covering documents, charts, OCR, knowledge, reasoning, and synthetic descriptions.

Supervised fine-tuning (SFT): curated instruction data for both understanding (75% image-text, 25% pure text) and generation (real and synthetic text-image pairs, including 9 × 10⁴ pairs from DALLE-3, BLIP-3o, and ShareGPT-4o, and 4 × 10⁶ pairs generated via Flux-1-schnell).

Training Steps

Pre-training: 1.6T tokens for the 30B model (0.8T for 300M), using a 40/40/20 mix of text, IT, and TI data.

Continued pre-training: an additional 83B tokens.

SFT: data mixed 41/45/14 (understanding/generation/text-instruction) with the same token-level weighting (text loss : image loss = 1 : 0.5); see the configuration sketch after this list.
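The stage-wise recipe can be summarized in a small configuration sketch. The ratios and token budgets below are the ones quoted above; the field names and structure are purely illustrative.

```python
# Summary of the three-stage recipe using the figures quoted above; keys are illustrative.
TRAINING_RECIPE = {
    "pre_training": {
        "tokens": "1.6T (30B model) / 0.8T (300M)",
        "mix": {"text": 0.40, "image_to_text": 0.40, "text_to_image": 0.20},
    },
    "continued_pre_training": {
        "tokens": "83B",
        "data": "24M capability-oriented samples (documents, charts, OCR, reasoning, ...)",
    },
    "sft": {
        "mix": {"understanding": 0.41, "generation": 0.45, "text_instruction": 0.14},
    },
    "loss_weights": {"text": 1.0, "image": 0.5},   # token-level weighting (text : image = 1 : 0.5)
}
```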

Tokenizer Training

A small 300M LLM decoder is attached to the shared visual encoder; for each sample a random adapter (continuous or discrete) is selected, and the LLM is trained with next‑token prediction. After convergence the small decoder is discarded, leaving the hybrid tokenizer for the full model.
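A minimal sketch of that alternating scheme follows, using a toy stand-in for the small decoder and fake tokenizer outputs; the real recipe attaches an approximately 300M transformer decoder, so everything below is an assumption for illustration only.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Toy stand-in for the small LLM decoder attached only for tokenizer training."""
    def __init__(self, dim=256, text_vocab=1_000, image_vocab=65_536):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, dim)
        self.image_embed = nn.Embedding(image_vocab, dim)   # lookup for discrete code ids
        self.rnn = nn.GRU(dim, dim, batch_first=True)       # toy sequence model, not a transformer
        self.head = nn.Linear(dim, text_vocab)

    def forward(self, visual, caption_ids):
        seq = torch.cat([visual, self.text_embed(caption_ids)], dim=1)
        out, _ = self.rnn(seq)
        return self.head(out[:, visual.size(1):])           # logits for the caption positions

def tokenizer_training_step(tokenizer_outputs, decoder, caption_ids):
    """Per batch, randomly pick the continuous or the discrete adapter branch and train
    with next-token prediction on the caption, as described above."""
    cont_emb, disc_ids = tokenizer_outputs
    visual = cont_emb if random.random() < 0.5 else decoder.image_embed(disc_ids)
    logits = decoder(visual, caption_ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), caption_ids[:, 1:].reshape(-1))

# Fake tokenizer outputs and captions, just to exercise the step once.
loss = tokenizer_training_step(
    (torch.randn(2, 64, 256), torch.randint(0, 65_536, (2, 64))),
    TinyDecoder(), torch.randint(0, 1_000, (2, 20)))
print(float(loss))
```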

Unified LLM Training

The visual encoder and discrete adapter are frozen; the LLM's embedding table is expanded with the 64K image tokens. Loss weighting balances the text and image objectives (1 : 0.5). Training proceeds through the three stages with the data mixes listed above.
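A minimal sketch of the two mechanical pieces mentioned here, vocabulary expansion and the 1 : 0.5 loss weighting; the sizes and the convention that image ids sit above the text range are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB, DIM = 32_000, 65_536, 2_048   # assumed sizes

# Expand the embedding table: keep pretrained text rows, append fresh rows for image codes.
old_embed = nn.Embedding(TEXT_VOCAB, DIM)
new_embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, DIM)
with torch.no_grad():
    new_embed.weight[:TEXT_VOCAB] = old_embed.weight

def weighted_next_token_loss(logits, targets, text_w=1.0, image_w=0.5):
    """Cross-entropy where image-token positions contribute half as much as text positions."""
    flat_targets = targets.reshape(-1)
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)), flat_targets,
                                reduction="none")
    weights = torch.full_like(per_token, text_w)
    weights[flat_targets >= TEXT_VOCAB] = image_w        # image ids live above the text range
    return (per_token * weights).sum() / weights.sum()

logits = torch.randn(2, 10, TEXT_VOCAB + IMAGE_VOCAB)
targets = torch.randint(0, TEXT_VOCAB + IMAGE_VOCAB, (2, 10))
print(float(weighted_next_token_loss(logits, targets)))
```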

Image Decoder Training

The diffusion decoder follows a progressive resolution schedule (starting at low resolution, then fine-tuning at higher resolutions), with 400k steps at the base resolution and 100k steps at each higher resolution.
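As a concrete example of such a schedule: the step counts are the ones quoted above, while the resolutions are placeholders rather than the paper's values.

```python
# Placeholder resolutions; the step counts are the ones quoted above.
schedule = [(256, 400_000), (512, 100_000), (1024, 100_000)]
for resolution, steps in schedule:
    print(f"train diffusion decoder at {resolution}px for {steps:,} steps")
```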

Experiments

Evaluation

Understanding is measured on VQA (SeedBench, RealWorldQA, MMBench), knowledge & reasoning (AI2D, ScienceQA, MMMU, MathVista), and text‑dense document/chart tasks (ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench). Generation is evaluated with automatic metrics (GenEval, DPG‑Bench, WISE) and a human study of 800 challenging prompts scored on structure, instruction adherence, and aesthetics.

Understanding‑Generation Interaction

Two ablations compare tokenizer strategies: (i) pure discrete (replacing continuous features with discrete tokens) and (ii) dual‑encoder (separate encoders for understanding and generation). Results show the hybrid tokenizer incurs the smallest task conflict and outperforms both baselines across all tasks.

Unified vs. Single‑Task

Manzano is compared with models trained exclusively for understanding or generation. Even at 300M scale, the unified model matches or exceeds single‑task baselines, demonstrating that the hybrid tokenizer successfully unifies perception and generation without trade‑offs.

Scaling Behavior

LLM scaling: increasing the decoder from 300M to 30B yields monotonic gains on all understanding and generation metrics (e.g., +14.2 on general VQA, +18.8 on knowledge benchmarks, and +12.0 on WISE for the 3B→30B jump).

Image decoder scaling: larger decoders improve structural integrity (+9.9) while keeping instruction adherence stable; aesthetic quality shows a slight decline.

Comparison with Other Models

Manzano is benchmarked against state‑of‑the‑art unified models (Janus‑Pro, X‑Omni, Bagel) and dedicated specialist models. Across almost all understanding tasks—including knowledge, general VQA, and text‑dense benchmarks—the 3B Manzano matches or surpasses larger unified models and often beats dedicated specialists. On generation, Manzano achieves SOTA on GenEval and WISE, with the 30B version further improving image quality.

Image Editing Extension

By feeding a reference image to both the LLM and the diffusion decoder, Manzano can perform precise instruction‑guided editing, style transfer, inpainting, out‑painting, and depth estimation, demonstrating pixel‑level control while preserving semantic consistency.

Conclusion

Manzano combines a hybrid vision tokenizer with a unified autoregressive backbone and a lightweight diffusion decoder, delivering SOTA performance on both multimodal understanding and text‑to‑image generation. The three‑stage training pipeline, minimal task interference, and strong scaling properties make it a compelling baseline for future research in unified multimodal AI.

References

[1] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
