How OneCAT Redefines Multimodal AI with a Decoder‑Only Architecture
OneCAT introduces a unified decoder‑only transformer that eliminates separate visual encoders, employs a modality‑specific MoE, integrates multi‑scale visual generation, and achieves state‑of‑the‑art performance and efficiency across multimodal understanding, text‑to‑image synthesis, and image editing tasks.
Motivation
Typical multimodal systems are built as pipelines where separate models handle image understanding, image generation and image editing. This modular design introduces two major drawbacks: (1) the need to pass data between modules creates latency, and (2) visual information is compressed by an encoder before downstream processing, causing loss of fine‑grained details. OneCAT addresses both issues by proposing a fully unified decoder‑only architecture that processes language and vision jointly.
Architecture
OneCAT is built on the Qwen2.5 large language model (LLM) and adopts a pure decoder‑only, auto‑regressive transformer. Raw images are first projected to visual tokens by a lightweight Patch Embedding layer (a 14×14 convolution followed by pixel‑unshuffle and two MLP layers that align the channel dimension with the LLM hidden size). The resulting continuous visual tokens are concatenated with text tokens and fed into the same transformer.
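The description above pins down the tokenization path fairly precisely. Below is a minimal PyTorch sketch of that path; the 14×14 convolution, pixel-unshuffle, and two-layer MLP follow the text, while the channel dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of OneCAT's lightweight patch embedding: a 14x14 conv,
    pixel-unshuffle, then a two-layer MLP into the LLM hidden size.
    conv_dim and llm_dim are assumptions for demonstration."""
    def __init__(self, in_ch=3, conv_dim=1024, unshuffle=2, llm_dim=2048):
        super().__init__()
        # Non-overlapping 14x14 patches -> conv_dim channels
        self.proj = nn.Conv2d(in_ch, conv_dim, kernel_size=14, stride=14)
        # Pixel-unshuffle folds each 2x2 spatial block into channels (4x fewer tokens)
        self.unshuffle = nn.PixelUnshuffle(unshuffle)
        self.mlp = nn.Sequential(
            nn.Linear(conv_dim * unshuffle**2, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.unshuffle(self.proj(x))  # (B, conv_dim*4, H/28, W/28)
        x = x.flatten(2).transpose(1, 2)  # (B, N, conv_dim*4)
        return self.mlp(x)                # continuous visual tokens for the LLM
```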
Unified decoder-only model: No separate visual encoder (ViT) or VAE tokenizer is required; image generation is performed directly by the LLM.
Joint NLL loss: A single negative log-likelihood loss is used for both text and image generation. Text uses standard next-token prediction (NTP), while images use a novel next-scale prediction (NSP) that enables multi-scale autoregressive synthesis.
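A hedged sketch of how such a joint objective could be computed over a mixed token sequence; the modality mask, target layout, and equal weighting of the two terms are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def joint_nll_loss(logits, targets, is_image_token):
    """Single NLL objective over a mixed sequence: next-token prediction
    for text positions and next-scale prediction targets for image
    positions. logits: (B, T, V), already shifted so logits[:, t]
    predicts targets[:, t]; is_image_token: (B, T) bool mask."""
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view_as(targets)
    text_loss = nll[~is_image_token].mean()
    image_loss = nll[is_image_token].mean()
    # Equal weighting is an assumption; the paper only states one joint NLL.
    return text_loss + image_loss
```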
Modality Mixture‑of‑Experts (MoE)
Each transformer block’s feed‑forward network (FFN) is expanded into three modality‑specific experts:
Text.FFN: Processes text tokens for language understanding and generation.
Und.FFN (Understanding): Processes continuous visual tokens to extract visual features.
Gen.FFN (Generation): Generates discrete visual tokens for image synthesis.
A hard routing mechanism directs tokens to the appropriate expert while sharing the QKV and attention layers, ensuring efficient cross‑modal alignment.
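A simplified sketch of one transformer block with this layout. The shared attention and three modality experts follow the text; the dispatch logic, dimensions, and omission of layer norms are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_ffn(dim, hidden=8192):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class ModalityMoEBlock(nn.Module):
    """Sketch of a block with shared attention (shared QKV) and three
    modality-specific FFN experts selected by hard routing.
    Layer norms are omitted for brevity."""
    def __init__(self, dim=2048, heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.experts = nn.ModuleDict({
            "text": make_ffn(dim),  # text tokens
            "und":  make_ffn(dim),  # continuous visual tokens (understanding)
            "gen":  make_ffn(dim),  # discrete visual tokens (generation)
        })

    def forward(self, x, modality, attn_mask=None):
        # modality: (B, T) long tensor with 0=text, 1=und, 2=gen
        h, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = x + h
        out = torch.zeros_like(x)
        for idx, name in enumerate(["text", "und", "gen"]):
            routed = modality == idx  # hard routing: each token sees exactly one expert
            out[routed] = self.experts[name](x[routed])
        return x + out
```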
Multi‑Scale Visual Generation
OneCAT embeds a Visual AutoRegressive (VAR) module that predicts image tokens from coarse to fine scales. A Scale‑Aware Adapter (SAA) adds a low‑rank adapter for each scale inside Gen.FFN, allowing the model to treat tokens of different resolutions distinctly.
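One plausible reading of the Scale-Aware Adapter is a LoRA-style low-rank branch selected by scale index and added to the Gen.FFN path; the sketch below is an assumption about the exact wiring, with rank and scale count chosen for illustration.

```python
import torch
import torch.nn as nn

class ScaleAwareAdapter(nn.Module):
    """Sketch: one low-rank adapter per generation scale, applied to
    tokens of that scale inside Gen.FFN. Rank and num_scales are
    illustrative assumptions."""
    def __init__(self, dim=2048, rank=32, num_scales=8):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(dim, rank, bias=False) for _ in range(num_scales))
        self.up = nn.ModuleList(nn.Linear(rank, dim, bias=False) for _ in range(num_scales))

    def forward(self, x, scale_idx):
        # x: tokens belonging to one scale; scale_idx selects its adapter
        return x + self.up[scale_idx](self.down[scale_idx](x))
```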
Flexible Multimodal Attention
Text tokens use causal (autoregressive) attention.
Continuous visual tokens use bidirectional attention to capture global context.
Discrete multi‑scale visual tokens use block‑wise causal attention: tokens within the same scale can attend to each other, while cross‑scale attention respects causality.
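A sketch of how such a mixed attention mask might be assembled, following PyTorch's boolean convention (True = attention blocked). The segment layout is an assumption, and the continuous-visual rule is simplified to a single image segment.

```python
import torch

def build_mixed_mask(seg_types, scale_ids):
    """seg_types: (T,) with 0=text, 1=continuous visual, 2=discrete visual.
    scale_ids: (T,) scale index for discrete visual tokens (-1 elsewhere).
    Returns a (T, T) bool mask where True blocks attention."""
    T = seg_types.size(0)
    q = torch.arange(T).unsqueeze(1)  # query positions
    k = torch.arange(T).unsqueeze(0)  # key positions
    allow = k <= q                    # default: causal attention (text)
    # Continuous visual tokens attend bidirectionally among themselves
    # (simplification: assumes one image segment).
    cont = seg_types == 1
    allow |= cont.unsqueeze(1) & cont.unsqueeze(0)
    # Discrete tokens: full attention within a scale; cross-scale stays causal.
    disc = seg_types == 2
    same_scale = scale_ids.unsqueeze(1) == scale_ids.unsqueeze(0)
    allow |= disc.unsqueeze(1) & disc.unsqueeze(0) & same_scale
    return ~allow
```

The resulting mask can be passed as the `attn_mask` argument of a block like the one sketched in the MoE section above.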
Training Strategy
Training proceeds in three stages:
Expert warm-up: An MLLM teacher (Qwen2.5 + InternViT) is first trained on ~10 M image-text pairs and then frozen. Its visual knowledge is distilled into OneCAT's Patch Embedding and Und.FFN using 436 M image-text pairs, while Gen.FFN is pre-trained on 52 M text-to-image pairs. During this stage, Text.FFN and the shared QKV and attention layers remain frozen.
Joint multimodal fine-tuning: All parameters are unlocked and the model is trained on ~70 M multimodal instruction samples, 60 M image-generation samples, and 40 M pure-text samples.
High-quality refinement: A final fine-tune uses ~11 M multimodal instructions, 3 M image-generation samples, and 2 M text samples to boost overall performance.
Performance Evaluation
Understanding : OneCAT‑3B achieves state‑of‑the‑art scores on OCR‑focused benchmarks (AI2D 77.8, ChartQA 81.2, InfoVQA 64.8, DocVQA 91.2) and on general multimodal tasks (MME‑S 2051, MMBench‑en 78.8, MM‑Vet 52.2, MathVista 61.7).
Text‑to‑Image Generation : Without any prompt rewriting, OneCAT‑3B scores 0.90 on GenEval and 84.53 on DPG‑Bench, surpassing Janus‑Pro‑7B and Tar‑7B.
Image Editing : On ImgEdit‑Bench the model obtains an overall score of 3.43, leading in background replacement, style transfer and attribute adjustment.
Compared with encoder‑based multimodal models, OneCAT uses fewer activated parameters while delivering comparable or superior accuracy.
Efficiency Analysis
Removing the visual encoder and integrating NSP reduces first-token latency dramatically. For a 1792×1792 input, latency drops from 0.583 s (baseline) to 0.225 s, a 61 % reduction. Image synthesis takes 1.40 s at 512×512 and 2.85 s at 1024×1024, roughly one sixth to one ninth of the time required by BAGEL-7B. Generation time scales almost linearly with resolution, indicating strong scalability.
Key Resources
Project page: https://onecat-ai.github.io/
Code repository: https://github.com/onecat-ai/onecat
Model on HuggingFace: https://huggingface.co/onecat-ai/OneCAT-3B
Code example
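This summary does not document a public inference API, so the following is only a hedged loading sketch assuming a standard Hugging Face `transformers` remote-code integration; consult the GitHub repository and model card for the actual interface.

```python
# Hypothetical usage sketch -- the real API may differ; see the
# onecat-ai GitHub repo and HuggingFace model card for specifics.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("onecat-ai/OneCAT-3B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("onecat-ai/OneCAT-3B", trust_remote_code=True)
```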