How Orthus Achieves Lossless Multimodal Generation with a Unified Autoregressive Transformer

Orthus, a new unified multimodal model presented at ICML 2025, leverages an autoregressive Transformer backbone with separate language and diffusion heads to enable lossless image‑text interleaved generation, outperforming existing models on both understanding and generation benchmarks while remaining computationally efficient.


Research Background

Unified multimodal generation and understanding is a hot research topic. At ICML 2025, Kuaishou and Shanghai Jiao Tong University introduced Orthus, a lossless image‑text interleaved generation paradigm based on an autoregressive Transformer.

Key Contributions

A unified autoregressive Transformer backbone shared across modalities.

Joint handling of discrete text tokens and continuous image features, avoiding lossy vector quantization of images.

A separate language head and diffusion MLP head for text and image generation, respectively.

Competitive performance with modest training compute.

Model Architecture

Orthus consists of a tokenizer, a visual encoder, modality‑specific embedding modules, a shared Transformer backbone, and two modality‑specific heads. The backbone models intra‑modal (text‑text) and cross‑modal (text‑image) dependencies. The language head predicts text tokens, while the diffusion head predicts continuous image features.

Model diagram
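The routing described above can be sketched in a few lines. This is a toy illustration, not Orthus's implementation: the backbone is reduced to a single linear map, and all names, weight shapes, and sizes (`VOCAB`, `D`) are assumptions made for the sketch. What it shows is the key structural idea: discrete text tokens and continuous image features are embedded into one sequence, processed by a shared backbone, and then read out by two modality-specific heads.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 100, 16  # hypothetical vocabulary size and hidden width

W_embed = rng.normal(size=(VOCAB, D)) * 0.02  # text token embedding table
W_img   = rng.normal(size=(D, D)) * 0.02      # projects continuous image features
W_bb    = rng.normal(size=(D, D)) * 0.02      # stand-in for the shared Transformer
W_lang  = rng.normal(size=(D, VOCAB)) * 0.02  # language head -> next-token logits
W_diff  = rng.normal(size=(D, D)) * 0.02      # diffusion MLP head -> image features

def embed(item):
    # Discrete text tokens are looked up; continuous image features are projected.
    if isinstance(item, int):
        return W_embed[item]
    return item @ W_img

def forward(sequence):
    h = np.stack([embed(x) for x in sequence]) @ W_bb  # shared backbone
    text_logits = h @ W_lang   # language head output at every position
    img_pred    = h @ W_diff   # diffusion-head conditioning/prediction
    return text_logits, img_pred

# An interleaved sequence: text ids, one continuous image feature, more text.
seq = [3, 7, rng.normal(size=D), 42]
logits, img = forward(seq)
print(logits.shape, img.shape)  # (4, 100) (4, 16)
```

At inference time, which head's output is used at a position would depend on whether the next element to generate is text or image content; the sketch simply computes both.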

Training Strategy

Orthus replaces the vector-quantization step of pure autoregressive models with a soft, continuous alternative and adds a diffusion head trained with the standard diffusion (noise-prediction) loss. The backbone can thus focus on modeling cross-modal interactions, while the diffusion head reconstructs fine image details, improving training efficiency while preserving image fidelity.
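The "standard diffusion loss" mentioned above is the usual DDPM-style noise-prediction objective: add noise to a clean feature at a random timestep, then train the head to predict that noise conditioned on the backbone's hidden state. The sketch below is a minimal, hypothetical version; the schedule parameters, sizes, and the one-layer "head" are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
D, T = 16, 1000  # hypothetical feature width and number of diffusion steps

betas = np.linspace(1e-4, 0.02, T)     # linear DDPM-style noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative signal-retention factors

def diffusion_loss(x0, cond, eps_net):
    """Noise-prediction loss for a diffusion head conditioned on the
    backbone hidden state `cond` at the image position."""
    t = rng.integers(T)
    eps = rng.normal(size=x0.shape)
    # Forward process: noise the clean feature to timestep t.
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_pred = eps_net(x_t, t, cond)
    return float(np.mean((eps_pred - eps) ** 2))

# Toy "diffusion MLP head": one linear layer over [x_t, cond].
W = rng.normal(size=(2 * D, D)) * 0.05
head = lambda x_t, t, cond: np.concatenate([x_t, cond]) @ W

x0   = rng.normal(size=D)  # clean continuous image feature (e.g. one patch)
cond = rng.normal(size=D)  # backbone hidden state at that position
loss = diffusion_loss(x0, cond, head)
print(loss >= 0.0)
```

Because the loss depends only on the head and the conditioning vector, gradients flow into both the diffusion head and the shared backbone, which is what lets the backbone specialize in modality interactions while the head handles pixel-level detail.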

Experimental Results

Orthus surpasses Chameleon and Show‑o on multiple visual understanding benchmarks and outperforms the dedicated diffusion model SDXL on the GenEval text‑to‑image benchmark, despite modest training compute. It also demonstrates strong image editing, web‑page generation, and interleaved text‑image generation capabilities.

Generated image example
Interleaved text‑image generation

Conclusion

The study presents Orthus, a lossless multimodal model that unifies understanding and generation via a shared autoregressive Transformer and modality‑specific heads. By preserving continuous image representations and decoupling diffusion from the backbone, Orthus achieves state‑of‑the‑art performance across a range of multimodal tasks, paving the way for future unified multimodal research.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

text-to-image, image generation, diffusion models, AI research, autoregressive Transformer
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
