How Orthus Achieves Lossless Multimodal Generation with a Unified Autoregressive Transformer
Orthus, a new unified multimodal model presented at ICML 2025, leverages an autoregressive Transformer backbone with separate language and diffusion heads to enable lossless image‑text interleaved generation, outperforming existing models on both understanding and generation benchmarks while remaining computationally efficient.
Research Background
Unified multimodal generation and understanding has become a central research direction. At ICML 2025, Kuaishou and Shanghai Jiao Tong University introduced Orthus, a lossless image-text interleaved generation paradigm built on an autoregressive Transformer.
Key Contributions
A shared autoregressive Transformer backbone that models both discrete text tokens and continuous image features.
A separate language head for text generation and a diffusion MLP head for image generation.
Computational efficiency, since the heavy backbone is decoupled from the diffusion head.
Model Architecture
Orthus consists of a tokenizer, a visual encoder, modality‑specific embedding modules, a shared Transformer backbone, and two modality‑specific heads. The backbone models intra‑modal (text‑text) and cross‑modal (text‑image) dependencies. The language head predicts text tokens, while the diffusion head predicts continuous image features.
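The split between a shared backbone and two modality-specific heads can be sketched as below. This is a toy numpy illustration under assumed names and sizes (`backbone`, `language_head`, `diffusion_head`, and all dimensions are hypothetical, not Orthus's actual implementation); it only shows how one sequence of hidden states feeds a vocabulary-logit head for text positions and a continuous-feature head for image positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not Orthus's real sizes).
d_model, vocab_size, image_dim = 16, 100, 8

# Hypothetical stand-in for the shared Transformer backbone:
# here just a nonlinear projection producing one hidden state per position.
W_backbone = rng.normal(size=(d_model, d_model)) * 0.1
def backbone(inputs):              # inputs: (seq_len, d_model)
    return np.tanh(inputs @ W_backbone)   # (seq_len, d_model)

# Language head: linear map from hidden state to vocabulary logits.
W_lang = rng.normal(size=(d_model, vocab_size)) * 0.1
def language_head(h):
    return h @ W_lang              # (seq_len, vocab_size)

# Diffusion MLP head: predicts continuous image features from the
# hidden state at each image position.
W1 = rng.normal(size=(d_model, 32)) * 0.1
W2 = rng.normal(size=(32, image_dim)) * 0.1
def diffusion_head(h):
    return np.maximum(h @ W1, 0.0) @ W2   # (seq_len, image_dim)

# An interleaved sequence: every position gets a hidden state, and the
# appropriate head is applied depending on the modality being decoded.
seq = rng.normal(size=(6, d_model))
h = backbone(seq)
text_logits = language_head(h[:4])    # first 4 positions decode text
image_feats = diffusion_head(h[4:])   # last 2 positions decode image patches
print(text_logits.shape, image_feats.shape)  # (4, 100) (2, 8)
```

The design choice mirrored here is that the backbone sees one uniform sequence and never needs to know how a position will be decoded; only the output head differs per modality.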
Training Strategy
Orthus replaces the vector-quantization step of pure autoregressive models with a soft alternative and adds a diffusion head trained with the standard diffusion loss. This lets the backbone focus on modality interactions while the diffusion head reconstructs image details, improving efficiency and preserving image fidelity.
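The two ingredients above can be sketched concretely. This is a minimal numpy illustration under stated assumptions: the "soft alternative" is shown as a softmax-weighted mixture over codebook entries (one plausible reading of softening hard vector quantization), and `predict_noise` is a hypothetical placeholder for the diffusion MLP head; neither is Orthus's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- Soft alternative to vector quantization (sketch of the idea) ---
# Hard VQ snaps each patch feature to its single nearest codebook entry,
# discarding detail. A soft version mixes codebook entries by similarity,
# keeping the representation continuous (and differentiable).
codebook = rng.normal(size=(16, 8))     # 16 codes of dimension 8 (toy sizes)
patch = rng.normal(size=(8,))           # one continuous patch feature
sims = codebook @ patch                 # similarity to each code
soft_embed = softmax(sims) @ codebook   # convex mixture, still continuous

# --- Standard diffusion loss on image features (DDPM-style) ---
# The diffusion head is trained to predict the noise added to clean image
# features at a sampled timestep; `predict_noise` is a dummy stand-in for
# the MLP head conditioned on the backbone's hidden state.
def predict_noise(x_t, h):
    return np.zeros_like(x_t)           # placeholder predictor

alpha_bar = 0.5                         # cumulative noise-schedule value at t
x0 = soft_embed                         # clean image feature
eps = rng.normal(size=x0.shape)         # sampled Gaussian noise
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
h = rng.normal(size=x0.shape)           # backbone hidden state (toy)
loss = np.mean((predict_noise(x_t, h) - eps) ** 2)   # MSE diffusion loss
print(loss >= 0.0)
```

Because the image pathway never passes through a hard quantizer, no visual information is dropped before the backbone, which is the sense in which the pipeline is "lossless."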
Experimental Results
Orthus surpasses Chameleon and Show‑o on multiple visual understanding benchmarks and outperforms the dedicated diffusion model SDXL on the GenEval text‑to‑image metric, despite using minimal compute. It also demonstrates strong image editing, web‑page generation, and interleaved text‑image generation capabilities.
Conclusion
The study presents Orthus, a lossless multimodal model that unifies understanding and generation via a shared autoregressive Transformer and modality‑specific heads. By preserving continuous image representations and decoupling diffusion from the backbone, Orthus achieves state‑of‑the‑art performance across a range of multimodal tasks, paving the way for future unified multimodal research.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.