How Orthus Achieves Lossless Multimodal Generation with a Unified Autoregressive Transformer
Orthus, a new unified multimodal model presented at ICML 2025, leverages an autoregressive Transformer backbone with separate language and diffusion heads to enable lossless image‑text interleaved generation, outperforming existing models on both understanding and generation benchmarks while remaining computationally efficient.
Research Background
Unified multimodal generation and understanding has become a central research direction. At ICML 2025, Kuaishou and Shanghai Jiao Tong University introduced Orthus, a lossless image-text interleaved generation paradigm built on an autoregressive Transformer.
Key Contributions
A shared autoregressive Transformer backbone that models both discrete text tokens and continuous image features.
A separate language head for text generation and a diffusion MLP head for image generation.
Computational efficiency, since the heavy backbone is decoupled from the diffusion head.
Model Architecture
Orthus consists of a tokenizer, a visual encoder, modality‑specific embedding modules, a shared Transformer backbone, and two modality‑specific heads. The backbone models intra‑modal (text‑text) and cross‑modal (text‑image) dependencies. The language head predicts text tokens, while the diffusion head predicts continuous image features.
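The split between a shared backbone and two modality-specific heads can be sketched as below. This is a toy numpy illustration under assumed names and sizes (`backbone`, `language_head`, `diffusion_head`, and all dimensions are hypothetical, not Orthus's actual implementation); it only shows how one sequence of hidden states feeds a vocabulary-logit head for text positions and a continuous-feature head for image positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not Orthus's real sizes).
d_model, vocab_size, image_dim = 16, 100, 8

# Hypothetical stand-in for the shared Transformer backbone:
# here just a nonlinear projection producing one hidden state per position.
W_backbone = rng.normal(size=(d_model, d_model)) * 0.1
def backbone(inputs):              # inputs: (seq_len, d_model)
    return np.tanh(inputs @ W_backbone)   # (seq_len, d_model)

# Language head: linear map from hidden state to vocabulary logits.
W_lang = rng.normal(size=(d_model, vocab_size)) * 0.1
def language_head(h):
    return h @ W_lang              # (seq_len, vocab_size)

# Diffusion MLP head: predicts continuous image features from the
# hidden state at each image position.
W1 = rng.normal(size=(d_model, 32)) * 0.1
W2 = rng.normal(size=(32, image_dim)) * 0.1
def diffusion_head(h):
    return np.maximum(h @ W1, 0.0) @ W2   # (seq_len, image_dim)

# An interleaved sequence: every position gets a hidden state, and the
# appropriate head is applied depending on the modality being decoded.
seq = rng.normal(size=(6, d_model))
h = backbone(seq)
text_logits = language_head(h[:4])    # first 4 positions decode text
image_feats = diffusion_head(h[4:])   # last 2 positions decode image patches
print(text_logits.shape, image_feats.shape)  # (4, 100) (2, 8)
```

The design choice mirrored here is that the backbone sees one uniform sequence and never needs to know how a position will be decoded; only the output head differs per modality.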
Training Strategy
Orthus replaces the vector-quantization step of pure autoregressive models with a soft alternative and adds a diffusion head trained with the standard diffusion loss. This lets the backbone focus on modality interactions while the diffusion head reconstructs image details, improving efficiency and preserving image fidelity.
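The two ingredients above can be sketched concretely. This is a minimal numpy illustration under stated assumptions: the "soft alternative" is shown as a softmax-weighted mixture over codebook entries (one plausible reading of softening hard vector quantization), and `predict_noise` is a hypothetical placeholder for the diffusion MLP head; neither is Orthus's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# --- Soft alternative to vector quantization (sketch of the idea) ---
# Hard VQ snaps each patch feature to its single nearest codebook entry,
# discarding detail. A soft version mixes codebook entries by similarity,
# keeping the representation continuous (and differentiable).
codebook = rng.normal(size=(16, 8))     # 16 codes of dimension 8 (toy sizes)
patch = rng.normal(size=(8,))           # one continuous patch feature
sims = codebook @ patch                 # similarity to each code
soft_embed = softmax(sims) @ codebook   # convex mixture, still continuous

# --- Standard diffusion loss on image features (DDPM-style) ---
# The diffusion head is trained to predict the noise added to clean image
# features at a sampled timestep; `predict_noise` is a dummy stand-in for
# the MLP head conditioned on the backbone's hidden state.
def predict_noise(x_t, h):
    return np.zeros_like(x_t)           # placeholder predictor

alpha_bar = 0.5                         # cumulative noise-schedule value at t
x0 = soft_embed                         # clean image feature
eps = rng.normal(size=x0.shape)         # sampled Gaussian noise
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
h = rng.normal(size=x0.shape)           # backbone hidden state (toy)
loss = np.mean((predict_noise(x_t, h) - eps) ** 2)   # MSE diffusion loss
print(loss >= 0.0)
```

Because the image pathway never passes through a hard quantizer, no visual information is dropped before the backbone, which is the sense in which the pipeline is "lossless."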
Experimental Results
Orthus surpasses Chameleon and Show‑o on multiple visual understanding benchmarks and outperforms the dedicated diffusion model SDXL on the GenEval text‑to‑image metric, despite using minimal compute. It also demonstrates strong image editing, web‑page generation, and interleaved text‑image generation capabilities.
Conclusion
The study presents Orthus, a lossless multimodal model that unifies understanding and generation via a shared autoregressive Transformer and modality‑specific heads. By preserving continuous image representations and decoupling diffusion from the backbone, Orthus achieves state‑of‑the‑art performance across a range of multimodal tasks, paving the way for future unified multimodal research.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.