How T2I‑R1 Boosts Text‑to‑Image Generation with Dual‑Level CoT Reasoning
Recent large language models have shown strong reasoning abilities. This work extends chain‑of‑thought (CoT) reasoning to autoregressive image generation with T2I‑R1, a framework trained with reinforcement learning that combines two levels of reasoning: Semantic‑CoT for high‑level planning and Token‑CoT for low‑level token generation. The authors report gains over the baseline on the T2I‑CompBench and WISE benchmarks.
Background
Large language models such as OpenAI o1 and DeepSeek‑R1 achieve strong reasoning abilities through chain‑of‑thought (CoT) reasoning strengthened by reinforcement learning. Extending CoT to autoregressive image generation remains an open problem.
T2I‑R1 Overview
T2I‑R1 is a text‑to‑image generation model that incorporates a two‑level CoT framework—Semantic‑CoT and Token‑CoT—and jointly optimizes them with reinforcement learning.
Semantic‑CoT
Performs textual reasoning about the target image before generation.
Designs the global structure of the image, including object appearance and spatial layout.
Explicit planning of the prompt simplifies subsequent token generation.
Token‑CoT
Generates image tokens patch by patch, analogous to the step‑by‑step output of CoT in language models.
Focuses on low‑level details such as pixel synthesis and visual continuity between adjacent patches.
Optimizing Token‑CoT improves image quality and alignment with the prompt.
Unified CoT Framework (BiCoT‑GRPO)
Starting from the unified large multimodal model (ULM) Janus‑Pro, the authors train it to jointly optimize Semantic‑CoT and Token‑CoT within a single training loop:
Use the ULM to imagine and plan the image from the prompt, producing Semantic‑CoT.
Feed both the original prompt and the generated Semantic‑CoT back into the ULM to generate the image, yielding Token‑CoT.
Generate multiple Semantic‑CoT/Token‑CoT pairs for each prompt, compute relative rewards among the generated images, and apply Group Relative Policy Optimization (GRPO) to update both CoT levels simultaneously in one iteration.
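The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the callables `plan`, `render`, and `score` are hypothetical stand-ins for the ULM's text pass, its image pass, and the reward model, and the use of the population standard deviation is an assumption of this sketch.

```python
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sample relative to the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def rollout_group(
    prompt: str,
    plan: Callable[[str], str],             # hypothetical: prompt -> Semantic-CoT text
    render: Callable[[str, str], object],   # hypothetical: (prompt, Semantic-CoT) -> image
    score: Callable[[str, object], float],  # hypothetical: (prompt, image) -> reward
    group_size: int = 4,
) -> List[Tuple[str, object, float]]:
    """Sample a group of (Semantic-CoT, image) pairs and attach one advantage to each."""
    cots = [plan(prompt) for _ in range(group_size)]          # step 1: plan in text
    images = [render(prompt, cot) for cot in cots]            # step 2: generate the image
    rewards = [score(prompt, image) for image in images]      # step 3: reward each image
    advantages = group_relative_advantages(rewards)
    return list(zip(cots, images, advantages))
```

In a real training loop, each sample's advantage would presumably weight the policy-gradient term for both its Semantic‑CoT text tokens and its image tokens, which is how a single reward on the final image can update both CoT levels at once.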
Because image generation lacks a well‑defined reward function, the authors construct a reward model by ensembling several visual expert models. This ensemble provides multi‑aspect quality assessment and regularizes the ULM, preventing it from over‑fitting to any single reward model.
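One simple way to combine several experts is to average their scores. The sketch below assumes an unweighted mean over scores already scaled to a common range; the expert names and the `(prompt, image) -> float` signature are hypothetical, and the source does not specify how the experts are actually combined.

```python
from typing import Callable, Dict


def ensemble_reward(
    prompt: str,
    image: object,
    experts: Dict[str, Callable[[str, object], float]],
) -> float:
    """Average the scores of several reward experts (unweighted mean, an assumption)."""
    if not experts:
        raise ValueError("at least one expert is required")
    scores = [expert(prompt, image) for expert in experts.values()]
    return sum(scores) / len(scores)
```

Averaging means that an image must satisfy several judges at once to earn a high reward, so exploiting the blind spot of one expert yields only a limited gain.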
Experiments
Qualitative evaluation shows that T2I‑R1 better infers the true intent behind prompts, producing results that align more closely with human expectations and demonstrating increased robustness on unusual scenes.
Quantitatively, T2I‑R1 improves on the baseline by 13% on T2I‑CompBench and 19% on WISE, and surpasses the previous state‑of‑the‑art model FLUX.1 on several sub‑tasks.
Conclusion
T2I‑R1 is the first reinforcement‑learning‑driven, reasoning‑enhanced text‑to‑image model that unifies high‑level semantic planning and low‑level token generation through a dual‑level CoT framework, achieving both qualitative and quantitative gains.
Paper: https://arxiv.org/pdf/2505.00703
Code: https://github.com/CaraJ7/T2I-R1