How T2I‑R1 Boosts Text‑to‑Image Generation with Dual‑Level CoT Reasoning

Recent large language models have shown strong reasoning abilities. This work extends chain‑of‑thought (CoT) reasoning to autoregressive image generation with T2I‑R1, a dual‑level framework (Semantic‑CoT and Token‑CoT) trained with reinforcement learning. It unifies high‑level planning with low‑level token generation and reports gains of 13% on T2I‑CompBench and 19% on WISE over its baseline.

AI Frontier Lectures

Background

Large language models such as OpenAI o1 and DeepSeek‑R1 owe much of their reasoning ability to chain‑of‑thought (CoT) reasoning strengthened through reinforcement learning. How to extend CoT to autoregressive image generation remains an open problem.

T2I‑R1 Overview

T2I‑R1 is a text‑to‑image generation model that incorporates a two‑level CoT framework—Semantic‑CoT and Token‑CoT—and jointly optimizes them with reinforcement learning.

Semantic‑CoT

Performs textual reasoning about the target image before generation.

Designs the global structure of the image, including object appearance and spatial layout.

Explicit planning of the prompt simplifies subsequent token generation.

Token‑CoT

Generates image tokens patch‑by‑patch, analogous to the step‑by‑step token generation of CoT in language models.

Focuses on low‑level details such as pixel synthesis and visual continuity between adjacent patches.

Optimizing Token‑CoT improves image quality and alignment with the prompt.
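
The division of labor between the two levels can be sketched as two calls: one that writes a textual plan, and one that emits image tokens one patch at a time, conditioned on the prompt, the plan, and the tokens already produced. This is a minimal toy, not the model's implementation: `semantic_cot`, `token_cot`, the plan template, the patch count, and the vocabulary size are all hypothetical stand-ins.

```python
import random
from typing import List


def semantic_cot(prompt: str) -> str:
    """Hypothetical stand-in for the textual planning pass (Semantic-CoT)."""
    # A real model would write this plan itself; the template is illustrative only.
    return (
        f"Prompt: {prompt}. Plan: name each object, describe its appearance, "
        "then fix the spatial layout."
    )


def token_cot(prompt: str, plan: str, num_patches: int = 16,
              vocab_size: int = 1024, seed: int = 0) -> List[int]:
    """Hypothetical stand-in for patch-by-patch image-token generation (Token-CoT)."""
    rng = random.Random(seed)
    tokens: List[int] = []
    for _ in range(num_patches):
        # Toy "conditioning": the draw depends on the text and on earlier tokens.
        # A real model would sample from a learned distribution instead.
        context = len(prompt) + len(plan) + sum(tokens)
        tokens.append((context + rng.randrange(vocab_size)) % vocab_size)
    return tokens


prompt = "a red cube on top of a blue sphere"
plan = semantic_cot(prompt)              # high-level, textual
image_tokens = token_cot(prompt, plan)   # low-level, one token per patch
```

The point of the sketch is only the data flow: the plan is ordinary text, and every image token sees both that text and all earlier patches.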

Unified CoT Framework (BiCoT‑GRPO)

Starting from Janus‑Pro, a unified large multimodal model (ULM), we jointly optimize Semantic‑CoT and Token‑CoT within a single training loop:

Use the ULM to imagine and plan the image from the prompt, producing Semantic‑CoT.

Feed both the original prompt and the generated Semantic‑CoT back into the ULM to generate the image, yielding Token‑CoT.

Generate multiple Semantic‑CoT/Token‑CoT pairs for each prompt, score the resulting images, compute each image's reward relative to the others in its group, and apply Group Relative Policy Optimization (GRPO) to update both CoT levels in the same iteration.
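
The "relative reward" in the last step is usually computed in GRPO by standardizing each sample's reward against its own group. The sketch below shows that normalization only. The function name, the example rewards, and the choice of population standard deviation with a small `eps` are assumptions for illustration, not details confirmed by this article.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four Semantic-CoT/Token-CoT pairs from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive a positive advantage and are reinforced; those below are discouraged. If one advantage is applied to both the plan tokens and the image tokens of a sample, both CoT levels are updated from the same signal, which matches the single-iteration update described above.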

Because image generation lacks a well‑defined reward function, we construct a reward model by ensembling several visual expert models. This ensemble provides multi‑aspect quality assessment and regularizes the ULM to prevent over‑fitting to any single reward model.
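
One simple way to combine several expert scores is to average them, so that no single expert dominates the training signal. The sketch below assumes every expert returns a score already scaled to [0, 1] and that all experts are weighted equally. The expert names and the fixed scores are placeholders, not the models or weights used by T2I‑R1.

```python
from typing import Callable, Dict

# An expert maps (prompt, image) to a score; assumed here to lie in [0, 1].
Expert = Callable[[str, object], float]


def ensemble_reward(prompt: str, image: object, experts: Dict[str, Expert]) -> float:
    """Average the scores of all experts (equal weighting is an assumption)."""
    if not experts:
        raise ValueError("at least one expert is required")
    scores = [expert(prompt, image) for expert in experts.values()]
    return sum(scores) / len(scores)


# Placeholder experts returning fixed scores; real ones would inspect the image.
toy_experts: Dict[str, Expert] = {
    "aesthetic_quality": lambda prompt, image: 0.8,
    "object_presence": lambda prompt, image: 0.5,
    "prompt_alignment": lambda prompt, image: 0.2,
}

reward = ensemble_reward("a red cube on a blue sphere", None, toy_experts)
```

An image that exploits a weakness in one expert gains little if the others still score it poorly, which is the regularizing effect the paragraph above describes.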

Experiments

Qualitative evaluation shows that T2I‑R1 better infers the true intent behind prompts, producing results that align more closely with human expectations and demonstrating increased robustness on unusual scenes.

Quantitatively, T2I‑R1 outperforms its baseline by 13% on T2I‑CompBench and 19% on WISE, and surpasses the previous state‑of‑the‑art model FLUX.1 on several sub‑tasks.

Conclusion

T2I‑R1 is the first reinforcement‑learning‑driven, reasoning‑enhanced text‑to‑image model that unifies high‑level semantic planning and low‑level token generation through a dual‑level CoT framework, achieving both qualitative and quantitative gains.

Paper: https://arxiv.org/pdf/2505.00703
Code: https://github.com/CaraJ7/T2I-R1