Mind-Brush: ‘Think‑Research‑Create’ Intent Reasoning for Image Generation

Mind-Brush introduces a ‘think‑research‑create’ agentic framework that unifies intent analysis, multimodal evidence retrieval, and knowledge‑driven reasoning, turning text‑to‑image generation from static decoding into an active cognitive workflow. It achieves large accuracy gains on the new Mind‑Bench benchmark and surpasses existing SOTA models.

AIWalker

Problem

Current text‑to‑image models act as static decoders: they map explicit prompts to pixels without understanding implicit user intent or performing multi‑step knowledge reasoning. Pre‑training imposes a temporal cutoff, so models cannot incorporate real‑time facts, news, or emerging concepts, leading to hallucinations on out‑of‑distribution (OOD) entities.

Mind‑Brush Framework

The proposed solution is a training‑free, agentic workflow called “think‑research‑create”.

Think – Intent Analysis

An Intent Analysis Agent parses the user command (and optional reference image) into a structured 5W1H representation (What, When, Where, Why, Who, How). This representation becomes the initial Cognitive State (C), which also stores an evidence buffer for later retrievals.

Research – Active Retrieval & Reasoning

A Cognitive Gap Detector examines the 5W1H state, identifies missing factual entities or logical dependencies, and formulates a set of atomic sub‑problems.

External Knowledge Anchoring: A Cognition Search Agent generates precise text and visual queries, retrieves documents from open‑world knowledge bases, and injects the retrieved concepts back into the prompt and visual query.

Internal Logical Derivation: A Chain‑of‑Thought (CoT) Knowledge Reasoning Agent consumes the user instruction, optional image, and accumulated evidence, then performs multi‑step inference (e.g., solving a math problem or deducing spatial relations) to produce explicit conclusions.
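The split between the two execution paths can be sketched as a simple router. This is a hypothetical keyword heuristic standing in for the framework's LLM-based gap classification, purely to show the shape of the decision:

```python
def route_subproblem(subproblem: str) -> str:
    """Route an atomic sub-problem to an execution action (illustrative heuristic).

    Factual gaps (unknown entities, current events) go to external retrieval;
    logical gaps (math, spatial deduction) go to CoT reasoning. The real
    framework makes this choice with an LLM, not keyword matching.
    """
    factual_markers = ("who is", "what is the latest", "current", "news", "today")
    logical_markers = ("solve", "compute", "deduce", "how many", "if ")
    text = subproblem.lower()
    if any(m in text for m in factual_markers):
        return "search"   # Cognition Search Agent: external knowledge anchoring
    if any(m in text for m in logical_markers):
        return "reason"   # CoT Knowledge Reasoning Agent: internal derivation
    return "reason"       # default: attempt internal reasoning first
```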

Create – Constraint‑Guided Generation

A Concept Review Agent filters noisy evidence, merges verified facts with the original intent, and composes a master prompt. A Unified Image Generation Agent then synthesizes the final image, dynamically selecting between pure generation and editing modes based on the master prompt.
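As a rough sketch of the fusion step, the master prompt can be thought of as the intent plus its grounding evidence. The string template below is hypothetical; the actual Concept Review Agent uses an LLM to filter noise and fuse facts:

```python
def compose_master_prompt(intent: str, verified_facts: list[str]) -> str:
    """Merge the original intent with reviewed evidence into one master prompt.

    Illustrative formatting only: in the paper this fusion is done by an
    LLM agent, not a fixed template.
    """
    if not verified_facts:
        return intent
    facts = "; ".join(verified_facts)
    return f"{intent}. Grounding facts: {facts}."
```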

Formal Architecture

The workflow is modeled as a Hierarchical Sequential Decision‑Making Process.

Cognitive State (C): captures the user input, optional reference image, and the evidence buffer.

Action Space (A): divided into meta‑actions for gap detection and execution actions for retrieval or reasoning.

Execution Policy (π): the intent analysis module deterministically selects an execution path based on the identified gap, allowing the plan to adapt dynamically (e.g., factual grounding vs. logical reasoning).

The process iterates until a converged state containing the master prompt and verified visual references is reached, at which point the image is generated.
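The iterative process above can be sketched as a control loop. All five callables below are hypothetical stand-ins for the paper's agents, wired together only to show the meta-action/execution-action structure and the convergence condition:

```python
def mind_brush_loop(user_input, detect_gaps, search, reason, review, generate,
                    max_iters=5):
    """Iterate think-research-create until the cognitive state converges.

    detect_gaps/search/reason/review/generate are illustrative stand-ins for
    the Gap Detector, Search Agent, CoT Reasoning Agent, Concept Review
    Agent, and Unified Image Generation Agent, respectively.
    """
    state = {"intent": user_input, "evidence": []}
    for _ in range(max_iters):
        gaps = detect_gaps(state)          # meta-action: cognitive gap detection
        if not gaps:
            break                          # converged: no unresolved gaps remain
        for gap in gaps:                   # execution actions, chosen per gap
            if gap["kind"] == "factual":
                state["evidence"].append(search(gap))
            else:
                state["evidence"].append(reason(gap, state["evidence"]))
    master_prompt = review(state)          # filter evidence, compose master prompt
    return generate(master_prompt)         # unified generation / editing
```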

Mind‑Bench Benchmark

To evaluate “cognitive generation”, the authors built Mind‑Bench, a 500‑sample benchmark covering ten sub‑domains:

Knowledge‑driven tasks: real‑world events, weather, characters, objects, world knowledge – emphasizing OOD entity handling.

Reasoning‑driven tasks: life reasoning, geographic reasoning, mathematics, science, logic, poetry – requiring inference of implicit constraints.

Evaluation uses a Checklist‑based Strict Accuracy (CSA) metric: a sample is counted correct only if all checklist items pass under a holistic pass criterion.
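The all-or-nothing nature of CSA is easy to state in code. A minimal sketch, representing each sample as a list of per-item pass/fail booleans:

```python
def checklist_strict_accuracy(results: list[list[bool]]) -> float:
    """Checklist-based Strict Accuracy (CSA).

    A sample counts as correct only if every one of its checklist items
    passes; a single failed item zeroes out the whole sample.
    """
    if not results:
        return 0.0
    correct = sum(1 for checklist in results if all(checklist))
    return correct / len(results)
```

This strictness is why baseline scores on Mind-Bench are so low: partial compliance earns no credit.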

Experimental Results

Mind‑Bench: CSA rises from 0.02 (Qwen‑Image baseline) to 0.31, a >15× improvement, surpassing Stable Diffusion 3.5 Large and exceeding GPT‑Image‑1.5 (0.21).

WISE: WiScore of 0.78, a 25.8% gain over Qwen‑Image, matching the top‑ranked GPT‑Image‑1.

RISEBench: Instruction‑Reasoning score of 61.5 and overall accuracy of 24.7%, comparable to leading proprietary models and outperforming Bagel.

Qualitative visualizations (Figures 4, 19, 20) show successful retrieval of niche concepts and correct logical decomposition in math and geography tasks, avoiding the hallucinations seen in baselines.

Ablation Studies

Removing either the cognitive‑search agent or the reasoning agent degrades performance on their respective domains; their combination yields the best overall results (Table 3). Experiments with stronger backbones (e.g., GPT‑5.1 instead of Qwen‑3‑VL) and stronger image generators (e.g., GPT‑Image‑1) further amplify gains (Table 6).

Extended Benchmarks

On GenEval++ (instruction compliance) and Imagine‑Bench (creative generation), Mind‑Brush outperforms the agentic baseline GenAgent, especially on location/counting and spatio‑temporal transformation sub‑tasks.

Key Technical Components

Agentic Design: LLM‑style agents enable task decomposition and planning.

Active Retrieval: The cognition search agent can query external multimodal sources to obtain up‑to‑date facts.

External Reasoning Tools: The CoT reasoning agent performs multi‑step logical inference.

Concept Review: Filters and integrates evidence into a coherent master prompt.

Unified Generation: Conditions image synthesis on both textual alignment and adaptive visual cues, switching between generation and editing as needed.

Resources

Paper: https://arxiv.org/pdf/2602.01756

Code: https://github.com/PicoTrex/Mind-Brush

Dataset: https://huggingface.co/datasets/PicoTrex/Mind-Brush

Mind-Brush architecture diagram
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: benchmark, Image Generation, Agentic AI, multimodal reasoning, Mind‑Brush
Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
