Is One‑Prompt Image Generation Obsolete? Meet GenEvolve’s Tool‑Orchestrated Agents

GenEvolve introduces a self‑evolving image‑generation agent that orchestrates search, reference‑retrieval, and knowledge‑query tools into a prompt‑reference program. It is trained with teacher‑student SFT followed by visual‑experience self‑distillation, and it achieves a higher KScore both with an open‑source generator and with a stronger one.

Data Party THU

Background

Traditional image generation uses a single prompt, which cannot reliably satisfy complex demands such as aligning with landmarks, specific objects, or detailed visual constraints. A single forward pass of a generative model often fails on open‑ended tasks.

GenEvolve Framework

GenEvolve (“Self‑Evolving Image Generation Agents via Tool‑Orchestrated Visual Experience Distillation”) defines a tool‑orchestration trajectory. An agent first interprets the request, then invokes three categories of tools:

search(q) – textual fact retrieval
image_search(q) – visual reference retrieval
query_knowledge(skill) – activation of internal knowledge for layout, material consistency, etc.

The agent assembles external evidence and constraints into a prompt‑reference program that is passed to a downstream image generator.
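To make the idea of a prompt‑reference program concrete, here is a minimal Python sketch. Only the three tool names come from the description above; the PromptReferenceProgram class, its fields, the build_program helper, and the stub tools are hypothetical illustrations and may not match the schema used in the GenEvolve code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A tool takes a query string and returns a list of strings (assumed signature).
Tool = Callable[[str], List[str]]


@dataclass
class PromptReferenceProgram:
    """Hypothetical container for what the agent hands to the image generator."""
    prompt: str
    reference_images: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)


def build_program(request: str, tools: Dict[str, Tool], skill: str = "layout") -> PromptReferenceProgram:
    """Call each tool once and fold the results into a single program."""
    facts = tools["search"](request)            # textual fact retrieval
    references = tools["image_search"](request)  # visual reference retrieval
    guidance = tools["query_knowledge"](skill)   # internal knowledge for a skill
    prompt = request
    if facts:
        prompt += " | facts: " + "; ".join(facts)
    return PromptReferenceProgram(prompt, references, guidance)


# Stub tools so the sketch runs offline; a real agent would call actual services.
stub_tools: Dict[str, Tool] = {
    "search": lambda q: [f"stub fact for '{q}'"],
    "image_search": lambda q: ["ref_0.png"],
    "query_knowledge": lambda skill: [f"respect {skill} guidance"],
}

program = build_program("a poster of the Eiffel Tower at dusk", stub_tools)
print(program)
```

In this sketch the agent calls every tool exactly once in a fixed order. The trained agent described in the article instead decides which tools to call and with which queries, which is the part the training procedure is meant to improve.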

GenEvolve tool orchestration diagram

Data and Training

To train the agent, the authors constructed GenEvolve‑Data and GenEvolve‑Bench. They start from ~20k structured recipes covering entities, landmarks, products, events, text, layout, counting, attributes, anatomy, material, aesthetics, and creative transformations. Each request is processed by a Teacher Agent that runs the full tool pipeline and produces a prompt‑reference program. The resulting trajectories undergo program checks, VLM audits, ground‑truth (GT) image rendering, and visual filtering, and are then split into SFT trajectories, self‑evolution samples, and benchmark entries.
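The filter‑then‑split step can be sketched as follows. This is an illustration only: the check functions, the trajectory fields, the split ratios, and the random seed are all assumptions, since the article does not state how the authors size or assign the three subsets.

```python
import random
from typing import Callable, Dict, List

Trajectory = Dict[str, object]
Check = Callable[[Trajectory], bool]


def filter_and_split(
    trajectories: List[Trajectory],
    checks: List[Check],
    ratios=(0.6, 0.3),  # hypothetical SFT / self-evolution shares; the rest is benchmark
    seed: int = 0,
) -> Dict[str, List[Trajectory]]:
    """Keep trajectories that pass every check, then split them three ways."""
    kept = [t for t in trajectories if all(check(t) for check in checks)]
    random.Random(seed).shuffle(kept)
    n_sft = int(len(kept) * ratios[0])
    n_evo = int(len(kept) * ratios[1])
    return {
        "sft": kept[:n_sft],
        "self_evolution": kept[n_sft:n_sft + n_evo],
        "benchmark": kept[n_sft + n_evo:],
    }


# Toy stand-ins for the program check and the VLM audit (both hypothetical).
def has_prompt(t: Trajectory) -> bool:
    return bool(t.get("prompt"))


def passes_audit(t: Trajectory) -> bool:
    return float(t.get("audit_score", 0.0)) >= 0.5


toy = [{"prompt": f"request {i}", "audit_score": 0.9} for i in range(8)]
toy += [{"prompt": "", "audit_score": 0.9}, {"prompt": "low quality", "audit_score": 0.1}]

splits = filter_and_split(toy, [has_prompt, passes_audit])
print({name: len(items) for name, items in splits.items()})
```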

GenEvolve data pipeline

Self‑Evolution and Visual Experience Distillation

Training proceeds in two stages. First, high‑quality Teacher trajectories are used for an SFT cold start on Qwen3‑VL‑8B‑Instruct, teaching basic tool calls and program composition.

In the second stage, a rollout phase samples multiple trajectories per request, renders images, and scores them with visual and textual evaluators. GRPO (Group Relative Policy Optimization) provides a trajectory‑level reward signal, while visual‑experience self‑distillation compares the best and worst trajectories for a request and extracts a structured Decision Guide (e.g., which search query to use, which reference to select, which constraints to avoid). A privileged teacher receives the guide as extra input; the student sees only the ordinary inputs. Token‑level KL distillation then transfers the teacher’s preferences into the student’s parameters, so the model learns decision habits rather than memorizing specific examples.
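The two signals in the second stage can be illustrated with a small dependency‑free sketch. The first function computes group‑relative advantages in the usual GRPO style (each rollout's reward normalized by the mean and standard deviation of its group). The second computes a token‑level KL divergence between teacher and student next‑token distributions. The direction KL(teacher || student), the averaging over positions, and the toy numbers are assumptions; the article does not give GenEvolve's exact loss.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against the other rollouts for the same request."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kl(teacher_logits: List[List[float]], student_logits: List[List[float]]) -> float:
    """Mean over token positions of KL(teacher || student); the direction is an assumption."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)


# Toy rewards for three rollouts of one request (made-up numbers).
rewards = [0.2, 0.5, 0.8]
advantages = group_relative_advantages(rewards)

# Toy logits: two token positions over a three-token vocabulary.
teacher = [[2.0, 0.5, 0.1], [0.1, 1.5, 0.3]]  # teacher conditioned on the Decision Guide
student = [[1.0, 1.0, 0.1], [0.2, 0.9, 0.8]]  # student sees only the ordinary inputs

kl_same = token_level_kl(teacher, teacher)
kl_diff = token_level_kl(teacher, student)
print(advantages, kl_same, kl_diff)
```

Minimizing the KL term pulls the student's token distribution toward the guide‑informed teacher's without the student ever seeing the guide, which is one way to read "learning decision habits" in the description above.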

Self‑distillation workflow

Experimental Results

On GenEvolve‑Bench with the open‑source Qwen‑Image‑Edit‑2511 as the base generator, GenEvolve achieves an overall KScore of 0.3663, surpassing the Gen‑Searcher baseline (0.3493). The advantage is pronounced on Knowledge‑Anchored tasks that require factual and visual detail.

When paired with the stronger Nano Banana Pro renderer, GenEvolve’s KScore rises to 0.5739, exceeding the bare Nano Banana Pro score of 0.5298, demonstrating transferability of the learned tool‑orchestration strategy across generators.
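For scale, the reported KScore gaps can be expressed as absolute and relative gains. The snippet below uses only the four numbers quoted above; the relative_gain helper is just illustrative arithmetic, not part of the benchmark tooling.

```python
def relative_gain(score: float, baseline: float) -> float:
    """Fractional improvement of score over baseline."""
    return (score - baseline) / baseline


# (GenEvolve KScore, comparison KScore) per generator, as quoted in the text.
comparisons = {
    "Qwen-Image-Edit-2511 (vs Gen-Searcher)": (0.3663, 0.3493),
    "Nano Banana Pro (vs bare generator)": (0.5739, 0.5298),
}

gains = {name: relative_gain(s, b) for name, (s, b) in comparisons.items()}
for name, gain in gains.items():
    score, baseline = comparisons[name]
    print(f"{name}: +{score - baseline:.4f} absolute, +{gain:.1%} relative")
```

This works out to roughly +0.017 (about 4.9% relative) on the open‑source generator and +0.044 (about 8.3% relative) on Nano Banana Pro.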

Ablation studies show that an untuned Qwen3‑VL workflow can already use tool inputs but is unstable; SFT improves tool invocation and program quality; GRPO adds trajectory‑level optimization; visual‑experience self‑distillation further boosts Visual Correctness, Knowledge‑Anchored, and Quality‑Anchored dimensions.

Cross‑domain evaluation on the public WISE benchmark (no in‑domain fine‑tuning) yields a WiScore of 0.82 for GenEvolve with Qwen‑Image‑Edit, outperforming GPT‑4o’s 0.80.

Conclusion

GenEvolve shifts open‑ended image generation from single‑prompt optimization to a learnable tool‑orchestration process. By integrating external knowledge, visual references, and hard constraints through a self‑evolving agent, it improves performance with both generators tested and provides a reproducible foundation for future research on image‑generation agents, tool usage, reinforcement learning from visual feedback, and open‑ended generation evaluation.

Paper: https://arxiv.org/abs/2605.21605

Code: https://github.com/MeiGen-AI/GenEvolve

Data & Benchmarks: https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
