Beyond One-Word Prompts: How the Open-Source GenEvolve Agent Uses Tool Orchestration for Image Generation

GenEvolve is an open-source, self-evolving image-generation agent that orchestrates text search, image search, and knowledge tools into a prompt-reference program for knowledge-anchored and quality-anchored tasks. In the reported experiments it outperforms the Gen-Searcher baseline with both an open-source renderer and a stronger one, and its data and code are openly released.

Machine Heart

Image generation is shifting from single-prompt use toward open-ended tasks. Users often need outputs that match real landmarks, people, products, or events, that follow visual references, and that satisfy hard constraints. A single forward pass of a generator cannot reliably meet these demands.

GenEvolve Framework

Researchers from HKUST (Guangzhou), Meituan, HKUST, and NUS introduced GenEvolve, a self‑evolving image‑generation agent that treats generation as a tool‑orchestration trajectory. The agent first interprets the request, then invokes a suite of tools (text search, image search, and knowledge query), and finally assembles the results into a prompt‑reference program for downstream generators.

Open Generation Demand Types

GenEvolve distinguishes two demand categories:

Knowledge‑Anchored: results depend on external world knowledge such as real buildings, public figures, product structures, or event clues.

Quality‑Anchored: results require verifiable visual constraints such as text, counting, layout, attribute binding, anatomy, material, and aesthetics.

To satisfy them, the agent is equipped with three tool types: search(q) for factual evidence, image_search(q) for visual references, and query_knowledge(skill) for activating internal skills needed for complex rendering.
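The three tool entry points can be pictured as a small dispatch table. The sketch below is only an illustration: the tool names come from the article, but the `ToolResult` type, the `call_tool` helper, and the stubbed return values are assumptions, not the released GenEvolve API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolResult:
    """Hypothetical container for what a tool call returns."""
    tool: str
    query: str
    payload: List[str]


def search(q: str) -> ToolResult:
    # Stub: a real implementation would return factual text evidence.
    return ToolResult("search", q, [f"fact about {q}"])


def image_search(q: str) -> ToolResult:
    # Stub: a real implementation would return visual reference images.
    return ToolResult("image_search", q, [f"ref://{q.replace(' ', '_')}.jpg"])


def query_knowledge(skill: str) -> ToolResult:
    # Stub: a real implementation would return guidance for a rendering skill.
    return ToolResult("query_knowledge", skill, [f"guideline for {skill}"])


# Dispatch table keyed by the tool names used in the article.
TOOLS: Dict[str, Callable[[str], ToolResult]] = {
    "search": search,
    "image_search": image_search,
    "query_knowledge": query_knowledge,
}


def call_tool(name: str, arg: str) -> ToolResult:
    """Route a tool call by name; unknown tools are rejected explicitly."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](arg)
```

Keeping every tool behind one string-to-function table means the agent only has to emit a tool name and an argument, which is one plausible way to make tool calls easy to parse and validate.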

Multi‑Round Decision Process

Generation therefore becomes a multi‑round decision: what to search, which reference image to select, which knowledge skill to invoke, and which constraints to embed in the final program, rather than merely writing a longer prompt.
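A minimal sketch of such a multi-round trajectory follows. It assumes a scripted list of actions in place of the learned policy, and the `PromptReferenceProgram` fields (`prompt`, `references`, `constraints`) are hypothetical; the article does not specify the actual program schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PromptReferenceProgram:
    """Hypothetical schema for the final program handed to a renderer."""
    prompt: str
    references: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)


def run_trajectory(
    request: str,
    actions: List[Tuple[str, str]],
    max_rounds: int = 8,
) -> PromptReferenceProgram:
    """Execute up to max_rounds (tool, argument) steps and assemble a program."""
    facts: List[str] = []
    refs: List[str] = []
    constraints: List[str] = []
    for kind, arg in actions[:max_rounds]:
        if kind == "search":
            facts.append(f"fact({arg})")        # stubbed factual evidence
        elif kind == "image_search":
            refs.append(f"ref({arg})")          # stubbed reference image
        elif kind == "query_knowledge":
            constraints.append(f"rule({arg})")  # stubbed skill guidance
        else:
            raise ValueError(f"unknown action: {kind}")
    # Fold the retrieved evidence into the prompt text, if any was found.
    prompt = request if not facts else f"{request} | evidence: {'; '.join(facts)}"
    return PromptReferenceProgram(prompt, refs, constraints)


program = run_trajectory(
    "poster of the Eiffel Tower with the title 'PARIS'",
    [
        ("search", "Eiffel Tower"),
        ("image_search", "Eiffel Tower"),
        ("query_knowledge", "text rendering"),
    ],
)
```

In the trained agent the action list would be chosen step by step by the model rather than supplied up front.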

Data and Benchmark Construction

The team built GenEvolve‑Data and GenEvolve‑Bench from roughly 20,000 structured recipes covering entities, landmarks, products, events, text, layout, counting, attributes, anatomy, material, aesthetics, and creative transformation. A Teacher Agent processes each request by executing the full tool pipeline. The resulting trajectories then pass through program checks, a VLM audit, ground‑truth rendering, and visual filtering, and the surviving samples are split into SFT trajectories, self‑evolution samples, and benchmark entries.
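The filtering part of this pipeline can be sketched as a chain of checks in which a sample is dropped at the first stage it fails. Everything concrete here is an assumption made for illustration: the record fields, the individual predicates, and the 0.5 visual-score threshold are hypothetical, and ground-truth rendering is omitted because it produces data rather than filtering it.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Record = Dict[str, Any]
Stage = Tuple[str, Callable[[Record], bool]]

# Ordered checks; field names and the threshold are hypothetical.
STAGES: List[Stage] = [
    ("program_check", lambda r: bool(r.get("program"))),
    ("vlm_audit", lambda r: r.get("audit_ok") is True),
    ("visual_filter", lambda r: float(r.get("visual_score", 0.0)) >= 0.5),
]


def first_failed_stage(record: Record) -> Optional[str]:
    """Return the name of the first stage the record fails, or None if it passes all."""
    for name, check in STAGES:
        if not check(record):
            return name
    return None


def filter_records(records: List[Record]) -> List[Record]:
    """Keep only the records that pass every stage."""
    return [r for r in records if first_failed_stage(r) is None]
```

Recording which stage rejected a sample, rather than only a pass or fail flag, makes it easier to see where a data pipeline loses most of its candidates.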

Training Procedure

Training proceeds in two steps. First, high‑quality Teacher trajectories are used for supervised fine‑tuning (SFT) of Qwen3‑VL‑8B‑Instruct, which teaches basic tool calls and program composition. Second, a rollout stage samples multiple trajectories per request, scores them with visual and textual evaluators, and applies GRPO so that each trajectory receives a trajectory‑level reward signal.
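The core of GRPO's trajectory-level signal is a group-relative advantage: each trajectory's reward is compared against the other trajectories sampled for the same request. The sketch below shows the commonly used mean and standard-deviation normalization; whether GenEvolve uses exactly this form (population standard deviation, this `eps`) is an assumption, and the reward values in the example are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of trajectories for the same request.

    Trajectories scoring above the group mean get a positive advantage,
    those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical evaluator scores for three rollouts of one request.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Because the advantage is relative to the group, a request on which all rollouts score the same contributes no gradient signal, which is why sampling several diverse trajectories per request matters.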

Visual Experience Self‑Distillation

Trajectory‑level rewards indicate which trajectory is better but not why. GenEvolve compares the best and worst trajectories for the same request, extracts a structured Decision Guide (search target, reference choice, constraint pitfalls), and supplies it to a privileged teacher. The teacher then generates improved token distributions, and token‑level KL distillation transfers this decision habit to the student model, teaching it how to search, select references, and organize constraints when encountering similar requests.
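Token-level KL distillation can be sketched with plain Python as the per-token KL divergence between teacher and student next-token distributions, averaged over the sequence. The direction KL(teacher || student) and the simple mean over tokens are assumptions; the article does not state which variant GenEvolve uses, and a real implementation would operate on tensors.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) for a single token position."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def sequence_kl_loss(
    teacher: Sequence[Sequence[float]],
    student: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL over a sequence of aligned token positions."""
    if len(teacher) != len(student) or not teacher:
        raise ValueError("teacher and student need the same non-zero length")
    return sum(token_kl(t, s) for t, s in zip(teacher, student)) / len(teacher)
```

Minimizing this loss pulls the student's distribution at every token toward the teacher's, which is how guidance that only the privileged teacher sees can end up in the student's own decisions.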

Experimental Results

On the self‑built GenEvolve‑Bench, with the open‑source Qwen‑Image‑Edit‑2511 renderer, GenEvolve achieves a KScore of 0.3663 versus 0.3493 for Gen‑Searcher. With the stronger Nano Banana Pro renderer, its KScore rises to 0.5739 compared to 0.5298, suggesting that the learned tool‑orchestration strategy transfers across renderers.
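To put these gaps in proportion, the short calculation below derives absolute and relative gains from the four KScore values quoted above. It assumes that 0.5298 is the Gen-Searcher score under the Nano Banana Pro renderer, which the text implies but does not state outright.

```python
def absolute_gain(new: float, base: float) -> float:
    """Difference in score points."""
    return new - base


def relative_gain(new: float, base: float) -> float:
    """Gain as a fraction of the baseline score."""
    return (new - base) / base


# KScore values quoted in the article: (GenEvolve, baseline) per renderer.
scores = {
    "Qwen-Image-Edit-2511": (0.3663, 0.3493),
    "Nano Banana Pro": (0.5739, 0.5298),
}

for renderer, (genevolve, baseline) in scores.items():
    print(
        f"{renderer}: +{absolute_gain(genevolve, baseline):.4f} absolute, "
        f"+{relative_gain(genevolve, baseline):.1%} relative"
    )
```

The relative gain works out to roughly 4.9% with the open-source renderer and roughly 8.3% with the stronger one, so the margin over the baseline is larger, not smaller, when the renderer improves.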

Ablation studies show that an untuned Qwen3‑VL workflow can use tool entry points but is unstable; SFT improves tool invocation and program quality; GRPO adds trajectory‑level optimization; and visual‑experience distillation further boosts Visual correctness, Knowledge‑Anchored, and Quality‑Anchored dimensions.

Out‑of‑domain evaluation on the WISE benchmark (no in‑domain fine‑tuning) yields a WiScore of 0.82, surpassing GPT‑4o’s 0.80 and supporting the approach’s ability to generalize.

Conclusion

GenEvolve moves open image generation from single‑prompt optimization to a learnable tool‑orchestration process, enabling tasks that require external knowledge, reference consistency, and multiple hard constraints. All models, code, data, and benchmarks are open‑source, providing a reproducible foundation for research on image‑generation agents, tool usage, visual‑feedback reinforcement learning, and open‑generation evaluation.
