From Direct Generation to Agentic Text-to-Image: Introducing the Open-Source Gen-Searcher
Gen-Searcher equips text-to-image models with search, reasoning, and web-browsing capabilities, turning the traditional direct-generation pipeline into an agentic system that fetches and verifies real-world knowledge, dramatically improving accuracy and quality across multiple benchmarks.
In the past two years, image‑generation models have rapidly improved in texture and aesthetics, yet most still follow a "direct generation" paradigm where a prompt is immediately turned into an image.
When prompts require real‑world knowledge, up‑to‑date information, obscure facts, or cross‑source verification, conventional text‑to‑image models often fail because they lack agentic abilities and rely solely on static parametric knowledge.
The research teams from Hong Kong University of Science and Technology (MMLab), UC Berkeley, and UCLA introduced Gen-Searcher, the first attempt to train a "deep search" agent for image‑generation tasks. Gen-Searcher enables the model to search, reason, retrieve images, and browse the web before producing the final picture, and all data, models, and code are open‑source.
The authors first built a dataset of generation tasks that require real‑world search, covering roughly 20 categories such as celebrities, anime, physics, chemistry, art, architecture, and news. Using a strong model paired with search tools, they generated multi‑turn trajectories, collected textual knowledge and visual evidence, and synthesized target images with Nano Banana Pro, yielding about 30 k raw samples. After a Seed1.8 filtering step, roughly 17 k high‑quality samples remained, organized into Gen-Searcher‑SFT‑10k and Gen-Searcher‑RL‑6k.
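To make the data pipeline concrete, here is a minimal Python sketch of the filter-and-split step. The quality threshold, field names, and function names are illustrative assumptions, not details from the paper; the summary above only states that ~30k raw samples were filtered down to ~17k and split into SFT and RL subsets.

```python
import random

QUALITY_THRESHOLD = 0.8  # hypothetical cutoff for the filtering model's score


def filter_and_split(raw_samples, sft_size=10_000, rl_size=6_000, seed=0):
    """Keep high-quality trajectories, then split them for SFT and RL."""
    kept = [s for s in raw_samples if s["quality_score"] >= QUALITY_THRESHOLD]
    random.Random(seed).shuffle(kept)
    return kept[:sft_size], kept[sft_size:sft_size + rl_size]


# Stand-in for the ~30k raw trajectories with scores from a filtering model.
raw = [{"id": i, "quality_score": random.random()} for i in range(30_000)]
sft_set, rl_set = filter_and_split(raw)  # -> SFT and RL training subsets
```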
They also introduced a new benchmark called KnowGen, containing 630 manually verified samples for evaluating image‑generation agents.
The core of Gen-Searcher is a trainable agent that, instead of directly generating from a prompt, first decides in multiple interaction rounds when to search, what to search, whether to browse webpages, and whether to add visual references, finally outputting an accurate prompt and reference images. The agent is equipped with three tool types: text search, image search, and web browsing.
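In pseudocode, that loop might look like the sketch below. Only the three tool types come from the paper; the `agent_step` interface, tool names, and round budget are hypothetical stand-ins for whatever the released code actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    prompt: str                                      # user's original request
    notes: list = field(default_factory=list)        # accumulated textual evidence
    references: list = field(default_factory=list)   # collected reference images


def run_gen_searcher(user_prompt, agent_step, tools, max_rounds=8):
    """Iterate: the agent picks a tool (or stops), evidence accumulates,
    and the loop ends with a refined prompt plus reference images."""
    state = AgentState(prompt=user_prompt)
    for _ in range(max_rounds):
        action = agent_step(state)  # model decides: search, browse, or finish
        if action.kind == "finish":
            return action.final_prompt, state.references
        # action.kind is one of "text_search", "image_search", "browse"
        result = tools[action.kind](action.query)
        if action.kind == "image_search":
            state.references.extend(result)
        else:
            state.notes.append(result)
    # Round budget exhausted: fall back to the best prompt so far.
    return state.prompt, state.references
```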
Training proceeds in two stages: supervised fine‑tuning (SFT) teaches tool usage, followed by agentic reinforcement learning (RL) that optimizes search strategies and long‑term decisions. A dual‑reward system combines an image reward with a textual reward that evaluates whether the generated prompt contains sufficient and correct information, ensuring the model both "draws well" and "searches correctly".
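Conceptually, the dual reward is a blend of the two signals. The sketch below assumes equal weights and rewards normalized to [0, 1], neither of which is specified in the summary above.

```python
def dual_reward(image_reward, text_reward, w_image=0.5, w_text=0.5):
    """Blend visual quality with prompt correctness/sufficiency.
    Weights and normalization are illustrative assumptions."""
    return w_image * image_reward + w_text * text_reward


r = dual_reward(image_reward=0.9, text_reward=0.6)  # -> 0.75
```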
Experimental results show substantial gains. On the KnowGen benchmark, the original Qwen‑Image model achieved a K‑Score of 14.98, which rose to 31.52 after integrating Gen‑Searcher‑8B (+16.54). The improvement transfers to other generators: Seedream 4.5 increased from 31.01 to 47.29, and Nano Banana Pro rose from 50.38 to 53.30. Similar large improvements were observed on the WISE benchmark, and visual analyses confirm that Gen‑Searcher markedly enhances both accuracy and quality of generated images.
In summary, Gen‑Searcher demonstrates the potential of agentic generation for knowledge‑intensive image‑generation tasks and provides a clear pathway toward integrated systems that combine search, reasoning, and generation, marking a significant step toward the agentic era of multimodal AI.