How EvoSearch Boosts Image & Video Generation with Test‑Time Evolutionary Search

The EvoSearch method, introduced by HKUST and Kuaishou's Kling team, uses test‑time scaling to substantially improve diffusion‑based image and video generation without any training: an evolutionary search along the denoising trajectory that achieves state‑of‑the‑art results on Stable Diffusion 2.1, FLUX.1‑dev, and other models.


Test‑Time Scaling in Vision

Test‑time scaling (TTS) has driven major gains in large language models; in the visual domain, it aims to improve image and video generation by spending more computation at inference time, without updating model parameters.

1. Essence of Test‑Time Scaling

Given a pretrained generative model p(x) and a reward function r(x) representing human preference, TTS seeks to sample from a target distribution proportional to exp(r(x))·p(x), which trades off reward maximization against a small KL divergence from the pretrained model. Direct sampling from this target is infeasible, however, because diffusion and flow models operate in high‑dimensional state spaces.
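In standard notation (ours, not necessarily the paper's), this target is the solution of a KL‑regularized reward‑maximization problem:

```latex
% KL-regularized formulation of the TTS target distribution.
% Maximizing expected reward minus a KL penalty toward the pretrained
% model p yields the exponentially tilted distribution from the text.
\pi^{*} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{x\sim\pi}\!\big[r(x)\big]
  \;-\; \mathrm{KL}\big(\pi \,\|\, p\big)
\quad\Longrightarrow\quad
\pi^{*}(x) \;\propto\; p(x)\,\exp\!\big(r(x)\big).
```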


2. Limitations of Existing Methods

RL‑based post‑training requires gradient updates and heavy compute. Best‑of‑N and particle sampling improve quality without training, but they can only select among the states they start with — they do not explore new ones — and selection pressure tends to reduce sample diversity.
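As a baseline, Best‑of‑N is easy to state: draw N candidates and keep the one the reward model scores highest. A minimal sketch, with stand‑in sampler and reward callables (these names are ours, purely for illustration):

```python
import random

def best_of_n(sample, reward, n=8, seed=0):
    """Draw n candidates from `sample` and return the highest-reward one.

    `sample` and `reward` stand in for a diffusion sampler and a
    human-preference reward model; here they are plain callables.
    """
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)

# Toy usage: "samples" are scalars in [0, 1]; the reward prefers 0.5.
best = best_of_n(sample=lambda rng: rng.random(),
                 reward=lambda x: -abs(x - 0.5),
                 n=64)
```

Note that every candidate is drawn independently up front: nothing guides the search toward promising regions, which is exactly the lack of exploration EvoSearch addresses.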

3. EvoSearch: Evolutionary Test‑Time Scaling

EvoSearch treats the denoising trajectory of diffusion/flow models as an evolutionary path: at each denoising step, the current population can mutate to explore higher‑quality offspring, ultimately yielding samples that match the target distribution.

Two mutation modes are used: initial‑noise mutation (an orthogonal operation preserving the Gaussian distribution) and intermediate‑state mutation inspired by stochastic differential equations, controlled by a mutation rate σ.
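The two operators can be sketched as follows. This is our own illustration under simplifying assumptions (the paper's exact operators may differ): the initial‑noise mutation interpolates with fresh noise so that a standard Gaussian input remains standard Gaussian, and the intermediate‑state mutation adds scaled Gaussian noise in the spirit of an SDE sampler.

```python
import numpy as np

def mutate_noise(x, sigma, rng):
    """Mutate an initial noise sample while preserving N(0, I).

    If x ~ N(0, I) and eps ~ N(0, I) are independent, then
    sqrt(1 - sigma^2) * x + sigma * eps is again N(0, I), so the
    mutated population still follows the diffusion prior.
    """
    eps = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - sigma**2) * x + sigma * eps

def mutate_state(x_t, sigma, rng):
    """Mutate an intermediate denoising state with additive Gaussian
    noise; the mutation rate sigma controls exploration strength."""
    return x_t + sigma * rng.standard_normal(x_t.shape)
```

The key property is the first operator's distribution preservation: mutated initial noise is statistically indistinguishable from freshly drawn noise, so the pretrained sampler stays on‑distribution.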

An evolution schedule and a population‑size schedule determine when evolution steps are triggered and how many samples are kept at each one, adapting the search to the available test‑time compute budget.
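Putting the pieces together, the overall loop can be sketched schematically. All names here are ours, and the toy "denoiser" and reward are stand‑ins; the real method runs this selection‑and‑mutation cycle inside a diffusion or flow sampler:

```python
import numpy as np

def evo_search(denoise_step, reward, steps=10, pop=16, elite=4,
               sigma=0.3, evolve_every=2, seed=0):
    """Schematic evolutionary search along a denoising trajectory.

    Every `evolve_every` steps, keep the `elite` highest-reward
    particles and refill the population with Gaussian mutations of
    them; finally return the best particle found.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((pop, 2))          # toy 2-D "noise" samples
    for t in range(steps):
        x = denoise_step(x, t)                 # advance all particles
        if t % evolve_every == 0:
            scores = np.array([reward(xi) for xi in x])
            parents = x[np.argsort(scores)[-elite:]]           # selection
            children = parents[rng.integers(elite, size=pop - elite)]
            children = children + sigma * rng.standard_normal(children.shape)
            x = np.concatenate([parents, children])            # mutate + refill
    scores = np.array([reward(xi) for xi in x])
    return x[int(np.argmax(scores))]

# Toy usage: "denoising" shrinks samples toward the origin; the reward
# prefers points near (1, 0).
best = evo_search(denoise_step=lambda x, t: 0.9 * x,
                  reward=lambda xi: -np.linalg.norm(xi - np.array([1.0, 0.0])))
```

A fixed `evolve_every` and constant `pop` are the simplest schedules; the method's evolution and population‑size schedules generalize these to match the compute budget.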


4. Experimental Results

On image generation tasks, EvoSearch scales better than prior methods on Stable Diffusion 2.1 and FLUX.1‑dev, continuing to improve as test‑time compute grows by up to 10⁴×. On video generation, it achieves the largest reward gains on VBench, VBench‑2.0, and VideoGen‑Eval.

It also generalizes to out‑of‑distribution metrics, demonstrating strong robustness, and attains the best human‑evaluation win rate thanks to higher diversity.


5. Resources

Paper: https://arxiv.org/abs/2505.17618

Project page: https://tinnerhrhe.github.io/evosearch/

Code repository: https://github.com/tinnerhrhe/EvoSearch-codes

Tags: Video Generation, Image Generation, Diffusion Models, Evolutionary Search, Test-Time Scaling
Written by

Kuaishou Large Model

Official Kuaishou Account
