AdaGen: Enabling Adaptive, Data‑Driven Strategies for Image Generation Models

AdaGen replaces the handcrafted, static schedules of multi-step image generators with a universal, learnable policy network trained via reinforcement learning. Built on an MDP formulation with adversarial rewards and action smoothing, it achieves consistent quality and efficiency gains across diffusion, autoregressive, mask-based, and flow models while adding negligible overhead.

Machine Heart

Motivation: From Static Hand‑Crafted Schedules to Adaptive Policies

Current multi‑step image generation models—including diffusion (e.g., DiT), autoregressive (e.g., VAR), mask‑based (e.g., MaskGIT) and flow (e.g., SiT) models—share a common paradigm of decomposing generation into a series of controllable steps. This paradigm requires a large set of hyper‑parameters (noise level, sampling temperature, guidance scale, etc.) that are typically managed by static, manually designed scheduling rules. Two major drawbacks are identified: (1) the need for extensive expert knowledge and repeated tuning, and (2) a "one‑size‑fits‑all" static strategy that cannot accommodate the unique characteristics of each sample.

AdaGen: A Universal, Learnable, Sample‑Adaptive Generation‑Strategy Framework

The paper proposes AdaGen, a framework that learns an adaptive policy for each sample. By training a lightweight policy network with reinforcement learning (PPO), AdaGen automatically selects optimal generation parameters conditioned on the current generation state, while keeping the pretrained generator frozen.

Unified MDP Modeling Across Four Paradigms

AdaGen models the scheduling problem of all four major generation paradigms as a Markov Decision Process (MDP). The MDP defines:

State: the current generation step together with intermediate results (partial token sequences for MaskGIT and VAR, partially denoised images for diffusion and flow models).

Action: the strategy parameters required by each paradigm (e.g., mask ratio, sampling temperature, and guidance scale for MaskGIT; ODE time step and guidance scale for diffusion/flow; temperature and guidance scale for autoregressive models).

Transition: deterministic for diffusion and flow models (ODE solver), stochastic for mask-based and autoregressive models.

Reward: evaluated only at the final step, using a quality assessment function r(x).

The policy network is treated as an RL agent that observes the state, outputs actions, and is optimized with PPO to maximize the final‑step reward.
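
To make the formulation concrete, the sketch below shows what such a per-step policy could look like: a small MLP that maps a state embedding and the normalized step index to a Gaussian over continuous strategy parameters. The class name, layer sizes, and interface are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class AdaGenPolicy(nn.Module):
    """Minimal sketch of a lightweight per-step policy network.

    Maps a state embedding plus the normalized step index to a Gaussian
    distribution over continuous strategy parameters (e.g., guidance scale,
    temperature). Architecture and names are illustrative assumptions,
    not the paper's exact design.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)               # action means
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned std

    def forward(self, state_emb: torch.Tensor, t: int, num_steps: int):
        # Condition on t / num_steps so a single network serves every step.
        step = torch.full((state_emb.shape[0], 1), t / num_steps,
                          device=state_emb.device)
        h = self.backbone(torch.cat([state_emb, step], dim=-1))
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())
```

At each generation step, an action sampled from this distribution would be mapped into the valid parameter range (e.g., via a sigmoid or affine transform) and handed to the frozen generator.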

Adversarial Reward Modeling to Prevent Shortcutting

The authors explore three reward designs:

FID as a direct reward: this yields a low FID score (e.g., 2.56) but poor visual fidelity, because the policy learns to game the metric.

A pretrained reward model: this improves fidelity but causes severe mode collapse and low diversity.

Adversarial reward (AdaGen's approach): a discriminator trained to distinguish real from generated images provides the reward, forming a GAN-like game that balances fidelity and diversity.

Empirical results show that the adversarial reward achieves a good trade‑off, avoiding the pitfalls of the other two designs.
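
As a concrete illustration, here is a minimal sketch of an adversarial reward, assuming a discriminator module that outputs a realness logit per image. Using log D(x) as the scalar reward is a standard GAN-style choice; the paper's exact reward form may differ.

```python
import torch
import torch.nn.functional as F


def adversarial_reward(discriminator: torch.nn.Module,
                       images: torch.Tensor) -> torch.Tensor:
    """Score generated images by the discriminator's 'realness'.

    Sketch only: log-sigmoid of the realness logit, i.e. log D(x), serves
    as the per-image scalar reward handed to the RL optimizer.
    """
    with torch.no_grad():  # the reward is a signal for PPO, not backprop
        logits = discriminator(images)        # (B, 1) realness logits
    return F.logsigmoid(logits).squeeze(-1)   # higher = judged more real
```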

Action Smoothing for Stable Exploration

When the number of generation steps increases (e.g., T from 8 to 32), the action space expands dramatically, and PPO training becomes unstable because high-frequency noise is injected at every step. The paper observes that optimal policies for iterative generation are smooth over time. To enforce smoothness, AdaGen applies an exponential moving average (EMA) filter to the raw policy output a_t:

a_t^{smooth} = \beta \, a_{t-1}^{smooth} + (1 - \beta) \, a_t

This operation acts as a causal low-pass filter: it suppresses high-frequency fluctuations without looking ahead, so the Markov property of the MDP is preserved. Experiments show that action smoothing reduces FID from 3.5 to 2.3 and stabilizes training.
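
The filter itself is only a few lines. Below is a minimal sketch (the function name and the choice of initializing the filter state with the first raw action are assumptions); in practice the filter would run online, one step at a time, inside the sampling loop.

```python
import torch


def smooth_actions(raw_actions: list[torch.Tensor],
                   beta: float = 0.9) -> list[torch.Tensor]:
    """Causal EMA low-pass filter over a per-step action sequence.

    Implements a_t^smooth = beta * a_{t-1}^smooth + (1 - beta) * a_t.
    Each output depends only on past actions, so causality (and hence
    the Markov property) is preserved.
    """
    smoothed = [raw_actions[0]]  # assumed initialization: a_0^smooth = a_0
    for a in raw_actions[1:]:
        smoothed.append(beta * smoothed[-1] + (1.0 - beta) * a)
    return smoothed
```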

Training Loop

The training consists of two alternating steps:

Policy Network Optimization: generate images with the current policy, compute rewards, and update the policy via PPO.

Reward Model (Discriminator) Optimization: sample real and generated images and train the discriminator to better separate them.

The two steps form an adversarial training loop similar to that of GANs.
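
Putting the pieces together, the alternating loop might look like the following sketch, reusing the adversarial_reward helper above. Here, rollout and ppo_update are hypothetical stand-ins for a standard PPO implementation; none of the names come from the paper.

```python
import torch
import torch.nn.functional as F


def train_adagen(policy, discriminator, generator, real_batches,
                 rollout, ppo_update, disc_opt,
                 num_iters: int, num_steps: int) -> None:
    """Sketch of the alternating AdaGen training loop (all names assumed).

    The pretrained generator stays frozen; only the policy network and the
    discriminator (the adversarial reward model) are updated.
    """
    for _ in range(num_iters):
        # Step 1: roll the current policy through the frozen generator,
        # score the final images adversarially, and update via PPO.
        traj = rollout(policy, generator, num_steps)
        rewards = adversarial_reward(discriminator, traj.images)
        ppo_update(policy, traj, rewards)

        # Step 2: train the discriminator to separate real from generated.
        real = next(real_batches)
        fake = traj.images.detach()
        real_logits = discriminator(real)
        fake_logits = discriminator(fake)
        d_loss = (
            F.binary_cross_entropy_with_logits(
                real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(
                fake_logits, torch.zeros_like(fake_logits))
        )
        disc_opt.zero_grad()
        d_loss.backward()
        disc_opt.step()
```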

Experimental Results

AdaGen is evaluated on ImageNet (256×256) across four generation paradigms and six models. Key findings:

Across all paradigms and inference step counts, AdaGen consistently outperforms the corresponding baselines.

Quality gains are more pronounced at fewer inference steps, with FID improvements ranging from 17% to 54%.

AdaGen delivers 1.6× to 3.6× faster inference, while the policy network adds only 0.07%–0.40% extra compute.

Figures in the paper illustrate the quality‑efficiency frontier, showing that AdaGen pushes both dimensions forward for diffusion, autoregressive, mask‑based, and flow models.

Conclusion

AdaGen transforms generation‑strategy design from a handcrafted art into a data‑driven optimization problem. By unifying the scheduling problem as an MDP, employing adversarial reward modeling, and introducing action smoothing, AdaGen delivers substantial quality and speed improvements with minimal overhead, highlighting the importance of adaptive scheduling in modern image synthesis.

Tags: Image Generation, Reinforcement Learning, MDP, Action Smoothing, Adaptive Policy, Adversarial Reward