AnyFlow: Generate High‑Quality Video in 4 Steps with Unlimited Sampling Improvement

AnyFlow introduces a flow‑map distillation framework that lets video diffusion models produce high‑quality results in just four steps and keep improving as more sampling steps are added. It supports both causal and bidirectional architectures of up to 14B parameters, and the distilled model can still be fine‑tuned on downstream data.

AIWalker

1. Background: Fast and Scalable Video Generation

Video diffusion models can generate high‑quality clips but typically require many sampling steps, which makes inference expensive. Existing few‑step methods use consistency distillation to produce results in four steps, but they are optimized for a fixed step count: adding steps does not guarantee better quality and can even degrade it. That makes it hard for users to switch between quick previews and high‑quality final outputs.

AnyFlow addresses this limitation by asking whether a single model can generate good results in four steps and continue to improve when run for 16, 32, or more steps.

AnyFlow test-time scaling: quality improves with more sampling steps

2. Method: Core Idea, Forward Training, and Backward Trajectory Decomposition

Core Idea: From Endpoint Mapping to Arbitrary‑Time Transitions

Traditional consistency distillation learns a direct mapping from an intermediate latent z_t to the final latent z_0. This works for few‑step generation but, when applied to a flow‑matching pretrained model, it alters the original sampling trajectory and weakens multi‑step scalability, as shown by the performance drop of rCM and Self‑Forcing when more steps are used.

AnyFlow replaces endpoint‑only learning with Flow Map Distillation: the model learns to map between any two time points, that is, from z_t to z_r. As a result, it can take large jumps in few‑step mode and make fine‑grained refinements when more steps are allocated, so training optimizes the entire sampling trajectory rather than a single step count.
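To make the two‑time interface concrete, here is a minimal sketch of an any‑step sampler built around a map from (z_t, t, r) to z_r. Everything in it is an illustrative assumption rather than the paper's model: `toy_flow_map` is a hypothetical stand‑in whose error grows with the jump size, the underlying ODE dz/dt = z is a toy with a known solution, and time is assumed to run from 1.0 (noise) to 0.0 (data).

```python
import numpy as np

def toy_flow_map(z, t, r):
    # Hypothetical stand-in for a learned flow map: moves state z from time t
    # to time r. It is a first-order approximation of the toy ODE dz/dt = z,
    # so large jumps are less accurate than small ones.
    return z + (r - t) * z

def sample(flow_map, z_start, num_steps, t_start=1.0, t_end=0.0):
    # The same map is reused for any step count; only the time grid changes.
    times = np.linspace(t_start, t_end, num_steps + 1)
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map(z, t, r)
    return z

# Exact endpoint of the toy ODE when starting from z = 1.0 at t = 1.0.
exact = np.exp(-1.0)
for n in (4, 16, 32):
    z0 = sample(toy_flow_map, 1.0, n)
    print(f"steps={n:2d}  abs error={abs(z0 - exact):.4f}")
```

In this picture, an endpoint‑only consistency model corresponds to always calling the map with r fixed at 0.0, whereas the two‑time signature lets one network serve 4, 16, or 32 steps.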

Comparison of consistency distillation and flow‑map distillation

Forward Training: Providing Any‑Step Initialization

AnyFlow first performs forward flow‑map training, converting a pretrained video diffusion model into a flow‑map model that learns transitions between arbitrary time pairs. This supplies a stable initialization for any‑step sampling.
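One way to picture "learning transitions between arbitrary time pairs" is a regression toward the teacher's own transition between two sampled times. The sketch below is a toy under stated assumptions, not the paper's objective (the article does not give the loss): `teacher_velocity`, `student_flow_map`, the one‑parameter student, and the sampling of (t, r) are all hypothetical.

```python
import numpy as np

def teacher_velocity(z, t):
    # Hypothetical teacher: velocity field of the toy ODE dz/dt = z.
    return z

def teacher_transition(z_t, t, r, substeps=64):
    # Integrate the teacher from time t to time r with many small Euler steps.
    times = np.linspace(t, r, substeps + 1)  # shape: (substeps + 1, batch)
    z = z_t
    for a, b in zip(times[:-1], times[1:]):
        z = z + (b - a) * teacher_velocity(z, a)
    return z

def student_flow_map(z_t, t, r, theta):
    # Hypothetical one-parameter student: z_r = z_t * exp(theta * (r - t)).
    return z_t * np.exp(theta * (r - t))

def forward_loss(theta, rng, batch=256):
    # Sample states and arbitrary time pairs with 0 <= r <= t <= 1.
    z_t = rng.normal(size=batch)
    t = rng.uniform(0.0, 1.0, size=batch)
    r = rng.uniform(0.0, 1.0, size=batch) * t
    target = teacher_transition(z_t, t, r)
    pred = student_flow_map(z_t, t, r, theta)
    return float(np.mean((pred - target) ** 2))

# theta = 1.0 reproduces the toy teacher's flow; theta = 0.0 ignores it.
print(forward_loss(1.0, np.random.default_rng(0)))
print(forward_loss(0.0, np.random.default_rng(0)))
```

Note that the states z_t here are drawn independently of the student, which mirrors the limitation discussed next: the student is only ever supervised on states it did not produce itself.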

However, the paper notes that forward training alone cannot fully close the train‑test gap. During inference, the model rolls out from its own generated states, whereas forward training only learns local mappings along the teacher’s trajectory. This mismatch shows up as discretization error in few‑step sampling and as exposure bias in causal generation.

Therefore, AnyFlow adds On‑Policy Distillation (OPD) to correct the model on its own rollout trajectory.

Qualitative ablation of on‑policy distillation

Backward Trajectory Decomposition: Flow‑Map Backward Simulation

During OPD the model must generate its own sampling states, but full rollout is computationally expensive. AnyFlow leverages the compositional property of flow maps to decompose a long Euler trajectory into shortcut transitions, e.g., z_T → z_t → z_r → z_0. This yields two benefits: (1) test‑time inference can reuse the original Euler trajectory without extra consistency sampling, and (2) the decomposition adapts to different step sizes, reducing the cost of multi‑step rollout.
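The compositional property can be illustrated with a small sketch: chaining a few shortcut jumps through waypoints on a fine time grid reaches the same endpoint as one direct jump when the flow map is exact. The names `exact_flow_map`, `shortcut_schedule`, and `rollout` are hypothetical, the ODE dz/dt = z is a toy, and choosing waypoints at random is an assumption; the article does not say how the paper selects them.

```python
import numpy as np

def exact_flow_map(z, t, r):
    # Exact flow map of the toy ODE dz/dt = z (jump from time t to time r).
    return z * np.exp(r - t)

def shortcut_schedule(num_fine_steps, num_waypoints, rng):
    # Fine Euler grid from t = 1.0 (noise) down to t = 0.0 (data).
    grid = np.linspace(1.0, 0.0, num_fine_steps + 1)
    # Pick interior grid points as waypoints (random choice is an assumption).
    interior = rng.choice(np.arange(1, num_fine_steps),
                          size=num_waypoints, replace=False)
    idx = np.concatenate(([0], np.sort(interior), [num_fine_steps]))
    return grid[idx]

def rollout(flow_map, z_start, times):
    # Chain shortcut transitions, e.g. z_T -> z_t -> z_r -> z_0.
    states = [z_start]
    for t, r in zip(times[:-1], times[1:]):
        states.append(flow_map(states[-1], t, r))
    return states

rng = np.random.default_rng(0)
times = shortcut_schedule(num_fine_steps=50, num_waypoints=2, rng=rng)
states = rollout(exact_flow_map, 1.0, times)
direct = exact_flow_map(1.0, 1.0, 0.0)
print(times, states[-1], direct)
```

With two interior waypoints the rollout costs three map evaluations instead of the 50 a full pass over the fine grid would take, while the intermediate states still lie on grid points of the original Euler schedule.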

Comparison of backward simulation paradigms

3. Experiments: Scaling from 1.3B to 14B Parameters

The authors evaluate AnyFlow on both bidirectional and causal video diffusion backbones, covering model sizes from 1.3B to 14B parameters, which shows that the approach scales to large models.

AnyFlow text‑to‑video VBench results

Causal Generation: AnyFlow‑FAR‑Wan2.1‑14B

Combined with the FAR backbone, AnyFlow‑FAR generates high‑quality text‑to‑video (T2V) results with only 4 NFEs, and quality continues to rise as more sampling steps are added. Visual comparisons show superior motion stability, subject clarity, and detail consistency over baseline few‑step methods, especially in challenging scenarios such as vehicle motion and running.

Causal 14B video generation comparison

For image‑to‑video (I2V), AnyFlow‑FAR‑Wan2.1‑14B achieves a VBench‑I2V score of 87.87 with 4 NFEs, comparable to the 14B baseline that uses 50 × 2 NFEs. This indicates strong first‑frame consistency and overall video quality even under extreme step reduction.
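The size of that step reduction is easy to work out. The short calculation below reads "50 × 2" as 50 sampling steps with two network evaluations per step, which is an assumption (the article does not say what the factor of 2 stands for), and it treats every function evaluation as equally expensive.

```python
# NFE = number of function (network) evaluations per generated video.
baseline_steps = 50
evals_per_step = 2  # assumption: two network evaluations per sampling step
baseline_nfe = baseline_steps * evals_per_step

anyflow_nfe = 4  # figure quoted in the article

reduction = baseline_nfe / anyflow_nfe
print(f"baseline NFEs: {baseline_nfe}, AnyFlow NFEs: {anyflow_nfe}, "
      f"reduction: {reduction:.0f}x")
```

Under these assumptions the distilled model uses 25 times fewer network evaluations than the baseline; actual wall‑clock savings depend on per‑evaluation cost, which the article does not report.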

I2V VBench results

Bidirectional Generation: AnyFlow‑Wan2.1‑T2V‑14B

Applied to the bidirectional Wan2.1‑T2V backbone, AnyFlow‑Wan2.1‑T2V‑14B maintains high visual quality and natural motion under few‑step sampling and outperforms the rCM baseline in visual detail stability.

Bidirectional 14B video generation comparison

4. Fine‑Tuning on Downstream Data

Because the flow map retains multi‑granularity flow fields, the distilled AnyFlow model can be further fine‑tuned on domain‑specific video datasets while preserving its few‑step capability. This is valuable for vertical applications such as robotics, autonomous driving, or game scenes where identity, trajectory, or style consistency must be maintained.

Fine‑tuning AnyFlow on downstream data

Summary

AnyFlow proposes a new distillation framework for “any‑step” video generation. By learning full‑trajectory flow maps and correcting rollout errors through on‑policy flow‑map distillation, it achieves fast four‑step generation that continues to improve with more steps, works for both causal and bidirectional diffusion backbones, and scales up to 14B parameters.
