AnyFlow: Generate High‑Quality Video in 4 Steps and Keep Improving with More Sampling

AnyFlow introduces a flow‑map distillation framework that lets video diffusion models produce high‑quality results in just four sampling steps while still gaining quality as the step count increases. It supports both causal and bidirectional architectures and scales up to 14 B parameters.

AIWalker

1. Background: Video generation needs speed and quality on demand

Video diffusion models can generate high‑quality video but usually require many sampling steps, which makes inference costly. Existing few‑step methods use consistency distillation to produce results within four steps. However, they are optimized for a fixed, small step count: adding steps at test time does not guarantee better quality and may even degrade it, so users cannot freely switch between a fast preview and a high‑quality final output.

AnyFlow targets this limitation. It asks whether a single model can generate good results in four steps and keep improving when run for 16 or 32 steps.

Figure 1: AnyFlow’s test‑time scaling compared with Self‑Forcing and rCM; AnyFlow maintains high quality at few steps and continues to improve as steps increase.

2. Method: Core idea, forward training and backward trajectory decomposition

Core idea: From “endpoint mapping” to “any‑time transition”

Traditional consistency distillation learns a mapping from an intermediate latent z_t directly to the final latent z_0. This works for few‑step generation, but when applied to a model pretrained with flow matching it alters the original sampling trajectory, which weakens multi‑step scalability. As shown in Figure 1, methods such as rCM and Self‑Forcing lose performance as the number of steps grows.

AnyFlow replaces this with Flow Map Distillation: the model learns transitions between any two time points, i.e., from z_t to z_r. Consequently the model can make large jumps for few‑step sampling and fine‑grained refinements for more steps, optimizing the entire sampling trajectory rather than a single fixed step count.
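
To make the contrast concrete, here is a minimal Python sketch on a toy 1‑D problem where the flow map is known in closed form. Everything in it is an illustrative assumption rather than AnyFlow's code: the Gaussian data distribution (MU, S), the linear noise path z_t = (1 - t) x + t eps, and the names flow_map and sample. A consistency model corresponds to the special case r = 0, while the flow map accepts any r below t, so the same function can serve 1, 4, or 32 steps.

```python
import math

# Hypothetical toy setup (not from the paper): data ~ N(MU, S^2), noise ~ N(0, 1),
# linear path z_t = (1 - t) * x + t * eps, so z_t ~ N(mean(t), std(t)^2).
MU, S = 2.0, 0.5

def mean(t):
    return (1.0 - t) * MU

def std(t):
    return math.sqrt(((1.0 - t) * S) ** 2 + t ** 2)

def flow_map(z_t, t, r):
    """Exact transition from time t to time r for this toy Gaussian path."""
    return mean(r) + std(r) / std(t) * (z_t - mean(t))

def sample(z_T, num_steps):
    """Walk from t = 1 (noise) to t = 0 (data) in num_steps flow-map jumps."""
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    z = z_T
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map(z, t, r)
    return z

if __name__ == "__main__":
    for n in (1, 4, 32):
        print(n, sample(0.3, n))
```

Because this toy flow map is exact, every step count lands on the same endpoint. A learned flow map carries some error per jump, and shorter jumps are generally easier to approximate, which is one intuition for why extra steps can keep helping.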

Figure 2: AnyFlow shifts the distillation target from endpoint consistency mapping to learning flow‑map transitions between arbitrary time points, preserving a more complete sampling trajectory.

Forward training: Provides initialization for any‑step generation but is insufficient alone

AnyFlow first performs Forward Flow Map Training, converting a pretrained video diffusion model into a flow‑map model that offers a stable initialization for arbitrary‑step sampling.

The paper notes that forward training alone cannot fully solve test‑time issues. At inference the model repeatedly consumes its own previous states, whereas forward training only learns local mappings along the teacher’s trajectory. This train/test mismatch shows up as discretization error in few‑step sampling and as exposure bias in causal video generation.

Therefore AnyFlow incorporates On‑Policy Distillation (OPD) to correct the model on its own sampling trajectory.
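
The discretization error mentioned above can be illustrated on a toy ODE. The Python sketch below integrates a closed‑form velocity field with fixed‑size Euler steps and measures how far the endpoint lands from the exact answer. The Gaussian toy problem (MU, S) and the function names are assumptions made for illustration; this shows only the step‑size effect of a plain Euler sampler, not exposure bias and not the paper's training procedure.

```python
import math

# Hypothetical toy setup (not from the paper): data ~ N(MU, S^2), noise ~ N(0, 1),
# linear path z_t = (1 - t) * x + t * eps.
MU, S = 2.0, 0.5

def velocity(z, t):
    """Probability-flow velocity dz/dt for the toy Gaussian path."""
    var_t = ((1.0 - t) * S) ** 2 + t ** 2
    half_dvar = t - (1.0 - t) * S ** 2      # equals std(t) * d std(t)/dt
    return -MU + half_dvar / var_t * (z - (1.0 - t) * MU)

def euler_sample(z_T, num_steps):
    """Integrate from t = 1 down to t = 0 with fixed-size Euler steps."""
    z, dt = z_T, 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity(z, t)
    return z

def endpoint_error(z_T, num_steps):
    """Distance between the Euler endpoint and the exact endpoint of the toy ODE."""
    exact = MU + S * z_T
    return abs(euler_sample(z_T, num_steps) - exact)

if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, endpoint_error(0.3, n))
```

In this toy the error shrinks as the step count grows. A few‑step sampler has to absorb that gap inside each large jump.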

Figure 3: Forward Flow Map Training alone shows discretization error and exposure bias; adding On‑Policy Flow Map Distillation markedly reduces these test‑time errors.

Backward trajectory decomposition: Flow Map Backward Simulation to correct rollout

During OPD the model must generate its own sampling states, and performing a full rollout to obtain them is computationally expensive. AnyFlow leverages the compositional property of flow maps to decompose a full Euler trajectory into shortcut transitions, e.g., from z_T to z_t, then to z_r, and finally to z_0.

This design offers two benefits: (1) test‑time inference can reuse the original Euler trajectory without extra consistency sampling; (2) the decomposition adapts to different step sizes, reducing the cost of multi‑step generation. After Flow‑Map OPD training, test‑time error drops noticeably for both few‑step and autoregressive scenarios.
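
The compositional property behind this decomposition can be sketched in Python on a toy problem. The exact Gaussian flow map stands in for the network, and shortcut_segments, rollout, and the grid indices idx_t and idx_r are hypothetical names chosen for illustration. How AnyFlow actually picks t and r during training is not described here, so treat the indices as placeholders.

```python
import math

MU, S = 2.0, 0.5  # hypothetical toy data distribution N(MU, S^2)

def mean(t):
    return (1.0 - t) * MU

def std(t):
    return math.sqrt(((1.0 - t) * S) ** 2 + t ** 2)

def flow_map(z_t, t, r):
    """Exact toy transition from time t to time r (stands in for the network)."""
    return mean(r) + std(r) / std(t) * (z_t - mean(t))

def shortcut_segments(num_steps, idx_t, idx_r):
    """Split an Euler grid of num_steps steps into z_T -> z_t -> z_r -> z_0.

    idx_t and idx_r are grid indices with 0 < idx_t < idx_r < num_steps.
    """
    if not 0 < idx_t < idx_r < num_steps:
        raise ValueError("need 0 < idx_t < idx_r < num_steps")
    grid = [1.0 - i / num_steps for i in range(num_steps + 1)]
    stops = [grid[0], grid[idx_t], grid[idx_r], grid[-1]]
    return list(zip(stops[:-1], stops[1:]))

def rollout(z_T, segments):
    """Apply one flow-map call per segment and return every visited state."""
    states = [z_T]
    for t, r in segments:
        states.append(flow_map(states[-1], t, r))
    return states

if __name__ == "__main__":
    print(rollout(0.3, shortcut_segments(32, 8, 20)))
```

On a 32‑step grid, the three shortcut calls visit the same z_t, z_r, and z_0 that a step‑by‑step pass over the grid would reach with 32 calls, at least under this toy's exact flow map.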

Figure 4: Flow Map Backward Simulation splits long rollouts into shortcut segments, enabling more efficient simulation for different inference step counts compared with Consistency Backward Simulation.

3. Experiments: Scaling from 1.3 B to 14 B parameters

The paper evaluates AnyFlow on both bidirectional and causal video diffusion architectures, covering model sizes from 1.3 B to 14 B parameters, demonstrating that the approach scales beyond small models.

Causal video generation: AnyFlow‑FAR‑Wan2.1‑14B

AnyFlow combined with the FAR causal backbone (AnyFlow‑FAR) supports text‑to‑video (T2V), image‑to‑video (I2V) and video‑to‑video (V2V) generation within a single model.

AnyFlow‑FAR‑Wan2.1‑14B produces high‑quality T2V results with only 4 NFEs (number of function evaluations, i.e., network forward passes), and quality continues to rise when more sampling steps are used.

Qualitative comparisons show better motion stability, subject clarity, and detail consistency than several few‑step baselines. The gap is most visible in challenging scenes such as vehicle motion, running, and complex dynamics, where the baselines exhibit blur, flicker, or unnatural motion.

Figure 6: 14B causal T2V results. AnyFlow‑FAR‑Wan2.1‑14B uses 4 NFEs and outperforms LightX2V, FastVideo and Krea‑Realtime.

On the I2V task, AnyFlow‑FAR‑Wan2.1‑14B achieves 87.87 on VBench‑I2V with 4 NFEs, slightly above Wan2.1‑I2V‑14B, which uses 50 × 2 NFEs (87.71). This indicates strong first‑frame consistency and video quality even under extreme step reduction.
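
The compute gap behind this comparison is simple arithmetic, worked out in the short Python sketch below. It reads "50 × 2" as 50 sampling steps with two network evaluations per step; that reading (typical of classifier‑free guidance) is an assumption, since the text does not spell it out.

```python
# Figures quoted in the text above.
anyflow_nfe = 4
baseline_nfe = 50 * 2          # assumed: 50 steps x 2 evaluations per step
anyflow_score = 87.87          # VBench-I2V
baseline_score = 87.71         # VBench-I2V

nfe_reduction = baseline_nfe / anyflow_nfe
score_gap = round(anyflow_score - baseline_score, 2)

print(f"{nfe_reduction:.0f}x fewer evaluations, score gap {score_gap:+.2f}")
```

Under that reading, the distilled model uses 25 times fewer network evaluations while scoring 0.16 points higher.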

Figure 7: I2V VBench results. AnyFlow‑FAR‑Wan2.1‑14B reaches 87.87 with 4 NFEs, slightly surpassing the larger‑budget baseline.

Bidirectional video generation: AnyFlow‑Wan2.1‑T2V‑14B

Applied to the Wan2.1‑T2V backbone, AnyFlow‑Wan2.1‑T2V‑14B maintains good visual quality and natural motion under few‑step sampling, outperforming the rCM baseline in visual detail stability.

These results confirm that Flow Map Distillation does not depend on a specific video architecture; it works for both causal and bidirectional diffusion and scales reliably to 14 B parameters.

Figure 8: 14B bidirectional T2V results. AnyFlow‑Wan2.1‑T2V‑14B with 4 NFEs yields more natural and stable videos than rCM‑Wan2.1‑T2V‑14B.

4. Fine‑tuning on any‑step models

Because the flow map retains multi‑scale flow fields, the distilled AnyFlow model can be further fine‑tuned on downstream datasets while preserving its few‑step capability. This is valuable for domain‑specific video generation such as robotics, autonomous driving, or game scenes where identity, trajectory or style consistency is critical.

Figure 9: After fine‑tuning, the model shows more stable performance on specialized scenes such as robot subject preservation and pedestrian trajectory consistency.

Summary

AnyFlow proposes a new distillation framework for “any‑step” video generation. Instead of optimizing only a fixed few steps, it learns the full sampling trajectory via flow‑map distillation and corrects rollout errors through on‑policy flow‑map distillation. The result is a model that is fast (4 steps) yet continues to improve with more steps, works for both causal and bidirectional diffusion, and scales up to 14 B parameters.
