WorldCache Boosts Video World Model Inference Up to 3.7× with Near‑Lossless Quality
WorldCache splits the tokens of a diffusion world model into those that can be cached and those that must be recomputed, using curvature‑based classification and a chaotic‑prioritized adaptive skipping schedule. It reaches up to 3.7× speedup across HunyuanVoyager‑13B and Aether‑5B with negligible extra memory and no retraining, while preserving visual quality.
Diffusion world models are hard to accelerate because they generate multimodal outputs (RGB, depth, camera trajectory) and tokens evolve at heterogeneous rates; treating all tokens and timesteps uniformly either wastes computation on easy tokens or accumulates error on difficult ones.
WorldCache addresses this by first estimating each token’s trajectory curvature from the three most recent full forward passes, converting the token’s speed and acceleration into a single curvature score. Tokens are then grouped into Stable (low curvature), Linear (moderate curvature), and Chaotic (high curvature), and each group receives its own caching rule: direct reuse, linear extrapolation, or a Hermite‑weighted damped update, respectively.
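The classification step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the score formula (acceleration norm divided by speed norm), the thresholds `tau_low` and `tau_high`, the `damping` factor, and all function names are assumptions made for this example. In particular, the Chaotic branch uses a plain damped extrapolation as a stand-in, because this summary does not specify the Hermite weights.

```python
import math

def norm(v):
    """Euclidean norm of a feature vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def curvature_score(f0, f1, f2, eps=1e-8):
    """Score a token from its features at the three most recent full passes.

    f0 is the oldest pass, f2 the newest. Returns (score, latest velocity).
    The ratio used here is an assumed definition, not the paper's formula.
    """
    v_prev = [b - a for a, b in zip(f0, f1)]         # finite-difference speed
    v_curr = [b - a for a, b in zip(f1, f2)]
    accel = [c - p for p, c in zip(v_prev, v_curr)]  # finite-difference acceleration
    return norm(accel) / (norm(v_curr) + eps), v_curr

def classify(score, tau_low=0.1, tau_high=1.0):
    """Map a score to a group; both thresholds are hypothetical."""
    if score < tau_low:
        return "stable"
    if score < tau_high:
        return "linear"
    return "chaotic"

def approximate_next(f0, f1, f2, damping=0.5):
    """Approximate the token's next features without a full forward pass."""
    score, v = curvature_score(f0, f1, f2)
    label = classify(score)
    if label == "stable":
        return label, list(f2)                            # direct reuse
    if label == "linear":
        return label, [x + dx for x, dx in zip(f2, v)]    # linear extrapolation
    # Stand-in for the Hermite-weighted damped update: take a shortened step.
    return label, [x + damping * dx for x, dx in zip(f2, v)]
```

For example, `approximate_next([0.0, 0.0], [1.0, 0.0], [2.0, 0.5])` lands in the Linear group and extrapolates one step along the latest velocity, while a token that reverses direction between passes falls into the Chaotic group and takes only a damped step.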
The second component, Chaotic‑prioritized Adaptive Skipping, monitors only the Chaotic tokens. By normalising curvature‑based feature differences into a dimensionless drift metric, the system triggers a full recomputation precisely when a critical token begins to diverge, and avoids unnecessary full passes during stable periods.
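A drift check of this kind might look like the sketch below. It assumes the drift of a token is the norm of the difference between its cached approximation and its features at the last full pass, divided by the norm of those reference features, and that the worst Chaotic token decides. The function names and the `threshold` value are hypothetical; the summary does not give the exact metric.

```python
import math

def relative_drift(cached, reference, eps=1e-8):
    """Dimensionless drift of one token: ||cached - reference|| / ||reference||."""
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(cached, reference)))
    scale = math.sqrt(sum(r * r for r in reference))
    return diff / (scale + eps)

def should_recompute(chaotic_cached, chaotic_reference, threshold=0.15):
    """Decide whether the next step needs a full forward pass.

    Only Chaotic tokens are inspected. A single token whose drift exceeds
    the (hypothetical) threshold is enough to trigger recomputation.
    """
    if not chaotic_cached:
        return False  # nothing to monitor, keep skipping
    worst = max(
        relative_drift(c, r)
        for c, r in zip(chaotic_cached, chaotic_reference)
    )
    return worst > threshold
```

Because the metric is a ratio of norms, the same threshold can be applied to tokens whose features differ widely in magnitude, which matters when RGB, depth, and camera tokens share one schedule.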
Experiments on the image‑to‑world task with HunyuanVoyager‑13B show end‑to‑end latency dropping from 1053.7 s to 288.6 s (3.65× faster), while Dynamic WorldScore stays at 45.43 (baseline 46.40) with PSNR 23.49 and LPIPS 0.176; memory usage is 50.58 GB versus 50.44 GB for the baseline. On Aether‑5B, latency falls from 180.5 s to 107.2 s (1.68×) with Dynamic WorldScore 44.72, PSNR 31.87, SSIM 0.924, LPIPS 0.066, and memory at 46.59 GB. In a 3D reconstruction setting, latency drops from 55.42 s to 21.20 s (2.61×) while preserving Abs Rel 0.341 and RPE trans 0.068 and achieving the lowest rotation error, 0.796.
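The speedup factors quoted above follow directly from the latency pairs, and the memory overhead from the two HunyuanVoyager‑13B figures. The short check below uses only numbers reported in this article; the dictionary keys and variable names are labels chosen for this example.

```python
# (baseline latency in s, WorldCache latency in s), as reported above
latencies = {
    "HunyuanVoyager-13B": (1053.7, 288.6),
    "Aether-5B": (180.5, 107.2),
    "3D reconstruction": (55.42, 21.20),
}

# Speedup is baseline latency divided by accelerated latency.
speedups = {
    name: round(base / fast, 2) for name, (base, fast) in latencies.items()
}

# Memory overhead on HunyuanVoyager-13B, in GB and as a fraction of baseline.
extra_gb = round(50.58 - 50.44, 2)
extra_fraction = extra_gb / 50.44

for name, s in speedups.items():
    print(f"{name}: {s}x")
print(f"extra memory: {extra_gb} GB ({extra_fraction:.2%} of baseline)")
```

The memory increase on HunyuanVoyager‑13B works out to 0.14 GB, well under one percent of the baseline footprint.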
Thus WorldCache demonstrates that respecting the intrinsic multimodal coupling, spatial variance, and non‑uniform temporal dynamics of world models enables substantial inference acceleration without additional training or memory overhead, opening a path toward more interactive and longer‑horizon simulation applications.
