Can AI Generate the Next Step in a Video? Inside the VANS Model
Researchers from Kuaishou and City University of Hong Kong introduce VANS, a Video‑as‑Answer system that predicts and visualizes the next event in a video by jointly optimizing a visual language model and a video diffusion model, enabling personalized step‑by‑step guidance and future‑scenario generation.
VANS (Video‑as‑Answer) is a research framework that answers visual questions with a short video segment rather than text. Given an input video and a natural‑language query (e.g., “what should happen next?”), VANS generates a coherent video depicting the predicted future event.
Task definition
The authors introduce Video Next Event Prediction (VNEP): the model must understand the visual context, reason about causal or logical continuations, and synthesize a video that is both semantically correct and visually realistic.
Model architecture
VANS consists of two modules:
Visual Language Model (VLM): encodes the input video, processes the query, and produces a concise textual title describing the next event.
Video Diffusion Model (VDM): conditioned on low‑level visual features of the original video and the VLM‑generated title, it synthesizes the answer video.
Joint‑GRPO (Joint Group‑Relative Policy Optimization) trains the VLM and VDM together via reinforcement learning, so the two modules co‑adapt rather than being chained as an independently trained pipeline. The dataflow is sketched below.
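To make the two‑module design concrete, here is a minimal Python sketch of the caption‑then‑synthesize dataflow. All class and method names (`VLM`, `VDM`, `predict_next_event`, `generate`, `answer_with_video`) are hypothetical stand‑ins for illustration, not the interfaces of the released repository.

```python
class VLM:
    """Stand-in for the visual language model."""
    def predict_next_event(self, video_frames, query: str) -> str:
        # Encode the context video and the query, then decode a concise
        # caption describing the predicted next event.
        return "sprinkle grated cheese over the pasta"  # placeholder output

class VDM:
    """Stand-in for the video diffusion model."""
    def generate(self, context_frames, caption: str):
        # Condition on low-level visual features of the input video and on
        # the VLM caption, then denoise a short answer clip.
        return None  # placeholder clip

def answer_with_video(vlm: VLM, vdm: VDM, video_frames, query: str):
    caption = vlm.predict_next_event(video_frames, query)  # reasoning step
    clip = vdm.generate(video_frames, caption)             # synthesis step
    return caption, clip
```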
Two‑stage joint optimization
Stage 1 – Visual‑friendly VLM: the VDM is frozen. For each generated title, the VDM creates a video and a composite reward is computed:
Textual reward: semantic similarity between generated and ground‑truth titles.
Video reward: visual similarity (e.g., CLIP‑T) between generated and ground‑truth videos.
This reward signal, applied through policy‑gradient updates, pushes the VLM to produce titles that are both accurate and easy for the VDM to visualize, as in the sketch below.
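A minimal sketch of the Stage‑1 composite reward, under the assumption that both terms are cosine similarities in shared embedding spaces and are mixed by a scalar weight `w_text`; the paper's exact reward functions and weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def stage1_reward(title_emb, gt_title_emb, video_emb, gt_video_emb,
                  w_text: float = 0.5) -> torch.Tensor:
    # Textual reward: how close the generated title is to the ground truth.
    r_text = F.cosine_similarity(title_emb, gt_title_emb, dim=-1)
    # Video reward: how close the VDM's rendering of that title is to the
    # ground-truth clip.
    r_video = F.cosine_similarity(video_emb, gt_video_emb, dim=-1)
    # Illustrative mixing weight; the paper's weighting may differ.
    return w_text * r_text + (1.0 - w_text) * r_video
```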
Stage 2 – Precise VDM: the optimized VLM is frozen as an anchor. The VDM is trained with a composite reward:
Video quality reward (low FVD, high perceptual fidelity).
Semantic alignment reward ensuring the video matches the anchor title.
This discourages the VDM from copying the input or generating unrelated content, and aligns the visual output tightly with the VLM's reasoning; a sketch follows.
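One way to write such a reward is sketched below. The weights, the use of similarity to the ground‑truth clip as a stand‑in for the paper's perceptual‑quality term (FVD is a distributional metric computed at evaluation time), and the explicit copy penalty are all illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def stage2_reward(gen_video_emb, gt_video_emb, anchor_title_emb,
                  input_video_emb, w_quality: float = 0.5,
                  w_copy: float = 0.2) -> torch.Tensor:
    # Quality term: fidelity to the ground-truth clip (illustrative proxy
    # for the paper's perceptual-quality reward).
    r_quality = F.cosine_similarity(gen_video_emb, gt_video_emb, dim=-1)
    # Alignment term: agreement with the frozen VLM's anchor title.
    r_align = F.cosine_similarity(gen_video_emb, anchor_title_emb, dim=-1)
    # Copy penalty (our illustrative addition): discourage clips that
    # merely replay the input context.
    r_copy = F.cosine_similarity(gen_video_emb, input_video_emb, dim=-1)
    return w_quality * r_quality + (1.0 - w_quality) * r_align - w_copy * r_copy
```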
Experimental evaluation
Quantitative results on three core metrics show VANS outperforms strong baselines (e.g., Omni‑Video):
ROUGE‑L (textual event prediction) – nearly three‑fold improvement.
CLIP‑T (semantic alignment of the generated video with the target text) – substantial gain; a sketch of this metric follows the list.
FVD (visual quality) – lowest (best) score, indicating realistic and temporally smooth videos.
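For concreteness, here is one common way to compute a CLIP‑T score with Hugging Face's transformers: average frame‑text cosine similarity over sampled frames. The frame sampling and CLIP checkpoint are assumptions and may differ from the paper's evaluation protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_t_score(frames, text: str) -> float:
    """Average CLIP image-text cosine similarity over a list of PIL frames."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[text], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize projected embeddings, then average frame-text similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```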
Qualitative analysis on a cooking example demonstrates that baseline models either predict the wrong action or visualize it incorrectly, while VANS correctly infers “sprinkle grated cheese” and generates a video showing a hand scattering cheese particles.
Resources
Project page: https://video-as-answer.github.io/
GitHub repository: https://github.com/KlingTeam/VANS
arXiv paper: https://arxiv.org/abs/2511.16669
Code example
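Since Joint‑GRPO builds on group‑relative policy optimization, here is a minimal, self‑contained sketch of that update. Everything below is illustrative: `grpo_step` and the placeholder rollout data are assumptions, and the plain REINFORCE‑style surrogate omits the clipped probability ratio and KL penalty that full GRPO uses.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    # GRPO's core trick: score each rollout relative to its own sampled
    # group, which removes the need for a learned value critic.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_step(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # log_probs: summed token log-probabilities of G sampled titles;
    # rewards:   composite rewards for the same G samples (see Stage 1).
    adv = group_relative_advantages(rewards).detach()
    return -(adv * log_probs).mean()  # REINFORCE-style surrogate; full GRPO
                                      # adds a clipped ratio and a KL penalty

# Usage sketch: for one context, sample G candidate titles, score them,
# then take a gradient step on the surrogate loss.
G = 8
log_probs = torch.randn(G, requires_grad=True)  # placeholder rollout data
rewards = torch.rand(G)
loss = grpo_step(log_probs, rewards)
loss.backward()
```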