Can AI Generate the Next Step in a Video? Inside the VANS Model
Researchers from Kuaishou and City University of Hong Kong introduce VANS, a Video‑as‑Answer system that predicts and visualizes the next event in a video by jointly optimizing a visual language model and a video diffusion model, enabling personalized step‑by‑step guidance and future‑scenario generation.
VANS (Video‑as‑Answer) is a research framework that answers visual questions with a short video segment rather than text. Given an input video and a natural‑language query (e.g., “what should happen next?”), VANS generates a coherent video depicting the predicted future event.
Task definition
The authors introduce Video Next Event Prediction (VNEP): the model must understand the visual context, reason about causal or logical continuations, and synthesize a video that is both semantically correct and visually realistic.
Model architecture
VANS consists of two modules:
Visual Language Model (VLM): encodes the input video, processes the query, and produces a concise textual title describing the next event.
Video Diffusion Model (VDM): conditioned on low‑level visual features of the original video and the VLM‑generated title, it synthesizes the answer video.
Joint‑GRPO (Joint Group‑Relative Policy Optimization) trains the VLM and VDM together via reinforcement learning, so the two modules co‑adapt rather than being chained as an independently trained pipeline. The dataflow is sketched below.
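To make the two‑module design concrete, here is a minimal Python sketch of the caption‑then‑synthesize dataflow. All class and method names (`VLM`, `VDM`, `predict_next_event`, `generate`, `answer_with_video`) are hypothetical stand‑ins for illustration, not the interfaces of the released repository.

```python
class VLM:
    """Stand-in for the visual language model."""
    def predict_next_event(self, video_frames, query: str) -> str:
        # Encode the context video and the query, then decode a concise
        # caption describing the predicted next event.
        return "sprinkle grated cheese over the pasta"  # placeholder output

class VDM:
    """Stand-in for the video diffusion model."""
    def generate(self, context_frames, caption: str):
        # Condition on low-level visual features of the input video and on
        # the VLM caption, then denoise a short answer clip.
        return None  # placeholder clip

def answer_with_video(vlm: VLM, vdm: VDM, video_frames, query: str):
    caption = vlm.predict_next_event(video_frames, query)  # reasoning step
    clip = vdm.generate(video_frames, caption)             # synthesis step
    return caption, clip
```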
Two‑stage joint optimization
Stage 1 – Visual‑friendly VLM: the VDM is frozen. For each generated title, the VDM creates a video and a composite reward is computed:
Textual reward: semantic similarity between generated and ground‑truth titles.
Video reward: visual similarity (e.g., CLIP‑T) between generated and ground‑truth videos.
This reward signal, applied through policy‑gradient updates, pushes the VLM to produce titles that are both accurate and easy for the VDM to visualize, as in the sketch below.
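A minimal sketch of the Stage‑1 composite reward, under the assumption that both terms are cosine similarities in shared embedding spaces and are mixed by a scalar weight `w_text`; the paper's exact reward functions and weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def stage1_reward(title_emb, gt_title_emb, video_emb, gt_video_emb,
                  w_text: float = 0.5) -> torch.Tensor:
    # Textual reward: how close the generated title is to the ground truth.
    r_text = F.cosine_similarity(title_emb, gt_title_emb, dim=-1)
    # Video reward: how close the VDM's rendering of that title is to the
    # ground-truth clip.
    r_video = F.cosine_similarity(video_emb, gt_video_emb, dim=-1)
    # Illustrative mixing weight; the paper's weighting may differ.
    return w_text * r_text + (1.0 - w_text) * r_video
```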
Stage 2 – Precise VDM: the optimized VLM is frozen as an anchor. The VDM is trained with a composite reward:
Video quality reward (low FVD, high perceptual fidelity).
Semantic alignment reward ensuring the video matches the anchor title.
This discourages the VDM from copying the input or generating unrelated content, and aligns the visual output tightly with the VLM's reasoning; a sketch follows.
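One way to write such a reward is sketched below. The weights, the use of similarity to the ground‑truth clip as a stand‑in for the paper's perceptual‑quality term (FVD is a distributional metric computed at evaluation time), and the explicit copy penalty are all illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def stage2_reward(gen_video_emb, gt_video_emb, anchor_title_emb,
                  input_video_emb, w_quality: float = 0.5,
                  w_copy: float = 0.2) -> torch.Tensor:
    # Quality term: fidelity to the ground-truth clip (illustrative proxy
    # for the paper's perceptual-quality reward).
    r_quality = F.cosine_similarity(gen_video_emb, gt_video_emb, dim=-1)
    # Alignment term: agreement with the frozen VLM's anchor title.
    r_align = F.cosine_similarity(gen_video_emb, anchor_title_emb, dim=-1)
    # Copy penalty (our illustrative addition): discourage clips that
    # merely replay the input context.
    r_copy = F.cosine_similarity(gen_video_emb, input_video_emb, dim=-1)
    return w_quality * r_quality + (1.0 - w_quality) * r_align - w_copy * r_copy
```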
Experimental evaluation
Quantitative results on three core metrics show VANS outperforms strong baselines (e.g., Omni‑Video):
ROUGE‑L (textual event prediction) – nearly three‑fold improvement.
CLIP‑T (semantic alignment of the generated video with the target text) – substantial gain; a sketch of this metric follows the list.
FVD (visual quality) – lowest (best) score, indicating realistic and temporally smooth videos.
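For concreteness, here is one common way to compute a CLIP‑T score with Hugging Face's transformers: average frame‑text cosine similarity over sampled frames. The frame sampling and CLIP checkpoint are assumptions and may differ from the paper's evaluation protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_t_score(frames, text: str) -> float:
    """Average CLIP image-text cosine similarity over a list of PIL frames."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[text], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize projected embeddings, then average frame-text similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```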
Qualitative analysis on a cooking example demonstrates that baseline models either predict the wrong action or visualize it incorrectly, while VANS correctly infers “sprinkle grated cheese” and generates a video showing a hand scattering cheese particles.
Resources
Project page: https://video-as-answer.github.io/
GitHub repository: https://github.com/KlingTeam/VANS
arXiv paper: https://arxiv.org/abs/2511.16669
Code example
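Since Joint‑GRPO builds on group‑relative policy optimization, here is a minimal, self‑contained sketch of that update. Everything below is illustrative: `grpo_step` and the placeholder rollout data are assumptions, and the plain REINFORCE‑style surrogate omits the clipped probability ratio and KL penalty that full GRPO uses.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    # GRPO's core trick: score each rollout relative to its own sampled
    # group, which removes the need for a learned value critic.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_step(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # log_probs: summed token log-probabilities of G sampled titles;
    # rewards:   composite rewards for the same G samples (see Stage 1).
    adv = group_relative_advantages(rewards).detach()
    return -(adv * log_probs).mean()  # REINFORCE-style surrogate; full GRPO
                                      # adds a clipped ratio and a KL penalty

# Usage sketch: for one context, sample G candidate titles, score them,
# then take a gradient step on the surrogate loss.
G = 8
log_probs = torch.randn(G, requires_grad=True)  # placeholder rollout data
rewards = torch.rand(G)
loss = grpo_step(log_probs, rewards)
loss.backward()
```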