Proactive Interaction for Video Multimodal Models: MMDuet2 & ProactiveVideoQA

This article surveys the ICLR 2026 papers ProactiveVideoQA and MMDuet2, detailing how video multimodal large models can decide when to reply autonomously, the PAUC benchmark for evaluating timeliness and accuracy, a reinforcement‑learning training pipeline that requires no precise timestamps, and experimental findings on data construction, frame‑sampling density, and SOTA performance.

Machine Heart

This article brings together two papers from Peking University's Wangxuan Institute of Computer Technology, ProactiveVideoQA and MMDuet2, to explain how video multimodal large language models (MLLMs) can achieve proactive interaction: deciding autonomously when to respond during video playback instead of waiting for a user query.

Background: Why Proactive Interaction?

In typical multimodal assistants, users must repeatedly ask questions while cooking or performing other tasks. Proactive interaction aims to let the model observe the video and provide explanations without explicit prompts, which is crucial for scenarios such as live‑stream management, intelligent surveillance, and first‑person assistants.

ProactiveVideoQA: The First Proactive Interaction Benchmark

ProactiveVideoQA defines the problem and introduces the PAUC (Proactive Area Under Curve) metric, which jointly evaluates response timeliness and correctness. The benchmark contains four video categories (online videos, egocentric videos, TV series, surveillance footage) with 1,377 videos and 1,427 questions, each paired with one or more answers anchored to specific time segments.

Key features of the benchmark:

Multi‑turn open‑ended QA: Unlike most video QA datasets, which use multiple‑choice questions, ProactiveVideoQA requires models to generate fully open‑ended responses over multiple turns, mimicking real dialogue.

Diverse tasks and multimodal inputs: Text, video, and audio modalities are combined across the four task types.

PAUC metric: Plots a “time‑quality” curve and computes the area under it, rewarding early correct replies and penalising late or inaccurate ones (see the sketch below).
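The papers define PAUC formally; as a rough intuition, the sketch below computes the normalised area under a cumulative time‑quality curve, so the same correct reply earns more the earlier it arrives. This is a minimal, hypothetical simplification, not the benchmark's official implementation.

```python
# Hypothetical simplification of a PAUC-style score, NOT the benchmark's
# official implementation: trace cumulative reply quality over time and
# take the normalised area under that curve.

def pauc_sketch(replies, video_duration, max_quality=1.0):
    """replies: list of (timestamp_sec, quality) pairs, quality in [0, 1]."""
    area, prev_t, cum_quality = 0.0, 0.0, 0.0
    for t, q in sorted(replies):
        area += cum_quality * (t - prev_t)  # quality level held since last reply
        cum_quality += q
        prev_t = t
    area += cum_quality * (video_duration - prev_t)
    # Normalise by the ideal case: full quality from time zero onward.
    return area / (max_quality * video_duration)

# The same correct reply scores higher the earlier it arrives:
print(pauc_sketch([(10.0, 1.0)], 60.0))  # ~0.83
print(pauc_sketch([(50.0, 1.0)], 60.0))  # ~0.17
```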

MMDuet2: Reinforcement‑Learning Based Proactive Interaction Training

MMDuet2 achieves state‑of‑the‑art (SOTA) performance on the ProactiveVideoQA benchmark without requiring precise annotation of the optimal reply time. Its contributions are:

High‑quality training data: A 52k‑example video‑dialogue dataset split into two dialogue types, 1QnA (a single question at the start of the video) and nQnA (multiple questions posed at random times).

Training and inference compatibility: Built on the Qwen2.5‑VL model; each turn outputs either a textual reply or the token "NO REPLY", requiring no extra modules or hand‑tuned thresholds.

Multi‑turn RL training: Uses a GRPO‑based reinforcement‑learning algorithm with a composite reward whose main term is derived from PAUC, plus penalties for repeated replies, off‑topic replies, and prefix duplication; the total reward is a weighted sum of these four components (a sketch follows below).
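A minimal sketch of such a composite reward, with placeholder weights and penalty counts; the paper's actual reward terms and weights are not reproduced here.

```python
# Illustrative composite reward in the spirit of MMDuet2's description:
# a PAUC-derived term plus three penalties, combined as a weighted sum.
# Weights and penalty definitions here are placeholders, not the paper's.

def composite_reward(pauc_term: float,
                     n_repeated: int,
                     n_off_topic: int,
                     n_prefix_dup: int,
                     weights=(1.0, 0.3, 0.5, 0.2)) -> float:
    w_pauc, w_rep, w_off, w_dup = weights
    return (w_pauc * pauc_term
            - w_rep * n_repeated     # replies that restate an earlier answer
            - w_off * n_off_topic    # replies unrelated to the current question
            - w_dup * n_prefix_dup)  # replies sharing a long prefix with another

print(composite_reward(0.8, n_repeated=1, n_off_topic=0, n_prefix_dup=0))  # 0.5
```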

Construction of the Proactive Interaction Dataset

The dataset is built through three steps:

Scene segmentation and captioning: Videos are split into scenes, and each segment receives a detailed caption.

QA generation: An LLM generates questions and corresponding answers for each caption. If a segment contains nothing that answers the question, its answer is "NO REPLY".

Dialogue construction: Two dialogue formats are created (illustrated after this list):

1QnA: One question is posed at the start of the video, and the model replies within the associated segment.

nQnA: Questions are posed at multiple random times; the model must immediately answer from the current segment and keep answering in subsequent segments until the next question arrives.
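To make the two formats concrete, here is a hypothetical illustration of one training dialogue in each format; the field names, contents, and timings are illustrative, not the released dataset's schema.

```python
# Illustrative examples of the two dialogue formats; field names and
# contents are hypothetical, not the released dataset's schema.

dialogue_1qna = {
    "format": "1QnA",  # one question at video start
    "turns": [
        {"time": 0,  "role": "user",      "text": "What does the cook add to the pan?"},
        {"time": 2,  "role": "assistant", "text": "NO REPLY"},  # answer not yet visible
        {"time": 14, "role": "assistant", "text": "They add two cloves of minced garlic."},
    ],
}

dialogue_nqna = {
    "format": "nQnA",  # multiple questions at random times
    "turns": [
        {"time": 5,  "role": "user",      "text": "Narrate each step as it happens."},
        {"time": 9,  "role": "assistant", "text": "The cook dices an onion."},
        {"time": 21, "role": "assistant", "text": "The onion goes into a hot, oiled pan."},
        # A new question ends the answering span of the previous one:
        {"time": 30, "role": "user",      "text": "Which utensil are they using now?"},
    ],
}
```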

Chat Template for Proactive Interaction

The interaction follows a custom chat template:

A system prompt defines the proactive dialogue rules.

The user sends a message containing 1–2 video frames plus optional text.

The assistant may generate a textual reply or output "NO REPLY".

The loop repeats until all frames of the video are processed.

The timestamp of each turn is derived from the number of frames consumed so far multiplied by the frame interval (e.g., at 1 fps: the user message at second 2, the assistant reply at second 4, and so on).
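A minimal sketch of this streaming loop under those rules, assuming hypothetical `load_frames` and `model.generate_turn` helpers; the real MMDuet2 inference code and Qwen2.5‑VL API differ.

```python
# Minimal sketch of the proactive chat loop described above. `load_frames`
# and `model.generate_turn` are hypothetical placeholders, not the real
# MMDuet2 or Qwen2.5-VL API.

FRAME_INTERVAL_SEC = 1.0  # e.g., sampling at 1 fps
FRAMES_PER_TURN = 2       # each user message carries 1-2 frames

def proactive_loop(model, video_path, system_prompt, question=None):
    history = [{"role": "system", "content": system_prompt}]
    frames = load_frames(video_path, FRAME_INTERVAL_SEC)
    for i in range(0, len(frames), FRAMES_PER_TURN):
        chunk = frames[i:i + FRAMES_PER_TURN]
        # Timestamp = number of frames seen so far x frame interval.
        t = (i + len(chunk)) * FRAME_INTERVAL_SEC
        text = question if i == 0 else None  # optional text alongside the frames
        history.append({"role": "user", "frames": chunk, "text": text, "time": t})
        reply = model.generate_turn(history)  # free text or the literal "NO REPLY"
        history.append({"role": "assistant", "text": reply, "time": t})
        if reply != "NO REPLY":
            yield t, reply  # surface a proactive response to the user
```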

SFT and RL Two‑Stage Training

During Supervised Fine‑Tuning (SFT), MMDuet2 is initialized from Qwen2.5‑VL 3B and trained on the 52k proactive dialogues plus 25k offline video QA and 25k captioning examples to retain general video understanding. Training runs on eight H800 GPUs for eight hours, with reply times set to the end of each segment so that answers are grounded in fully observed content rather than hallucinated.

SFT alone suffers from two limitations: (1) the model learns to delay replies until the segment ends, and (2) the high frequency of "NO REPLY" in the data makes the model overly conservative during inference.

The RL stage addresses these issues by applying the GRPO algorithm with the composite PAUC‑based reward. Rollouts are sampled on 20–60 s video clips; each rollout is conditioned on the preceding video context and dialogue history, and the model interacts with two frames per step.

The RL stage processes 1,900 videos on eight H800 GPUs for 20 hours, achieving SOTA results on ProactiveVideoQA.
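For context, the core of GRPO is a group‑relative advantage: several rollouts are sampled for the same clip, each is scored with the reward function, and rewards are normalised within the group, so no learned value network is needed. A generic sketch, not MMDuet2's exact implementation:

```python
import torch

# Generic GRPO-style group-relative advantage (not MMDuet2's exact code):
# sample G rollouts for the same clip, score each with the composite
# reward, and normalise rewards within the group.

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    mean, std = rewards.mean(), rewards.std()
    # Rollouts better than the group average get positive advantage.
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([0.8, 0.5, 0.9, 0.2])  # e.g., PAUC-based composite rewards
print(group_relative_advantages(rewards))     # positive for 0.8 and 0.9
```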

Impact of Frame‑Sampling Density

The frame‑sampling interval critically affects the proactive‑interaction experience. Experiments show:

SFT stage: With a 1‑second interval the model collapses into always outputting "NO REPLY" because of data imbalance, so a 2‑second interval is used instead.

RL stage: Performance is relatively insensitive to interval changes.

Inference stage: Reducing the interval from 2 s to 1 s yields a significant boost because the model can detect the optimal reply moment roughly one second earlier, improving both PAUC and user experience (see the sketch below).
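The inference‑stage effect can be seen with simple arithmetic: a reply can only land on a frame boundary, so a larger interval quantises reply times more coarsely. A toy sketch with a hypothetical event time:

```python
import math

# A reply can only land on a frame boundary, so the sampling interval
# quantises reply times. A toy calculation with a hypothetical event time:

def earliest_reply_time(event_time_sec: float, interval_sec: float) -> float:
    """First frame boundary at or after the moment the answer becomes visible."""
    return math.ceil(event_time_sec / interval_sec) * interval_sec

event = 12.3  # hypothetical moment the answer appears on screen
print(earliest_reply_time(event, 2.0))  # 14.0 s with a 2 s interval
print(earliest_reply_time(event, 1.0))  # 13.0 s with a 1 s interval: 1 s earlier
```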

Experimental Results

On the Proactive Output task of StreamingBench and the PAUC metric of ProactiveVideoQA, MMDuet2 attains the best performance while markedly lowering reply‑repetition rates. Compared with prior proactive models (VideoLLM‑Online, MMDuet), which rely on a per‑frame reply‑probability threshold, MMDuet2’s RL‑driven timing avoids the need for a hand‑tuned threshold and reduces both missed replies and redundant outputs.

Offline video understanding benchmarks (Video‑MME, MVBench, LongVideoBench) show that MMDuet2’s performance remains on par with the original Qwen2.5‑VL, confirming that the SFT + RL pipeline does not degrade general video comprehension.

Conclusion and Outlook

Together, ProactiveVideoQA and MMDuet2 provide a complete solution for proactive interaction in video multimodal models: a benchmark with the PAUC metric and a reinforcement‑learning training method that learns optimal reply timing without precise timestamps. Future work will extend proactive interaction to domain‑specific scenarios by constructing specialized training data.
