ICML 2026: Enabling Multimodal Large Models to Reason Over Time with the Open‑Source TaRO Framework

The paper introduces the Temporal‑Aware Reasoning Optimization (TaRO) framework, which equips multimodal video large models with time‑aware reasoning through template‑based exploration, a temporal‑sensitivity reward, and progressive curriculum learning. It reports state‑of‑the‑art zero‑shot performance on several video temporal grounding benchmarks, including long‑video datasets.

Machine Heart

Jul 3, 2026

Background and Motivation

Video Temporal Grounding (VTG) requires locating the start and end timestamps of events described by natural‑language queries in untrimmed videos. Multimodal large language models (MLLMs) combined with reinforcement learning (RL) can generate reasoning paths, but the reasoning is often superficial and does not identify the visual evidence needed for precise grounding.

Two factors cause this problem:

Inefficient random exploration: Existing RL paradigms explore the large video reasoning space without guidance, so random rollouts mostly follow low‑quality trajectories.

Reward design that ignores reasoning quality: Rewards focus on the final answer (e.g., IoU) and disregard the quality of the reasoning process, so shallow reasoning paths can still be reinforced.
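To make the second point concrete, here is a minimal sketch of an answer‑only reward based on temporal IoU. The function names `temporal_iou` and `answer_only_reward` and the example segments are hypothetical and are not taken from the TaRO codebase; the sketch only illustrates that such a reward never inspects the reasoning text.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def answer_only_reward(reasoning, pred, gt):
    """Hypothetical outcome-only reward: `reasoning` is deliberately unused."""
    return temporal_iou(pred, gt)


# Two rollouts with the same predicted segment but very different reasoning
# receive exactly the same reward (segments are illustrative values).
shallow = answer_only_reward("The event happens somewhere.", (10.0, 20.0), (15.0, 25.0))
careful = answer_only_reward("The towel appears at 15 s, so ...", (10.0, 20.0), (15.0, 25.0))
print(shallow == careful)
```

Because the reward depends only on the predicted segment, a rollout with vague reasoning is reinforced just as strongly as one that cites the right visual evidence.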

Technical Solution: Temporal‑Aware Reasoning Optimization (TaRO)

TaRO introduces three components that train multimodal models to reason with time.

Constructive Reasoning Exploration: Dense video subtitles with timestamps are pre‑generated. Sampled subtitles are concatenated in chronological order, providing high‑quality guidance that teaches the model which visual cues are essential and which are distractors.

Temporal‑Sensitivity Reward: An instance‑level reward evaluates whether a reasoning path anchors on the correct visual segment. When frames near the true event boundaries are perturbed, the probability (logit) of the reasoning path is penalized and the reward signal drops, which pushes the model to generate reasoning tightly coupled with key timestamps.

Progressive Curriculum Learning: Training begins with a warm‑up phase using the constructive exploration data, allowing the model to acquire time‑aware reasoning habits. Afterwards, the model transitions to a free‑exploration phase where the temporal‑sensitivity reward guides autonomous refinement of its reasoning strategy.
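The three components can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Caption` type, the function names, the sentence template, and the example captions are all hypothetical. In particular, the reward is written here as the drop in a reasoning path's log‑probability once boundary frames are perturbed, which is only one possible reading of the description above; the paper's exact formulation may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class Caption:
    start: float  # seconds
    end: float    # seconds
    text: str


def build_guided_reasoning(captions, k, rng):
    """Sample up to k timestamped captions and join them chronologically."""
    sampled = rng.sample(captions, min(k, len(captions)))
    sampled.sort(key=lambda c: c.start)
    return " ".join(
        f"From {c.start:.1f}s to {c.end:.1f}s, {c.text}." for c in sampled
    )


def temporal_sensitivity_reward(logp_clean, logp_perturbed):
    """Assumed form: how far the path's log-probability falls under perturbation."""
    return max(0.0, logp_clean - logp_perturbed)


def exploration_mode(step, warmup_steps):
    """Warm-up uses caption-guided rollouts; afterwards exploration is free."""
    return "constructive" if step < warmup_steps else "free"


# Hypothetical captions, for illustration only.
captions = [
    Caption(12.0, 18.5, "a person picks up a towel"),
    Caption(0.0, 6.0, "a person enters the room"),
    Caption(6.0, 12.0, "a person walks to the sink"),
]
guided = build_guided_reasoning(captions, k=3, rng=random.Random(0))
print(guided)
```

In a real training loop the two log‑probabilities would come from scoring the same reasoning path against the original and the perturbed video, and `warmup_steps` would be a tuned hyperparameter; both are left abstract here.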

Experimental Results

Zero‑Shot Video Temporal Grounding

On four public benchmarks (Charades‑STA, ActivityNet Captions, QVHighlights, and TVGBench), TaRO‑trained models surpass existing state‑of‑the‑art methods. Using Qwen2.5‑VL‑7B‑Instruct as the base model, TaRO improves recall at a fixed IoU threshold on TVGBench by 8.4% over the baseline. Similar gains are observed with the smaller Qwen2.5‑VL‑3B model and the newer Qwen3‑VL‑8B architecture, which suggests the method generalizes across model sizes and families.
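VTG results of this kind are usually reported as recall at an IoU threshold: the fraction of queries whose predicted segment overlaps the ground truth with IoU at or above that threshold. The sketch below assumes this standard definition; the function name `recall_at_iou` and the example segments are hypothetical, and the threshold is left as a parameter.

```python
def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose predicted (start, end) reaches the IoU threshold."""

    def iou(p, g):
        inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
        union = (p[1] - p[0]) + (g[1] - g[0]) - inter
        return inter / union if union > 0 else 0.0

    if not gts:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Illustrative predictions and ground-truth segments, in seconds.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (50.0, 60.0)]
print(recall_at_iou(preds, gts, threshold=0.5))
print(recall_at_iou(preds, gts, threshold=0.3))
```

A stricter threshold can only keep or lower the score, which is why the same model typically shows smaller numbers at higher thresholds.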

Extended Capability on Long Videos

On the long‑video datasets TACoS (average length 367 s) and Ego4D NLQ (average length 499 s), TaRO outperforms baselines by large margins. With Qwen3‑VL‑8B, TaRO raises recall at a fixed IoU threshold by 13.7% on TACoS and by 8.7% on Ego4D NLQ, indicating that the approach remains robust on lengthy videos.

Ablation Study

Adding only the Temporal‑Sensitivity Reward (TR) to a random‑exploration baseline lifts recall at a fixed IoU threshold from 61.1% to 63.1%, indicating that the reward is effective on its own. Removing the free‑exploration stage while keeping only Constructive Reasoning Exploration (CRE) causes a severe drop because test‑time inference cannot rely on external subtitles. Introducing Progressive Curriculum (PC) restores and further improves performance, and the best results are achieved when all three components are combined.

Visualization

A challenging multimodal scenario shows a distracting action (a woman wiping her face) that visually resembles the query “brush face.” TaRO generates fine‑grained intermediate reasoning, correctly anchors the relevant segment from 19.0 s to 37.0 s, discards irrelevant frames, and outputs the accurate temporal prediction.

Conclusion

TaRO addresses the shallow‑reasoning problem of multimodal video models in VTG by introducing constructive reasoning exploration, a temporal‑sensitivity reward, and progressive curriculum learning. Experiments demonstrate marked improvements in reasoning robustness, interpretability, and zero‑shot grounding accuracy across standard and long‑video benchmarks.

Paper: https://arxiv.org/abs/2606.09248v1

Open‑source repository: https://github.com/oceanflowlab/TaRO
