When Swapping Two Images Breaks VLMs: EgoTSR Enables Robots to Judge Real Task Progress

The paper shows that visual language models (VLMs) often rely on chronological bias, treating whichever frame comes later as further along in the task. It introduces EgoTSR, a 46-million-sample ego-centric dataset with a three-stage curriculum that teaches models to assess task state, together with a forward-reverse evaluation protocol on which the trained model exceeds 92% accuracy on long-term robotic tasks.

Machine Heart

Problem: Chronological Bias in Visual Language Models

Consider a manipulator that picks up a cup and then drops it: the later video frame shows a state further from the goal, not closer to it. Human observers detect this regression easily, but many visual language models (VLMs) trained on chronologically ordered robot videos assume that later frames are closer to task completion, which leads to incorrect judgments.

EgoTSR: Dataset and Goal

A research team from Zhejiang University and four other institutions proposes EgoTSR (Ego‑centric Task‑oriented Spatiotemporal Reasoning). The goal is to train VLMs to determine which of two images from the same task video reflects a state nearer to the goal, and to extend this ability to long‑range planning. EgoTSR‑Data contains 46 million ego‑centric samples.

Three‑Stage Curriculum Learning

The training follows a curriculum that mirrors the development of robot capabilities:

Stage 1 (≈15 M CoT samples): The model first describes the spatial state of each image, compares necessary actions, and selects the image that shows more progress, establishing a link between visual state, task goal, and judgment.

Stage 2 (≈16 M Tag samples): Detailed reasoning text is removed, leaving only the image pair, task description, and correct label, encouraging faster direct state assessment.

Stage 3 (≈15 M LongTag samples): The model learns to reason over long‑range tasks, bringing the total to 46 M samples.
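The stage sizes and the Stage 1 to Stage 2 transition can be sketched in a few lines. This is an illustrative sketch only: the `Stage` class, the `focus` strings, and the record field names (`images`, `task`, `reasoning`, `label`) are assumptions made for the example, not the published EgoTSR-Data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str              # data type named in the article
    samples_millions: int  # approximate size reported in the article
    focus: str             # one-line paraphrase of the stage's goal


# Order matters: the article trains CoT first, then Tag, then LongTag.
CURRICULUM = (
    Stage("CoT", 15, "explicit reasoning over spatial state"),
    Stage("Tag", 16, "direct label without reasoning text"),
    Stage("LongTag", 15, "long-range tasks"),
)


def total_samples_millions(stages):
    """Sum the approximate stage sizes (in millions of samples)."""
    return sum(stage.samples_millions for stage in stages)


def to_tag_sample(cot_sample):
    """Drop the reasoning text, keeping image pair, task description, and label."""
    return {key: value for key, value in cot_sample.items() if key != "reasoning"}


# Hypothetical CoT record; field names are illustrative, not the dataset schema.
cot_sample = {
    "images": ("frame_a.jpg", "frame_b.jpg"),
    "task": "put the cup on the table",
    "reasoning": "In image A the gripper is open; in image B the cup is lifted.",
    "label": "B",
}
tag_sample = to_tag_sample(cot_sample)

print(total_samples_millions(CURRICULUM))  # 46
print(sorted(tag_sample))                  # ['images', 'label', 'task']
```

The three approximate stage sizes add up to the 46 M total, and a Tag-style record is simply the CoT-style record with its reasoning text removed.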

Mitigating Chronological Bias

The authors call the model’s reliance on input order “chronological bias.” To expose this shortcut, they present each image pair twice: once in forward order (A, B) and once in reverse order (B, A). If the model’s answer switches position accordingly, so that it selects the same image both times, it is reasoning about task state; if it picks the second image regardless of order, it is relying on position alone.

In experiments, InternVL-8B achieves 99% accuracy on forward inputs but drops to ≈2% when the order is reversed, demonstrating severe bias.
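The forward-reverse test is easy to express in code. The sketch below is a toy illustration, not the authors' evaluation harness: the `Pair` layout, the two stand-in "models", and the `PROGRESS` lookup table are all hypothetical.

```python
from typing import Callable, List, Tuple

# (first image, second image, index of the image closer to the goal: 0 or 1)
Pair = Tuple[str, str, int]
Model = Callable[[str, str], int]


def forward_reverse_accuracy(model: Model, pairs: List[Pair]) -> Tuple[float, float]:
    """Score each pair in forward order (A, B) and in reverse order (B, A)."""
    forward_hits = 0
    reverse_hits = 0
    for image_a, image_b, label in pairs:
        forward_hits += model(image_a, image_b) == label
        # Swapping the inputs moves the correct image to the other position.
        reverse_hits += model(image_b, image_a) == 1 - label
    return forward_hits / len(pairs), reverse_hits / len(pairs)


def always_second(first: str, second: str) -> int:
    """Position-biased stand-in: always claims the second image shows more progress."""
    return 1


# Hypothetical ground-truth progress per frame, used only by the toy model below.
PROGRESS = {"frame_010": 0.1, "frame_020": 0.2, "frame_050": 0.5, "frame_080": 0.8}


def state_aware(first: str, second: str) -> int:
    """Stand-in that compares task state rather than input position."""
    return 0 if PROGRESS[first] > PROGRESS[second] else 1


pairs = [("frame_010", "frame_050", 1), ("frame_020", "frame_080", 1)]
print(forward_reverse_accuracy(always_second, pairs))  # (1.0, 0.0)
print(forward_reverse_accuracy(state_aware, pairs))    # (1.0, 1.0)
```

A model that always answers "second" looks perfect on chronologically ordered pairs and fails completely once they are swapped, which is the pattern the reported InternVL-8B numbers approach.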

Dual‑Level Evaluation Framework

The team introduces a two‑level benchmark:

Short-term (atomic) level: Tests fine-grained spatial changes, such as whether a gripper is closed or an object has entered a container, diagnosing “seeing-wrong.”

Long‑term level: Requires the model to infer overall task progress by combining sub‑task sequences, diagnosing “thinking‑wrong.”

Both levels incorporate forward and reverse tests. Results show an average long-term accuracy of 92.4% and short-term accuracy of about 88%, with only a 0.1% gap between forward and reverse long-term tests.
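One plausible way to read the "average" and "gap" figures is as the mean and the absolute difference of the forward and reverse accuracies. That interpretation is an assumption, and the helper name below is hypothetical. The example reuses the InternVL-8B numbers quoted earlier (99% forward, roughly 2% reverse) to show how a biased model scores under such a summary.

```python
def summarize_forward_reverse(forward_acc: float, reverse_acc: float) -> dict:
    """Return the mean accuracy and the forward-reverse gap, in percentage points."""
    return {
        "mean": (forward_acc + reverse_acc) / 2,
        "gap": abs(forward_acc - reverse_acc),
    }


# Figures quoted in the article for InternVL-8B (the reverse value is approximate).
biased = summarize_forward_reverse(99.0, 2.0)
print(biased)  # {'mean': 50.5, 'gap': 97.0}
```

Under this reading, a large gap with a mean near 50% marks a model that tracks position, while a gap close to zero indicates that its judgments do not depend on input order.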

Subtask Planner and Task Decomposition

EgoTSR adds a Subtask Planner that converts a high‑level instruction (e.g., “open the fridge, take out a drink, place it on the table, and close the fridge”) into an ordered list of atomic actions. This decomposition provides a “logical skeleton” for the model to locate each image within the task timeline.

The model therefore judges not only object presence but also which stage of the task each image represents and what actions remain.
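As a rough illustration of the "logical skeleton" idea, the sketch below hard-codes a decomposition of the fridge instruction and uses it to compare two images by how many sub-tasks each has completed. The plan list stands in for the Subtask Planner's output, and the function names and the "completed sub-task count" representation are assumptions for the example, not the paper's interface.

```python
from typing import List, Optional

# Stand-in for the Subtask Planner's output on the article's example instruction.
PLAN: List[str] = [
    "open the fridge",
    "take out a drink",
    "place it on the table",
    "close the fridge",
]


def remaining_actions(plan: List[str], completed: int) -> List[str]:
    """Actions still to do once the first `completed` sub-tasks are finished."""
    if not 0 <= completed <= len(plan):
        raise ValueError("completed must be between 0 and len(plan)")
    return plan[completed:]


def closer_to_goal(completed_a: int, completed_b: int) -> Optional[str]:
    """Pick the image with more completed sub-tasks; None means a tie."""
    if completed_a == completed_b:
        return None
    return "A" if completed_a > completed_b else "B"


# Image A: drink already taken out (2 sub-tasks done); image B: fridge just opened (1).
print(closer_to_goal(2, 1))        # A
print(remaining_actions(PLAN, 2))  # ['place it on the table', 'close the fridge']
```

Because the comparison depends only on each image's position in the plan, swapping the input order changes which slot the answer lands in but not which image is chosen.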

Ablation Studies

Mixing CoT, Tag, and LongTag data without a curriculum reduces long-task accuracy to 69.6%. Following the explicit-reasoning → ability-internalization → long-range planning sequence raises it to 92.4%, a gain of 22.8 points. Removing the Subtask Planner drops accuracy to 81.1%.

Real‑World Validation

Beyond benchmarks, the authors test EgoTSR on human‑operated videos, simulation environments, and real robot platforms (LIBERO, SIMPLER, RoboTwin, Franka, Agibot, So‑100). In a cup‑placement case, the model processes an uninterrupted video and outputs a task‑completion curve that rises sharply when key sub‑tasks (grasp, place) succeed and stays flat during transport.

This demonstrates that EgoTSR can monitor task progress over long videos, not just compare static image pairs.
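The article does not say how the completion curve is computed. A minimal sketch, assuming the model can report how many key sub-tasks are complete at each frame, is to normalize that count by the number of key sub-tasks. The function names and the per-frame counts below are hypothetical.

```python
from typing import List


def completion_curve(completed_per_frame: List[int], total_subtasks: int) -> List[float]:
    """Map per-frame counts of completed key sub-tasks to a progress value in [0, 1]."""
    if total_subtasks <= 0:
        raise ValueError("total_subtasks must be positive")
    return [count / total_subtasks for count in completed_per_frame]


def change_points(curve: List[float]) -> List[int]:
    """Frame indices at which the curve moves (up or down) from the previous frame."""
    return [i for i in range(1, len(curve)) if curve[i] != curve[i - 1]]


# Hypothetical cup placement with two key sub-tasks (grasp, place):
# reach, reach, grasp succeeds, transport, transport, place succeeds.
completed = [0, 0, 1, 1, 1, 2]
curve = completion_curve(completed, total_subtasks=2)
print(curve)                 # [0.0, 0.0, 0.5, 0.5, 0.5, 1.0]
print(change_points(curve))  # [2, 5]
```

The curve steps up at the grasp and place frames and stays flat during transport. Because it follows state rather than time, a dropped cup (a count falling from 1 back to 0) would pull the curve down again.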

Conclusion

EgoTSR provides a curriculum‑driven path from explicit reasoning to long‑range planning and offers a stricter “ruler” for embodied VLMs by using forward‑reverse image pairs. It raises the critical question: when a model claims to understand robot videos, is it truly reasoning about causality between objects, actions, and goals, or merely exploiting the shortcut that later frames usually indicate progress?
