ICML 2026: MedScope Introduces a New Paradigm for Long Medical Video Reasoning—From Watching to Verifying

MedScope proposes a "Think with Videos" paradigm in which an AI model actively locates and verifies evidence in long clinical videos instead of passively watching them. It combines coarse‑to‑fine tool calling, evidence‑centric training data (ClinVideoSuite), and a grounding‑aware reinforcement learning objective (GA‑GRPO), and reports the highest overall scores among open‑source models on multiple video‑understanding benchmarks.

Data Party THU

Problem

Clinical video streams such as surgery, endoscopy, and interventional procedures are long and contain sparse visual evidence: a decisive cue may appear for only a few seconds within hours of footage. Fixed‑frame sampling often misses these cues, leaving the model without concrete visual evidence to justify its answer.
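A rough back-of-the-envelope calculation illustrates the scale of the problem. The numbers below (a 2-hour video, 64 uniformly sampled frames, a 5-second cue) are illustrative assumptions, not figures from the paper, and the estimate ignores edge effects at the start and end of the video.

```python
def hit_probability(video_seconds, num_frames, cue_seconds):
    """Approximate chance that at least one uniformly spaced frame
    lands inside a cue whose position in the video is random."""
    spacing = video_seconds / num_frames  # seconds between sampled frames
    return min(1.0, cue_seconds / spacing)


# Illustrative numbers only: 2 h of footage, 64 frames, a 5 s cue.
spacing = 7200 / 64
p = hit_probability(7200, 64, 5)
print(f"frame spacing: {spacing:.1f} s, chance of catching the cue: {p:.1%}")
```

Under these assumptions the sampled frames are almost two minutes apart, so a cue lasting a few seconds is far more likely to fall between frames than on one.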

Think with Videos Paradigm

The proposed "Think with Videos" paradigm decomposes a video‑question pair into a multi‑turn process. The model first hypothesizes missing evidence, then invokes tools to retrieve candidate segments or key frames, and finally revises its judgment based on the newly observed visual evidence.
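The multi-turn process can be pictured as a simple loop. The sketch below is a minimal illustration, not the authors' implementation: `think_with_video`, the `policy` callable, the step dictionary format, and the scripted demo are all hypothetical stand-ins for a real model and real video tools.

```python
def think_with_video(question, policy, tools, max_turns=4):
    """Run a hypothesize -> call tool -> revise loop.

    policy(question, observations) returns either
      {"tool": name, "args": tuple}  to request more evidence, or
      {"answer": text}               to stop and answer.
    """
    observations = []  # evidence gathered so far
    for _ in range(max_turns):
        step = policy(question, observations)
        if "answer" in step:
            return step["answer"], observations
        result = tools[step["tool"]](*step["args"])
        observations.append((step["tool"], step["args"], result))
    return None, observations  # turn budget exhausted without an answer


# Scripted stand-in for the model (hypothetical behaviour).
def scripted_policy(question, observations):
    if len(observations) == 0:
        return {"tool": "crop_video", "args": (540, 900)}
    if len(observations) == 1:
        return {"tool": "get_frame", "args": (742,)}
    return {"answer": "bleeding is visible at 742 s"}


# Toy tools that only return labels instead of pixels.
toy_tools = {
    "crop_video": lambda start, end: f"clip[{start}-{end}]",
    "get_frame": lambda timestamp: f"frame@{timestamp}",
}

answer, trace = think_with_video("Is there bleeding?", scripted_policy, toy_tools)
print(answer)
for tool, args, result in trace:
    print(tool, args, "->", result)
```

The trace of tool calls is what makes the final answer checkable: each observation records which segment or frame the model asked to see before it committed to a judgment.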

Method 1 – Coarse‑to‑Fine Tool Calling

Two primitive tools are provided:

crop_video(start, end): extracts a temporal interval from the video.

get_frame(timestamp): returns a key frame at a specified time.

The model first performs a coarse search to locate a candidate interval, then calls crop_video on that interval and get_frame within it for fine‑grained verification. This purpose‑driven frame selection replaces the indiscriminate strategy of simply sampling more frames.
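A toy version of the coarse-to-fine search might look like the following. It is a sketch under stated assumptions: the video is modelled as one text label per second, `ToyVideoTools` is a hypothetical stand-in whose method names mirror the two tools, and the `coarse_match` and `fine_match` predicates play the role of the model's judgment.

```python
class ToyVideoTools:
    """Hypothetical stand-in for the two tools: one string label per second."""

    def __init__(self, frames):
        self.frames = frames

    def crop_video(self, start, end):
        # Return the requested interval, clamped to the video bounds.
        return max(0, start), min(len(self.frames), end)

    def get_frame(self, timestamp):
        return self.frames[timestamp]


def coarse_to_fine(tools, duration, coarse_match, fine_match, stride=60):
    # Coarse pass: sparse sampling to find where the relevant context occurs.
    hits = [t for t in range(0, duration, stride)
            if coarse_match(tools.get_frame(t))]
    if not hits:
        return None
    # Zoom in: crop the hit region plus one stride of margin on each side.
    start, end = tools.crop_video(hits[0] - stride, hits[-1] + stride)
    # Fine pass: inspect every second inside the cropped interval.
    for t in range(start, end):
        if fine_match(tools.get_frame(t)):
            return (start, end), t
    return (start, end), None


# Toy 25-minute video with a one-second cue hidden at 742 s.
frames = ["prep"] * 600 + ["dissection"] * 300 + ["closure"] * 600
frames[742] = "dissection: bleeding"
tools = ToyVideoTools(frames)

result = coarse_to_fine(
    tools,
    duration=len(frames),
    coarse_match=lambda f: f.startswith("dissection"),
    fine_match=lambda f: "bleeding" in f,
)
print(result)
```

The coarse pass never sees the cue itself; it only narrows the search to the phase where the cue could plausibly occur, so the dense inspection is spent on 360 seconds rather than all 1500.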

Method 2 – ClinVideoSuite

Training data are upgraded from plain QA pairs to evidence‑aligned triples. ClinVideoSuite supplies video‑question‑answer samples together with explicit evidence windows and the required tool‑calling sequence. Multi‑level filtering removes questions that can be answered from commonsense or a global summary alone, as well as internally inconsistent ones, so that the remaining samples truly depend on visual evidence. The resulting dataset binds question, answer, and evidence window, providing supervision for both answer generation and evidence retrieval.
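One way to picture an evidence-aligned sample and the filtering step is sketched below. The field names, the `EvidenceSample` class, and the three check flags are assumptions made for illustration; the actual ClinVideoSuite schema and filtering pipeline are not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EvidenceSample:
    """Hypothetical record layout; field names are illustrative only."""
    video_id: str
    question: str
    answer: str
    evidence_window: Tuple[float, float]  # (start_s, end_s) of the evidence
    tool_calls: List[str]                 # expected tool-calling sequence
    # Results of upstream checks (assumed to be computed elsewhere).
    answerable_by_commonsense: bool = False
    answerable_by_summary: bool = False
    internally_consistent: bool = True


def keep(sample: EvidenceSample) -> bool:
    """Keep only samples that genuinely depend on visual evidence."""
    return (not sample.answerable_by_commonsense
            and not sample.answerable_by_summary
            and sample.internally_consistent)


samples = [
    EvidenceSample("v1", "When does bleeding start?", "at 742 s",
                   (740.0, 745.0), ["crop_video(540, 900)", "get_frame(742)"]),
    EvidenceSample("v2", "Is a scalpel sharp?", "yes",
                   (0.0, 1.0), [], answerable_by_commonsense=True),
    EvidenceSample("v3", "What procedure is shown?", "cholecystectomy",
                   (0.0, 10.0), [], answerable_by_summary=True),
]
filtered = [s for s in samples if keep(s)]
print([s.video_id for s in filtered])
```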

Method 3 – GA‑GRPO (Grounding‑Aware Group Relative Policy Optimization)

Standard RL rewards only final answer correctness, which can be achieved by guessing. GA‑GRPO extends the reward to three components:

Answer correctness.

Format compliance.

Evidence reward measuring alignment between predicted and ground‑truth evidence windows.

An additional IoU (intersection‑over‑union) bonus is applied to crop_video calls to encourage precise temporal grounding.
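The reward structure can be sketched as follows. This is an illustration under assumptions, not the paper's formulation: the weights, the use of temporal IoU as the evidence-alignment measure, and the function names `temporal_iou`, `ga_reward`, and `group_relative_advantages` are all hypothetical. The last function shows the group-relative normalization that GRPO-style methods apply to rewards within a group of sampled responses.

```python
import statistics


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def ga_reward(answer_ok, format_ok, pred_window, crop_window, gt_window,
              w_answer=1.0, w_format=0.5, w_evidence=1.0, w_crop=0.5):
    """Combine the reward terms; all weights are assumed values."""
    return (w_answer * float(answer_ok)
            + w_format * float(format_ok)
            + w_evidence * temporal_iou(pred_window, gt_window)
            + w_crop * temporal_iou(crop_window, gt_window))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


gt = (740.0, 750.0)
# A grounded correct answer versus a correct guess with no overlap.
grounded = ga_reward(True, True, (740.0, 750.0), (740.0, 750.0), gt)
guessed = ga_reward(True, True, (0.0, 10.0), (0.0, 10.0), gt)
print(grounded, guessed)
print(group_relative_advantages([grounded, guessed, guessed, grounded]))
```

With answer correctness alone, the grounded response and the lucky guess would receive the same reward; the evidence terms separate them, so the grounded response gets a positive advantage within its group.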

Experimental Results

MedScope‑7B‑RL was evaluated on SVU‑31K, ClinVideo‑Eval, and related benchmarks covering full‑video description, fine‑grained understanding, temporal reasoning, perception reasoning, temporal grounding, and grounded VQA. It achieved the highest overall scores among open‑source models and demonstrated strong cross‑domain generalization on clinical video tasks.

Ablation studies show that removing the evidence reward or the IoU bonus significantly degrades temporal grounding performance, confirming the necessity of evidence‑centric rewards.

Impact

The system enables a medical video agent to request evidence, invoke tools to retrieve it, locate the relevant segment, and present a visual justification to human experts. This traceable evidence capability is essential for surgical training, postoperative review, quality control, robotic assistance, and real‑time decision support.

Paper: https://arxiv.org/abs/2602.13332

Code: https://github.com/SII-WenjieLisjtu/MedScope

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Reinforcement Learning, Tool Calling, Multimodal LLM, Evidence-based QA, Long Video Reasoning, Medical Video AI
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
