Breaking the ‘See‑then‑Think’ Barrier: Real‑Time ‘See‑and‑Think’ for VLMs (CVPR 2026)
The paper introduces TaYS (Think‑as‑You‑See), a streaming chain‑of‑thought framework that replaces the traditional “see‑then‑think” video inference pipeline with a parallel, real‑time “see‑and‑think” approach, sharply reducing latency while improving accuracy on complex video‑reasoning tasks.
Large vision‑language models (VLMs) excel at offline video analysis but struggle in real‑time settings because they follow a “see‑then‑think” pipeline: ingest the full video, encode it as one sequence, reason, then answer. This yields uncontrolled time‑to‑first‑token (TTFT) latency and evidence mismatch, as early cues are drowned out in long sequences.
Why Chain‑of‑Thought (CoT) hurts streaming inference
Integrating CoT into existing streaming methods forces the model into prolonged reasoning turns that occupy the generation channel, so new frames cannot be processed while the model thinks. Interrupting the CoT breaks reasoning continuity; letting it run to completion means answering from stale evidence. Either way, real‑time CoT is infeasible.
TaYS: Think‑as‑You‑See
The proposed TaYS framework rewrites inference as a truly streaming process that grows with each incoming frame. It achieves this through three key engineering innovations:
Streaming attention mask: inference tokens can only attend to frames that have already arrived, preventing “future leakage” and ensuring causal temporal reasoning.
Decoupled positional encoding: separates the physical time order of video frames from the logical order of reasoning tokens, giving visual tokens and reasoning tokens independent position indices and avoiding cross‑modal indexing conflicts.
Dual KV‑Cache: maintains separate visual and reasoning key‑value caches. The visual KV‑Cache continuously writes new frame features while the reasoning KV‑Cache generates the chain‑of‑thought and answers in parallel, allowing the model to consume new frames while thinking.
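The no‑future‑leakage constraint of the streaming attention mask can be sketched as a boolean allowed‑attention matrix. This is a minimal, hypothetical illustration (the function name `build_streaming_mask` and the one‑visual‑token‑per‑frame simplification are mine, not the paper's):

```python
def build_streaming_mask(frame_times, reason_times):
    """Allowed-attention matrix from reasoning tokens to visual tokens.

    frame_times[i]  -- step at which visual token i arrived
    reason_times[j] -- step at which reasoning token j is generated
    True means "may attend"; a reasoning token never sees a frame
    that arrives after it, so there is no future leakage.
    """
    return [[t_f <= t_r for t_f in frame_times] for t_r in reason_times]


# Frames arrive at steps 0, 1, 2; reasoning tokens are emitted at steps 1 and 2.
# The token emitted at step 1 cannot attend to the frame arriving at step 2.
mask = build_streaming_mask([0, 1, 2], [1, 2])
```

In a real model this mask would be combined with the usual causal mask over reasoning tokens and passed to the attention kernel.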
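Decoupled positional encoding amounts to running an independent position counter per modality, so the interleaving of the two streams does not perturb either ordering. A toy sketch under that assumption (the helper `decoupled_positions` is hypothetical, not the paper's API):

```python
def decoupled_positions(modalities):
    """Assign each modality its own position counter.

    Visual tokens keep their physical frame order and reasoning tokens
    keep their logical generation order, regardless of how the two
    streams interleave in the actual token sequence.
    """
    counters = {}
    positions = []
    for m in modalities:
        positions.append(counters.get(m, 0))
        counters[m] = positions[-1] + 1
    return positions


# Interleaved stream: frame, frame, reasoning token, frame, reasoning token.
pos = decoupled_positions(["vis", "vis", "reason", "vis", "reason"])
```

With a single shared counter, a late‑arriving frame would be pushed to a large position index by the reasoning tokens emitted before it; separate counters avoid that cross‑modal indexing conflict.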
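The dual KV‑Cache idea, stripped of tensors and attention math, is two independently growing buffers whose union forms the attention context. A minimal sketch (class and method names are illustrative, not from the paper):

```python
class DualKVCache:
    """Toy dual cache: frame features and reasoning-token KV entries are
    appended by independent paths, so frame ingestion never has to wait
    for chain-of-thought decoding to finish."""

    def __init__(self):
        self.visual = []     # written by the frame-ingestion path
        self.reasoning = []  # written by the CoT decoding path

    def append_frame(self, kv):
        self.visual.append(kv)

    def append_reasoning(self, kv):
        self.reasoning.append(kv)

    def context(self):
        # Attention context at this moment: all frames seen so far,
        # followed by the reasoning trace generated so far.
        return self.visual + self.reasoning


cache = DualKVCache()
cache.append_frame("f0")
cache.append_reasoning("r0")  # model thinks...
cache.append_frame("f1")      # ...while a new frame streams in
```

The key design point is that `append_frame` and `append_reasoning` touch disjoint buffers, which is what lets the model keep consuming frames mid‑thought.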
Experimental results
Evaluated on mainstream backbones such as Qwen2.5‑VL, TaYS consistently outperforms batch‑processing and naïve interleaved‑streaming baselines on tasks requiring dynamic event understanding, causal inference, and topic comprehension, achieving higher accuracy with a dramatically lower TTFT and lower, more stable end‑to‑end latency.
Ablation studies confirm that removing any of the three components degrades performance: without the dual KV‑Cache, latency rebounds; without decoupled positional encoding, temporal‑reasoning errors increase; without the streaming mask, the model “peeks” at future frames, violating the real‑time constraint.
Implications
TaYS moves VLMs from offline analysis toward online intelligence, enabling applications such as robot/embodied agents that issue timely commands, surveillance systems that issue real‑time alerts, and live‑stream/education platforms that provide instant summaries and interactive Q&A. The authors argue that streaming reasoning is likely to become the default paradigm for next‑generation multimodal systems.
