Why Traditional Video Captions Fail and How MTSS Solves the Problem
The article introduces Multi-Stream Scene Script (MTSS), a structured JSON-based video description paradigm that replaces monolithic captions. It explains the design principles behind MTSS, contrasts it with traditional captioning, and presents experimental evidence of significant gains in both video understanding and generation tasks.
Tencent Hunyuan's team proposes Multi-Stream Scene Script (MTSS), a new video description paradigm that upgrades the traditional "one‑paragraph caption" into a multi‑stream structured script, guided by the principles of Stream Factorization and Relational Grounding.
Traditional monolithic captions suffer from semantic redundancy and ambiguity (repeated mentions of "the man in a suit" may refer to different characters), poor scalability (changing one detail forces rewriting the whole paragraph), and poor accessibility for small models, whose captioning performance drops sharply relative to that of large models.
MTSS addresses these issues by representing a video as a JSON script composed of four parallel streams: Reference Stream (asset information), Event Stream (what happens), Shot Stream (how it is presented), and Global Stream (overall context). Stream Factorization isolates these aspects, while Relational Grounding links them through identity anchoring (global entity references) and temporal anchoring (alignment across tracks), enabling local edits without breaking global consistency.
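To make the layout concrete, below is a minimal sketch of an MTSS script, written as a Python literal mirroring the JSON. The four stream names come from the article; every field name, entity ID, and value inside them is an illustrative assumption, since the exact schema is not reproduced here.

```python
# A minimal, hypothetical MTSS script. Stream names follow the article;
# every field name and value below is an illustrative assumption.
mtss_script = {
    "global_stream": {
        "summary": "A tense negotiation in a dimly lit office.",
        "style": "noir, handheld camera",
    },
    "reference_stream": {
        # Asset entries; entity IDs act as global identity anchors.
        "char_01": {"type": "character", "desc": "man in a gray suit"},
        "char_02": {"type": "character", "desc": "woman with a briefcase"},
    },
    "event_stream": [
        # Temporal anchors (start/end, seconds) align events with shots.
        {"start": 0.0, "end": 3.2, "subject": "char_01",
         "description": "char_01 slides a folder across the desk",
         "line": "Take a look before you decide."},
    ],
    "shot_stream": [
        {"start": 0.0, "end": 3.2, "shot": "medium close-up",
         "camera": "slow push-in on char_01"},
    ],
}
```

Because char_01 is defined once in the Reference Stream and only referenced elsewhere (identity anchoring), re-describing a character is a single local edit, while the shared timestamps (temporal anchoring) keep the Event and Shot Streams aligned.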
Compared with monolithic captions, MTSS aligns with the intrinsic structure of video data, provides global identity consistency, is easier to extend and understand, and supports professional editing techniques such as ReactionShot, L‑Cut, and J‑Cut.
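To illustrate one of these techniques: a J-Cut (the incoming scene's audio leading the visual cut) falls out naturally from temporal anchoring, since an Event Stream entry can simply start before the corresponding Shot Stream boundary. The fields below continue the hypothetical schema sketched above.

```python
# Hypothetical J-Cut: the dialogue event starts 0.8 s before the visual
# cut to shot 2, so the next scene's audio leads the picture.
event = {"start": 2.4, "end": 6.0, "subject": "char_02",
         "line": "We had a deal.",
         "description": "char_02 speaks, at first off-screen"}
shots = [
    {"start": 0.0, "end": 3.2, "shot": "medium close-up on char_01"},
    {"start": 3.2, "end": 6.0, "shot": "wide shot of the office"},
]
# Audio lead = shot boundary (3.2) - event start (2.4) = 0.8 s.
```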
In video-understanding experiments, the team evaluated two settings: zero-shot prompting (asking the model to output MTSS directly) and supervised fine-tuning on MTSS data. Zero-shot prompting alone yields consistent improvements, especially for small models, and fine-tuning amplifies them further. Notably, the downstream inference boost from MTSS exceeds the gain measured on the captioning task itself, and MTSS acts as a "cognitive scaffold" that narrows the gap between models. The key findings, with a sketch of the prompting setup after this list:
Zero‑shot prompting brings universal gains.
MTSS reduces cognitive load, benefiting small models.
Supervised fine‑tuning releases the full potential of the MTSS design.
Inference improvements far surpass caption‑only gains.
MTSS serves as a scaffold that compresses model differences.
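Here is a minimal sketch of what the zero-shot setup could look like in practice. The prompt wording, the build_messages helper, and the chat-payload shape are all illustrative assumptions, not the team's actual protocol.

```python
# Hypothetical zero-shot prompt: ask a video-capable model to emit an
# MTSS script directly instead of a monolithic caption.
ZERO_SHOT_PROMPT = """Describe the video as a Multi-Stream Scene Script (MTSS)
in JSON with four streams:
- reference_stream: every recurring entity, each with a stable ID
- event_stream: what happens, with start/end times, referencing entity IDs
- shot_stream: how it is filmed (shot size, camera motion), with times
- global_stream: overall summary and style
Return JSON only."""

def build_messages(video_uri: str) -> list[dict]:
    # Generic chat-style payload; adapt to whatever multimodal API is used.
    return [{"role": "user",
             "content": [{"type": "video", "uri": video_uri},
                         {"type": "text", "text": ZERO_SHOT_PROMPT}]}]
```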
For video generation, the researchers adapted the open-source LTX-2 model with Shot-Aware Structured Attention and an Identity Customization module. Driven by MTSS, the pipeline demonstrated stronger multi-shot generation, markedly better identity consistency, and clearer audio-visual synchronization, aided by the explicit "line" and "description" fields in the Event Stream.
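The article does not detail Shot-Aware Structured Attention, but one plausible reading is an attention mask that confines most token interactions to within a shot while keeping identity/reference tokens globally visible, which would also explain the improved ID consistency. The sketch below encodes that reading as an assumption, not the team's implementation.

```python
import torch

def shot_aware_mask(shot_ids: torch.Tensor,
                    global_tokens: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    shot_ids:      (N,) shot index per token, e.g. derived from the Shot Stream.
    global_tokens: (N,) True for tokens kept globally visible, e.g.
                   identity/reference embeddings (hypothetical design).
    """
    same_shot = shot_ids[:, None] == shot_ids[None, :]  # within-shot attention
    to_global = global_tokens[None, :]                  # anyone can read globals
    from_global = global_tokens[:, None]                # globals can read anyone
    return same_shot | to_global | from_global

# Example: 6 tokens across 2 shots; token 0 is a reference/identity token.
shot_ids = torch.tensor([0, 0, 0, 1, 1, 1])
global_tokens = torch.tensor([True, False, False, False, False, False])
mask = shot_aware_mask(shot_ids, global_tokens)  # (6, 6) boolean mask
```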
In conclusion, MTSS is more than a caption format: it functions as a "cognitive scaffold" that brings video descriptions closer to the medium's natural structure, enabling controllable, long-form, multi-shot audio-visual generation and pointing to a promising data-engineering direction for the next generation of video models.