OneStory Enables Minute-Long, Ten-Shot Video Generation with Consistent Narrative

The OneStory paper presented at CVPR 2026 introduces an adaptive‑memory framework for coherent multi‑shot video generation, reformulating the task as next‑shot generation and using Frame Selection and Adaptive Conditioner modules to maintain long‑range context while supporting both text‑to‑multi‑shot and image‑to‑multi‑shot synthesis.

Machine Heart

Multi‑shot video generation is a demanding research direction because it must preserve stable elements such as character identity and scene layout across shots while allowing natural narrative changes like viewpoint shifts and action progression. Existing approaches either rely on a fixed‑size window that discards early‑shot information as it slides, or generate keyframes first and then condition each shot on those frames, which limits interaction between shots and hampers the transmission of detailed story information.

OneStory, presented at CVPR 2026 by researchers from Meta and the University of Copenhagen, tackles this core problem by introducing a compact yet global cross‑shot memory mechanism. The authors reformulate multi‑shot generation as a next‑shot generation problem, generating each shot autoregressively based on previously generated shots (shot‑by‑shot generation). This design allows the model to treat video synthesis like storytelling, using the already created context to produce the next segment.

The system builds on a pretrained image‑to‑video foundation model and inherits its strong visually conditioned generation capabilities. The first shot can be produced by any text‑to‑video or image‑to‑video model; OneStory then generates each subsequent shot conditioned on that shot's prompt and the adaptive memory.
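The shot‑by‑shot loop described above can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' code: `generate_story`, `build_memory`, `generate_shot`, and the toy stand‑ins are hypothetical names, and frames are represented as plain strings so the example runs without a video model.

```python
from typing import Callable, List, Sequence

Shot = List[str]  # stand-in for a list of frames; a real system would use tensors


def generate_story(
    first_shot: Shot,
    prompts: Sequence[str],
    build_memory: Callable[[List[Shot], str], List[str]],
    generate_shot: Callable[[str, List[str]], Shot],
) -> List[Shot]:
    """Generate shots one at a time, each conditioned on memory of earlier shots."""
    shots = [first_shot]
    for prompt in prompts:
        # Memory is rebuilt from ALL earlier shots, not from a sliding window.
        memory = build_memory(shots, prompt)
        shots.append(generate_shot(prompt, memory))
    return shots


# Hypothetical stand-ins so the loop runs end to end.
def toy_memory(history: List[Shot], prompt: str) -> List[str]:
    return [frame for shot in history for frame in shot]


def toy_generator(prompt: str, memory: List[str]) -> Shot:
    return [f"{prompt}|ctx={len(memory)}"]


story = generate_story(["shot0-frame"], ["a", "b"], toy_memory, toy_generator)
```

Here the first shot is supplied from outside the loop, matching the description that any text‑to‑video or image‑to‑video model can produce it.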

Two key modules enable the adaptive memory:

Frame Selection automatically picks the most semantically relevant historical frames for the current shot prompt, recognizing that not all previous shots are equally important. For example, when the third shot returns to the main character after a secondary‑character shot, the first shot is more critical than the second, and the module selects frames accordingly.

Adaptive Conditioner compresses the selected frames into efficient conditioning signals through adaptive patchification. Important information is retained with fine‑grained patches, while less critical content is heavily compressed, keeping computational cost manageable while providing a concise context to the generator.
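To make the two modules concrete, here is a minimal sketch under several assumptions that are not taken from the paper: relevance is scored with cosine similarity between a prompt embedding and per‑frame embeddings, only two patch sizes are used, and each selected frame is a 32×32 latent grid. All function names and numbers are hypothetical, and the actual modules may work quite differently.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_frames(prompt_emb: Sequence[float],
                  frame_embs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k historical frames most similar to the prompt, best first."""
    order = sorted(range(len(frame_embs)),
                   key=lambda i: cosine(prompt_emb, frame_embs[i]),
                   reverse=True)
    return order[:k]


def patch_sizes(n_selected: int, n_fine: int = 1,
                fine: int = 2, coarse: int = 8) -> List[int]:
    """Fine patches for the top-ranked frames, coarse patches for the rest."""
    return [fine if rank < n_fine else coarse for rank in range(n_selected)]


def token_count(side: int, patch: int) -> int:
    """Number of tokens when a side x side grid is cut into patch x patch tiles."""
    return (side // patch) ** 2


# Frames 0-1 stand for a main-character shot, frames 2-3 for a secondary one.
frame_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prompt_emb = [1.0, 0.05]  # prompt that returns to the main character

picked = select_frames(prompt_emb, frame_embs, k=2)
sizes = patch_sizes(len(picked))
tokens = sum(token_count(32, p) for p in sizes)
```

With these toy values the two main‑character frames are selected, the best one is kept at fine resolution, and the conditioning context costs 272 tokens instead of the 512 that two finely patchified frames would need.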

The authors also redesign the data construction pipeline. Instead of a full‑story script with predefined shot definitions, the dataset contains only shot‑level prompts that include referential relations to previous shots, mirroring natural storytelling and simplifying user control.
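One way to picture such shot‑level prompts is as a list of records whose text may refer back to earlier shots. The sketch below is purely illustrative: the `ShotPrompt` class, the `refers_to` field, and the example captions are assumptions for this article, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShotPrompt:
    index: int                      # position of the shot in the story
    text: str                       # shot-level caption, no global script
    refers_to: List[int] = field(default_factory=list)  # hypothetical: earlier shots referenced


def validate(prompts: List[ShotPrompt]) -> bool:
    """Check that every reference points to a strictly earlier shot."""
    for p in prompts:
        for r in p.refers_to:
            if not 0 <= r < p.index:
                raise ValueError(f"shot {p.index} refers to invalid shot {r}")
    return True


story_prompts = [
    ShotPrompt(0, "A woman in a red coat waits on a train platform."),
    ShotPrompt(1, "A conductor checks his watch."),
    ShotPrompt(2, "The same woman from the first shot boards the train.", [0]),
]
is_valid = validate(story_prompts)
```

Because each record only looks backward, a user can extend a story by appending one more prompt without rewriting a full script.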

Experimental results show that OneStory keeps characters and environments consistent even under rapidly changing prompts. Qualitative comparisons demonstrate more faithful adherence to shot‑level captions and superior narrative coherence. Specific evaluations highlight (1) preservation of character identity despite appearance changes, (2) accurate spatial localization when transitioning from wide shots to close‑ups, and (3) sustained narrative continuity during interactions between characters and objects. These findings indicate that OneStory learns not just visual continuity but also cross‑shot narrative understanding.

In summary, OneStory addresses the fundamental question of multi‑shot video generation, namely how to tell a story, without naively expanding the context window or relying on single keyframes. Its adaptive memory modeling balances global context representation with computational efficiency, offering a promising direction for long‑duration, high‑consistency video synthesis and controllable world models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
