How Context-as-Memory Enables Scene‑Consistent Long Video Generation

This article introduces Context-as-Memory, an approach that treats previously generated video frames as memory to achieve scene‑consistent interactive long video generation. It also details a memory retrieval mechanism based on camera trajectories, which improves both efficiency and performance over existing state‑of‑the‑art methods.

Kuaishou Tech

Overview

Recent advances in video generation models have shown great promise for creating realistic simulations of the physical world, but long‑duration generation still suffers from a lack of stable scene memory, causing abrupt visual changes when the camera moves.

Problem

Existing methods rely on a limited temporal window of past frames, which cannot maintain consistent scene understanding over extended periods; this limits applications in gaming, autonomous driving, and embodied AI.

Proposed Method: Context‑as‑Memory

The authors propose treating the entire history of generated frames as a memory bank. Conditioning on these frames as context lets the model implicitly learn 3D priors without explicit 3D modeling and keep the scene consistent across long video sequences.
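The difference between a limited temporal window and a full-history memory bank can be illustrated with a minimal sketch. The class names `SlidingWindowContext` and `FrameMemoryBank` are hypothetical and chosen only for this example; integer frame ids stand in for real frames, and nothing here is taken from the authors' code.

```python
from collections import deque


class SlidingWindowContext:
    """Keeps only the most recent `window_size` frames (hypothetical baseline)."""

    def __init__(self, window_size):
        # A deque with maxlen silently drops the oldest frame when full.
        self._frames = deque(maxlen=window_size)

    def add(self, frame):
        self._frames.append(frame)

    def context(self):
        return list(self._frames)


class FrameMemoryBank:
    """Keeps every generated frame so any of them can serve as context later."""

    def __init__(self):
        self._frames = []

    def add(self, frame):
        self._frames.append(frame)

    def context(self):
        return list(self._frames)


window = SlidingWindowContext(window_size=8)
memory = FrameMemoryBank()
for frame_id in range(100):  # integer ids stand in for generated frames
    window.add(frame_id)
    memory.add(frame_id)

# Frame 0 has fallen out of the window but is still available in the memory bank.
print(0 in window.context(), 0 in memory.context())
```

If the camera returns to the area first seen in frame 0, only the memory bank can still supply that frame as context. The price is that the context grows with every generated frame, which is the cost the retrieval step addresses.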

Memory Retrieval

To avoid the prohibitive cost of using all past frames, a Memory Retrieval module selects a small set of relevant frames based on camera‑trajectory field‑of‑view (FOV) overlap, dramatically reducing computational load while preserving essential contextual information.
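The retrieval idea can be sketched in a simplified 2D setting. This is an illustration under stated assumptions, not the authors' implementation: each camera is reduced to a position and a yaw angle on a plane, its view is modeled as a circular sector, and overlap is estimated by sampling points in the current view and counting how many a past camera also sees. The names `CameraPose`, `fov_overlap`, and `retrieve_memory`, and the parameters `fov_deg`, `max_depth`, and `k`, are all hypothetical.

```python
import math
from dataclasses import dataclass

EPS = 1e-9  # tolerance so points on the sector boundary are not lost to rounding


@dataclass(frozen=True)
class CameraPose:
    x: float
    y: float
    yaw_deg: float  # viewing direction, counter-clockwise from the +x axis


def sees(cam, px, py, fov_deg, max_depth):
    """True if point (px, py) lies inside the camera's view sector."""
    dx, dy = px - cam.x, py - cam.y
    dist = math.hypot(dx, dy)
    if dist == 0.0 or dist > max_depth + EPS:
        return False
    angle = math.degrees(math.atan2(dy, dx))
    # Signed angular difference wrapped into [-180, 180).
    diff = (angle - cam.yaw_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0 + EPS


def sample_view_points(cam, fov_deg, max_depth, n_rays=9, n_depths=5):
    """Sample points along evenly spaced rays inside the camera's view sector."""
    points = []
    for i in range(n_rays):
        offset = -fov_deg / 2.0 + fov_deg * i / (n_rays - 1)
        theta = math.radians(cam.yaw_deg + offset)
        for j in range(1, n_depths + 1):
            r = max_depth * j / n_depths
            points.append((cam.x + r * math.cos(theta), cam.y + r * math.sin(theta)))
    return points


def fov_overlap(current, past, fov_deg=90.0, max_depth=10.0):
    """Fraction of the current view's sample points that the past camera also sees."""
    points = sample_view_points(current, fov_deg, max_depth)
    visible = sum(sees(past, px, py, fov_deg, max_depth) for px, py in points)
    return visible / len(points)


def retrieve_memory(current, history, k=2, fov_deg=90.0, max_depth=10.0):
    """Return indices of up to k past frames with the largest non-zero overlap."""
    scored = []
    for idx, pose in enumerate(history):
        score = fov_overlap(current, pose, fov_deg, max_depth)
        if score > 0.0:
            scored.append((score, idx))
    # Highest overlap first; ties are broken by the earlier frame index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]


history = [
    CameraPose(0.0, 0.0, 0.0),    # same view as the current camera
    CameraPose(0.0, 0.0, 180.0),  # looking the opposite way
    CameraPose(0.5, 0.0, 10.0),   # slightly shifted and rotated
    CameraPose(0.0, 0.0, 120.0),  # rotated away, no shared view
]
current = CameraPose(0.0, 0.0, 0.0)
selected = retrieve_memory(current, history, k=2)
print(selected)  # [0, 2]
```

Only frames 0 and 2 share part of the current view, so they are the only ones passed on as context. The frames facing away are skipped no matter how recent they are.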

Experiments

The authors collected a diverse dataset of long videos with precise camera trajectories using Unreal Engine 5. Experiments show that Context‑as‑Memory outperforms current SOTA approaches in maintaining scene memory and generalizes to scenes not seen during training.

Conclusion

Context‑as‑Memory achieves scene‑consistent interactive long video generation without explicit 3D assistance, offering a scalable solution for future world‑model applications.

References

Context as Memory: Scene‑Consistent Interactive Long Video Generation with Memory Retrieval (arXiv:2506.03141)

A Survey of Interactive Generative Video (arXiv:2504.21853)

Position: Interactive Generative Video as Next‑Generation Game Engine (arXiv:2503.17359)

GameFactory: Creating New Games with Generative Interactive Videos (ICCV 2025 Highlight)

Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
