From Clip Generation to Long‑Video Roaming: OmniRoam Enables Stable, Trajectory‑Controlled Video Synthesis

OmniRoam is a panoramic, coarse‑to‑fine framework for generating long, trajectory‑controlled videos. It targets better spatial consistency and temporal coherence than short‑clip generation and supports real‑time preview, high‑resolution refinement, and 3D reconstruction.

Machine Heart

Problem

Generative video models can produce high‑quality short clips, but extending them to long sequences runs into two problems: structural drift as the viewpoint changes, and content that becomes inconsistent over time. Together these make long outputs chaotic. Many applications also need the camera to follow a prescribed path, which requires explicit trajectory control.

OmniRoam Method

OmniRoam uses panoramic video as a unified representation and a coarse‑to‑fine, two‑stage generation pipeline that explicitly models camera trajectories.

Stage 1 – Trajectory‑Controlled Preview

The model first generates a medium‑resolution panoramic preview that defines the overall path and scene layout. Camera motion is decomposed into flow (direction) and scale (step size). Input and target videos are concatenated along the temporal axis and conditioned on flow and scale, producing a preview that respects both visual content and trajectory constraints.
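One way to picture the flow/scale split is to take a list of camera positions and separate each step into a unit direction and a step length. The sketch below is only an illustration under that assumption: the function name `decompose_motion` and the 3D-tuple input format are hypothetical, and the article does not say how OmniRoam actually encodes flow and scale.

```python
import math

def decompose_motion(positions):
    """Split a camera path into per-step unit directions ("flow")
    and step lengths ("scale"). Hypothetical helper for illustration."""
    flows, scales = [], []
    for (x0, y0, z0), (x1, y1, z1) in zip(positions, positions[1:]):
        dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
        step = math.sqrt(dx * dx + dy * dy + dz * dz)
        if step == 0.0:
            # A stationary step has no well-defined direction.
            flows.append((0.0, 0.0, 0.0))
        else:
            flows.append((dx / step, dy / step, dz / step))
        scales.append(step)
    return flows, scales

# Toy path: move 2 units forward, then 5 units diagonally.
path = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0), (3.0, 0.0, 6.0)]
flows, scales = decompose_motion(path)
print(flows)
print(scales)
```

Separating direction from magnitude means the same heading can be replayed at different speeds by changing only the scale values.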

Stage 2 – Long‑Horizon Refinement

A visibility mask selects a sparse set of preview frames as conditioning inputs, preserving structural anchors while avoiding redundancy. Each segment is up‑sampled to high resolution and stitched together, mitigating error accumulation across long sequences and yielding a realistic‑speed video.
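The idea of sparse conditioning plus segment-wise refinement can be sketched as follows. This is a minimal illustration, not the OmniRoam implementation: the fixed `stride`, the shared boundary frame between segments, and both function names are assumptions made here, since the article does not specify how the mask or the segments are constructed.

```python
def visibility_mask(num_frames, stride):
    """Return a 0/1 list marking which preview frames are kept as
    conditioning anchors (hypothetical fixed-stride rule)."""
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")
    mask = [1 if i % stride == 0 else 0 for i in range(num_frames)]
    mask[-1] = 1  # also keep the final frame so both ends are anchored
    return mask

def split_segments(num_frames, segment_len):
    """Split frame indices into consecutive (start, end) segments.
    Neighbouring segments share one boundary frame, which gives a
    natural place to stitch them back together."""
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2")
    segments, start = [], 0
    while start < num_frames - 1:
        end = min(start + segment_len - 1, num_frames - 1)
        segments.append((start, end))
        start = end
    return segments

print(visibility_mask(9, 4))
print(split_segments(9, 5))
```

Because each segment is refined against anchors taken from the preview rather than from the previous refined segment, errors in one segment do not feed directly into the next.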

Dataset and Evaluation

A new dataset is built with a canonical panoramic coordinate system that removes camera roll and retains only translation, simplifying trajectory modeling. The data combine real panoramic videos (diverse scenes) and synthetic sequences (precise trajectory supervision). Evaluation introduces a “loop consistency” metric that requires a video to return to its starting point while maintaining coherent intermediate changes, alongside standard metrics FAED, SSIM, and LPIPS.

Experimental Results

OmniRoam outperforms existing baselines on visual quality, trajectory control, and long‑term consistency. Qualitatively, it produces stable, continuous videos with reduced structural drift. Quantitatively, it achieves better scores on FAED, SSIM, LPIPS, and loop consistency (FAED and LPIPS are distances, so lower is better; for SSIM, higher is better). Analysis shows that the panoramic representation and two‑stage design are critical, especially on 641‑frame sequences where autoregressive and perspective‑based baselines degrade. A closed‑loop experiment measures CLIP similarity over the trajectory: similarity drops as the camera moves away from the start and rises again when the loop closes, indicating strong long‑range spatial memory.
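The closed-loop measurement can be sketched as a cosine similarity between each frame's embedding and the first frame's embedding. The example below uses hand-made 2D vectors in place of real CLIP features, so it only shows the shape of the computation; the function names and toy values are assumptions, not the paper's evaluation code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def similarity_to_start(embeddings):
    """Similarity of every frame embedding to the first frame's embedding."""
    first = embeddings[0]
    return [cosine(first, e) for e in embeddings]

# Toy stand-ins for per-frame image embeddings along a closed loop:
# the vectors rotate away from the start and then come back.
toy = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
curve = similarity_to_start(toy)
print([round(s, 3) for s in curve])
```

A curve that falls toward the middle of the loop and recovers at the end is the pattern the article describes; a curve that stays low at the end would indicate that the scene at the starting point was not reproduced.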

Efficiency and 3D Applications

Using a self‑forcing mechanism, the full model is distilled into a lightweight autoregressive preview model that generates 81 frames of panoramic video in about 7 seconds, enabling interactive use. The preview can be refined to higher resolutions (e.g., 720p). For 3D reconstruction, keyframes sampled from the generated video are fed into a 3D Gaussian Splatting pipeline, producing consistent multi‑view 3D scenes.
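The figures above imply a preview throughput of roughly 81 / 7 ≈ 11.6 frames per second. For the reconstruction step, one simple way to pick keyframes is even spacing over the clip. The sketch below assumes that strategy; `sample_keyframes` is a hypothetical helper, and the article does not say how OmniRoam selects keyframes for 3D Gaussian Splatting.

```python
def sample_keyframes(num_frames, num_keyframes):
    """Pick evenly spaced frame indices, always including the first
    and last frame (hypothetical sampling rule)."""
    if num_keyframes < 2 or num_keyframes > num_frames:
        raise ValueError("need 2 <= num_keyframes <= num_frames")
    last = num_frames - 1
    return [round(i * last / (num_keyframes - 1)) for i in range(num_keyframes)]

# Throughput implied by the article: 81 frames in about 7 seconds.
preview_fps = 81 / 7
print(f"approx. preview throughput: {preview_fps:.1f} frames/s")

# Five evenly spaced keyframes from an 81-frame clip.
keyframes = sample_keyframes(81, 5)
print(keyframes)
```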

Resources

Paper: https://arxiv.org/pdf/2603.30045

Project page: https://yuheng.ink/project-page/omniroam/

Code: https://github.com/yuhengliu02/OmniRoam
