CameraSquad: Precise Camera Control and Multi‑View Consistency for Spatially Intelligent Video Models
CameraSquad introduces a parallel multi‑trajectory video generation framework that delivers precise camera control and cross‑view content consistency, enabling high‑quality 3D point‑cloud reconstruction and superior performance on benchmarks such as WebVid and HumanVid compared with prior camera‑controlled video methods.
Research Background
Camera‑controlled video generation has become a key research direction in video synthesis and spatial intelligence. Existing works fall into implicit control (e.g., CameraCtrl, MotionCtrl, CamCo, Direct‑a‑Video, ReCamMaster) and explicit modeling (e.g., ViewCrafter, Gen3C). All of them rely on single‑trajectory serial inference, which hampers efficiency, camera precision, and cross‑view content consistency, and consequently degrades downstream 4D reconstruction and immersive AR/VR applications.
CameraSquad Overview
CameraSquad, built on the Wan2.2 video diffusion model, proposes a framework that supports parallel generation of multiple camera trajectories while maintaining precise camera control and cross‑view consistency. Given an input video and several target camera parameter sets, the system produces spatially consistent videos and a dense dynamic 3D point cloud for downstream tasks.
Algorithmic Design
Decoupled Content and Camera Attention. The model separates world‑content information from camera‑position information. Content‑Attention replaces the original 3D self‑attention in DiT and fuses input‑video tokens with noisy target tokens frame by frame, so that the generation targets can draw directly on the reference content. A parallel Camera‑Attention pathway encodes the intrinsic and extrinsic camera parameters with the PRoPE mechanism, which splits the feature dimension into three parts: the first encodes the 3D projection matrix P derived from the intrinsics and the view matrix, while the other two encode 2D rotary embeddings along the x‑ and y‑axes. This pathway injects spatial control through a zero‑initialized projection layer, so it contributes nothing at the start of training and the base model's generation quality is preserved.
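As a rough illustration of two ingredients named above, the sketch below builds a projection matrix P from intrinsics and a view matrix, and shows why a zero‑initialized projection leaves the content output untouched at initialization. It is a minimal NumPy sketch, not the paper's implementation: lifting the 3x3 intrinsics to 4x4, the residual form of the injection, and the names `projection_matrix`, `inject_camera`, and `w_proj` are all assumptions made for illustration.

```python
import numpy as np

def projection_matrix(K, world_to_cam):
    """Compose 3x3 intrinsics (lifted to 4x4, an assumption) with a 4x4 view matrix."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4 @ world_to_cam

def inject_camera(content_out, camera_out, w_proj):
    """Hypothetical residual injection of the camera pathway through a projection layer."""
    return content_out + camera_out @ w_proj

# Example pinhole intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# View matrix that places the world origin 2 units in front of the camera.
world_to_cam = np.eye(4)
world_to_cam[:3, 3] = [0.0, 0.0, 2.0]

P = projection_matrix(K, world_to_cam)

# Project the world origin (homogeneous coordinates) and dehomogenize.
point = np.array([0.0, 0.0, 0.0, 1.0])
clip = P @ point
pixel = clip[:2] / clip[2]  # lands on the principal point

# With a zero-initialized projection, the camera pathway adds nothing yet.
rng = np.random.default_rng(0)
content_out = rng.normal(size=(6, 16))
camera_out = rng.normal(size=(6, 16))
w_proj = np.zeros((16, 16))
fused = inject_camera(content_out, camera_out, w_proj)
```

Because `w_proj` starts at zero, `fused` equals `content_out`, which is the sense in which this kind of injection preserves the base model's behavior at initialization.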
Dual‑Mode Cross‑View Attention (CVA). To overcome the inconsistency of serial inference, two modules, CVA‑α and CVA‑β, are introduced. CVA‑α targets content consistency: it uses reference‑video tokens as Key/Value and the noisy tokens of each trajectory as Query, reshaping them so that tokens of the same frame but different views attend to each other, which keeps the appearance of objects the same across viewpoints. CVA‑β targets geometric consistency: it computes PRoPE‑based attention along the view dimension, so that multi‑view geometric supervision participates directly in the attention calculation. The two modules are inserted alternately into the even‑numbered DiT blocks.
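The reshaping idea behind cross‑view attention can be sketched as follows. This is a deliberately simplified NumPy sketch, assuming a token tensor of shape (views, frames, tokens per frame, channels): it groups tokens by frame so that every view's tokens for that frame share one attention sequence. It omits the reference‑video Key/Value stream, the learned Q/K/V projections, multi‑head splitting, and PRoPE, and the function name `cross_view_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(tokens):
    """Self-attention among tokens of the same frame across all views.

    tokens: array of shape (V, F, N, D) = (views, frames, tokens per frame, channels).
    Returns an array of the same shape.
    """
    V, F, N, D = tokens.shape
    # Group by frame: (V, F, N, D) -> (F, V, N, D) -> (F, V*N, D).
    seq = tokens.transpose(1, 0, 2, 3).reshape(F, V * N, D)
    # Scaled dot-product attention within each frame group.
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(D)  # (F, V*N, V*N)
    out = softmax(scores, axis=-1) @ seq                # (F, V*N, D)
    # Restore the original (V, F, N, D) layout.
    return out.reshape(F, V, N, D).transpose(1, 0, 2, 3)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4, 5, 8))  # 3 views, 4 frames, 5 tokens, 8 channels
out = cross_view_attention(tokens)

# Perturb view 1 at frame 2 and recompute.
perturbed = tokens.copy()
perturbed[1, 2] += 1.0
out_perturbed = cross_view_attention(perturbed)
```

In this sketch, changing one view's tokens at frame 2 changes the other views' outputs at frame 2 but leaves every other frame untouched, which is the coupling pattern the text describes for tokens of the same frame in different views.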
After generating multi‑view consistent videos, CameraSquad applies DA3 for depth estimation and back‑projects the results into a dynamic point cloud. Multi‑view fusion yields larger, finer point clouds that capture scene dynamics, providing high‑quality 3D world states for downstream spatial‑intelligence tasks.
Training follows a two‑stage scheme: first, low‑resolution single‑trajectory spatial control is learned; second, CVA modules are introduced for full‑resolution parallel generation, with a noise‑injection strategy to bridge the domain gap between synthetic and real data.
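The depth back‑projection and multi‑view fusion step can be illustrated with a standard pinhole model. The sketch below is an assumption‑laden illustration, not CameraSquad's pipeline: it takes a depth map as a plain array (standing in for a depth estimator's output), assumes a pinhole camera with known intrinsics and a camera‑to‑world pose, and fuses views by simple concatenation. The names `backproject` and `fuse_views` are hypothetical.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to world-space 3D points, shape (H*W, 3)."""
    H, W = depth.shape
    # u indexes columns (x), v indexes rows (y); both have shape (H, W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    # Apply the 4x4 camera-to-world transform to homogeneous points.
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[:, :3]

def fuse_views(depths, intrinsics, poses):
    """Concatenate the back-projected points of several views for one time step."""
    clouds = [backproject(d, K, T) for d, K, T in zip(depths, intrinsics, poses)]
    return np.concatenate(clouds, axis=0)

# Toy example: a flat 2x3 depth map at distance 2, seen from two cameras.
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 3), 2.0)

pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[:3, 3] = [1.0, 0.0, 0.0]  # second camera shifted one unit along x

cloud_a = backproject(depth, K, pose_a)
cloud_b = backproject(depth, K, pose_b)
fused = fuse_views([depth, depth], [K, K], [pose_a, pose_b])
```

Running this per frame would give one fused cloud per time step, which is one simple way to represent a dynamic point cloud; a real pipeline would also need to handle depth scale alignment and outlier filtering across views.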
Experimental Results
On the WebVid and HumanVid datasets, CameraSquad outperforms ReCamMaster, TrajectoryCrafter, and Gen3C in camera‑control accuracy (rotation error as low as 1.42°, translation error 3.47 px) and achieves the highest MPI and MPO scores, indicating superior cross‑view content consistency and point‑cloud matching quality. Visual quality metrics also improve: FID 30.78 and CLIP‑V 91.37 on HumanVid surpass all baselines, and VBench scores for aesthetics, imaging quality, motion smoothness, background consistency, and subject consistency reach the top ranks (e.g., motion smoothness 0.9891). Qualitative comparisons show that CameraSquad maintains identical appearance, texture, and position of objects across up to six simultaneous trajectories, whereas serial methods exhibit noticeable inconsistencies.
Conclusion
CameraSquad demonstrates that video generation can evolve from isolated frame synthesis to coherent world modeling. By decoupling content and camera attention and introducing dual‑mode cross‑view attention, it achieves precise camera control, multi‑view consistency, and high‑quality depth‑aware 3D reconstruction, advancing spatially intelligent video models for 4D reconstruction, scene understanding, and autonomous driving.
