CineMaster: A 3D‑Aware and Controllable Framework for Cinematic Text‑to‑Video Generation
CineMaster is a 3D-aware, controllable text-to-video generation framework, presented in a SIGGRAPH 2025 paper, that lets users specify target objects and camera motions through an interactive workflow, enabling cinematic, user-directed video creation.
Sora, Keling, and other video generation models have shown impressive performance, allowing creators to produce high-quality videos from text alone. However, traditional filmmaking involves directors arranging multiple moving targets and camera angles within a scene, a capability that current text-to-video models lack.
To address this gap, the Keling research team proposes CineMaster, a movie‑grade text‑to‑video generation framework accepted to SIGGRAPH 2025. CineMaster enables users to control both 3D objects and camera motion through an interactive workflow, allowing professional‑level scene layout and motion specification.
Paper title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Paper URL: https://arxiv.org/abs/2502.08639
Project page: https://cinemaster-dev.github.io/
1. Joint Object‑Camera Control
Figure panels: a) joint object–camera control; b) object motion control; c) camera motion control. CineMaster can generate videos that follow fine-grained multimodal control signals, supporting large-scale object and camera movements.
2. CineMaster Framework
The framework follows a two‑stage workflow:
Stage 1: Users interactively adjust 3D bounding boxes and camera positions in a 3D space, exporting camera trajectories and per‑frame depth maps as conditioning signals.
Stage 2: A semantic layout ControlNet integrates object motion signals and class labels, while a Camera Adapter incorporates global camera motion, enabling precise control over each target’s movement.
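To make Stage 1's export step concrete, here is a minimal sketch of rendering a per-frame depth map from a user-placed 3D bounding box. The function names, the uniform point sampling, and the axis-aligned box are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of Stage 1's export: splatting a user-placed 3D box
# into a z-buffered depth map, given camera intrinsics and extrinsics.
# Sampling scheme and names are assumptions for illustration only.
import numpy as np

def box_corners(center, size):
    """8 corners of an axis-aligned 3D box (world coordinates)."""
    cx, cy, cz = center
    sx, sy, sz = (s / 2 for s in size)
    return np.array([[cx + dx * sx, cy + dy * sy, cz + dz * sz]
                     for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)])

def render_box_depth(corners, extrinsic, K, hw=(64, 64), n=4000):
    """Splat points sampled inside the box into a z-buffered depth map."""
    h, w = hw
    depth = np.full((h, w), np.inf)
    # Sample points uniformly inside the box volume (coarse approximation).
    lo, hi = corners.min(0), corners.max(0)
    pts = np.random.uniform(lo, hi, size=(n, 3))
    pts_h = np.c_[pts, np.ones(n)]                 # homogeneous coordinates
    cam = (extrinsic @ pts_h.T).T[:, :3]           # world -> camera frame
    cam = cam[cam[:, 2] > 0]                       # keep points ahead of camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                    # perspective divide
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), cam[:, 2]
    ok = (0 <= u) & (u < w) & (0 <= v) & (v < h)
    for ui, vi, zi in zip(u[ok], v[ok], z[ok]):
        depth[vi, ui] = min(depth[vi, ui], zi)     # z-buffer: keep nearest
    return depth
```

Re-rendering the same box under the camera pose of each frame yields the per-frame depth-map sequence that conditions Stage 2.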
3. Training Data Construction Pipeline
Detect objects with the open-vocabulary detector Grounding DINO, prompted by Qwen2-VL-generated descriptions, and segment them across frames with SAM 2.
Estimate absolute depth with DepthAnything V2.
Compute each object's 3D bounding box by projecting its mask through the depth map, using the frame where the object's mask area is largest.
Use SpatialTracker for 3D point tracking and obtain camera trajectories via MonST3R.
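The 3D-box step above can be sketched as lifting the masked pixels through the depth map and bounding the resulting point cloud. The pinhole intrinsics and the axis-aligned simplification are assumptions for illustration, not the paper's exact procedure:

```python
# Minimal sketch: back-project a segmentation mask through a depth map
# with pinhole intrinsics K, then bound the point cloud with min/max
# extents. Axis-aligned box is a simplifying assumption.
import numpy as np

def mask_to_3d_box(mask, depth, K):
    """Lift masked pixels to 3D camera coordinates and bound them."""
    v, u = np.nonzero(mask)                 # pixel coords inside the mask
    z = depth[v, u]
    valid = z > 0                           # drop invalid depth readings
    u, v, z = u[valid], v[valid], z[valid]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * z                   # pinhole back-projection
    y = (v - cy) / fy * z
    pts = np.stack([x, y, z], axis=1)
    return pts.min(axis=0), pts.max(axis=0) # axis-aligned box corners
```

Choosing the frame with the largest mask area gives the densest point cloud, which makes the min/max extents least sensitive to depth noise at the object boundary.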
4. Comparison Results
Compared with baseline methods, CineMaster uniquely associates motion conditions with specific targets and decouples object and camera motion, producing higher‑quality videos that satisfy textual prompts and control signals.
5. Conclusion
The authors aim to provide powerful 3D‑aware controllable video generation, allowing users to act like professional directors. They designed an interactive 3D workflow, built a multimodal conditional video generation model, and created a data pipeline for extracting 3D control signals from arbitrary videos, offering valuable insights for the research community.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.