How Split4D Reconstructs 4D Dynamic Scenes Without Video Segmentation
Split4D introduces a new paradigm for 4D scene reconstruction that eliminates the need for video segmentation. By combining single‑frame 2D masks, a freetime feature‑learning strategy, streaming sampling, and 3D tracking regularization, it achieves high‑quality, object‑level decoupled representations across complex dynamic scenes.
Background: The rapid rise of 3D Gaussian Splatting (3DGS) has spurred interest in extending static scene reconstruction to dynamic 4D scenarios, yet existing pipelines rely on fragile video segmentation and tracking, leading to drift and multi‑view inconsistencies.
Problem: Video‑based methods suffer from fragile upstream dependencies, label inconsistency across views, and error propagation from 2D to 4D, making object‑level decoupling unreliable.
Solution – Split4D: Split4D reformulates the task as a "single‑frame" problem. It uses accurate 2D segmentation masks (e.g., from SAM) as supervision and introduces three core modules:
Freetime FeatureGS: Extends each Gaussian primitive with a velocity vector v and a learnable feature vector, turning each Gaussian into an "identity‑bearing moving particle" that can model linear motion in short time windows.
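A minimal sketch of what such a primitive could look like, assuming a simple linear motion model x(t) = x₀ + v·(t − t₀); the class and attribute names are illustrative, not taken from the paper's code:

```python
import numpy as np

class FreetimeGaussian:
    """Hypothetical 'Freetime' Gaussian: a splat that carries a velocity
    and an identity feature in addition to its 3D center."""

    def __init__(self, mean, velocity, feature, t0=0.0):
        self.mean = np.asarray(mean, dtype=float)          # 3D center at reference time t0
        self.velocity = np.asarray(velocity, dtype=float)  # linear velocity v
        self.feature = np.asarray(feature, dtype=float)    # learnable identity feature
        self.t0 = t0

    def position_at(self, t):
        # Linear motion within a short time window: x(t) = x0 + v * (t - t0)
        return self.mean + self.velocity * (t - self.t0)

g = FreetimeGaussian(mean=[0.0, 0.0, 0.0], velocity=[1.0, 0.0, 0.0], feature=[0.2, 0.8])
g.position_at(0.5)  # x(0.5) = [0.5, 0.0, 0.0]
```

The key idea is that the identity feature travels with the moving particle, so object membership stays attached to geometry as it moves.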
Streaming Sampling Strategy: Enforces ordered, time‑consistent sampling and feature propagation, preventing abrupt feature changes and ensuring temporal coherence.
3D Tracking Regularization & DINOv2 Semantic Guidance: Regularizes motion trajectories and leverages semantic cues to avoid feature‑chain breaks, especially under occlusion or object disappearance.
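One plausible form of trajectory regularization, sketched under the assumption that it penalizes acceleration (second finite differences) along each Gaussian's 3D track; the exact term used by Split4D may differ:

```python
import numpy as np

def tracking_regularizer(trajectory):
    """Illustrative smoothness penalty on a 3D trajectory: mean squared
    second finite difference (acceleration). Linear motion scores zero."""
    traj = np.asarray(trajectory, dtype=float)      # (T, 3) positions over time
    accel = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]  # discrete acceleration
    return float(np.mean(np.sum(accel ** 2, axis=-1)))

# A perfectly linear trajectory incurs no penalty:
linear = [[float(t), 0.0, 0.0] for t in range(5)]
tracking_regularizer(linear)  # 0.0
```

Such a term discourages erratic jumps in a Gaussian's track, which is one way a 3D prior can keep identity features from breaking across occlusions.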
The method also employs a contrastive loss that pulls the Gaussian‑projected features of the same 2D mask together, extending the 2D supervision into a 4D signal.
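A hedged sketch of a mask-driven contrastive loss of this kind. The source only states that same-mask features are pulled together; the margin-based push term for different masks is a common companion and is an assumption here, as are all names:

```python
import numpy as np

def mask_contrastive_loss(features, mask_ids, margin=1.0):
    """Illustrative contrastive loss over Gaussian-projected features.
    features: (N, D) array; mask_ids: (N,) 2D mask labels per feature.
    Pulls same-mask pairs together; pushes different-mask pairs apart
    up to a margin (the push term is an assumed, typical formulation)."""
    feats = np.asarray(features, dtype=float)
    ids = np.asarray(mask_ids)
    pull, push, n_pull, n_push = 0.0, 0.0, 0, 0
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            d = np.linalg.norm(feats[i] - feats[j])
            if ids[i] == ids[j]:
                pull += d ** 2          # same mask: minimize distance
                n_pull += 1
            else:
                push += max(0.0, margin - d) ** 2  # different masks: keep apart
                n_push += 1
    loss = 0.0
    if n_pull:
        loss += pull / n_pull
    if n_push:
        loss += push / n_push
    return loss
```

Because the loss is defined on features projected from 3D Gaussians, the 2D mask supervision is effectively lifted into a 4D signal, as the summary notes.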
Training Strategy: Combines single‑frame contrastive learning with a "relay"‑style feature update across frames, ensuring smooth temporal feature evolution.
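One way a "relay"-style update could work is a momentum blend that carries each identity feature forward frame to frame; this is a guess at the mechanism, not the paper's formula:

```python
def relay_update(prev_feature, new_feature, momentum=0.9):
    """Hypothetical relay-style feature update: blend the previous frame's
    feature with the current frame's estimate so identities evolve smoothly."""
    return [momentum * p + (1.0 - momentum) * n
            for p, n in zip(prev_feature, new_feature)]

relay_update([1.0, 0.0], [0.0, 1.0])  # roughly [0.9, 0.1]
```

A high momentum keeps features stable across frames, which matches the stated goal of preventing abrupt feature changes.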
Experiments: Evaluated on Neural3DV, Multi‑Human, and SelfCap, and compared against SA4D, SADG, and OmniSeg3D. Split4D consistently outperforms baselines on key metrics, delivering clearer mask boundaries and superior decoupling quality in fast‑moving and interaction‑heavy scenes.
Applications: Demonstrates high‑quality 4D scene editing (object removal, duplication, motion editing), monocular street‑scene decomposition, and potential for autonomous‑driving data annotation using only RGB and LiDAR with SAM masks.
Conclusion: Split4D represents a breakthrough in 4D dynamic scene understanding by replacing video‑segmentation pipelines with a single‑frame, 3D‑correspondence‑driven approach, opening new avenues for AIGC content creation and digital‑human applications.
