Automating 3D Spatial Data: Holi‑Spatial’s 4M‑Scale Multimodal Dataset (ICML 2026 Oral)
Holi‑Spatial introduces a fully automatic pipeline that transforms raw video streams into high‑quality 3D geometry, depth, masks, 3D boxes, instance descriptions, grounding and spatial QA, producing the 4‑million‑item Holi‑Spatial‑4M dataset and substantially improving VLM spatial reasoning performance.
Overview
Holi‑Spatial converts raw video into a complete hierarchy of supervision for training spatial‑intelligence models, covering geometry reconstruction, semantic annotation, 3D grounding and spatial question answering, and uses the pipeline to build the large‑scale Holi‑Spatial‑4M dataset.
Why Spatial Intelligence Lacks Data
Understanding 3D structure and relationships (e.g., camera motion direction, object relative positions, distances) requires fine‑grained, geometry‑constrained data, which existing manually annotated 3D datasets such as ScanNet and ScanNet++ cannot provide at scale or with sufficient diversity.
Holi‑Spatial Pipeline
The framework consists of three automated stages.
Stage 1: Geometry Optimization
The system recovers camera intrinsics/extrinsics from video, obtains an initial dense point cloud and depth priors from a spatial foundation model, then refines the scene with 3D Gaussian Splatting and geometric regularization to ensure multi‑view depth consistency.
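One common way to measure multi-view depth consistency is to reproject one view's depth map into a neighboring view and compare it with the depth observed there. The summary does not specify which regularizer Holi-Spatial uses, so the sketch below is only an illustration: the function name `reprojection_depth_error`, the shared pinhole intrinsics `K`, and the nearest-pixel sampling are all assumptions.

```python
import numpy as np

def reprojection_depth_error(depth_a, depth_b, K, T_a2b):
    """Relative depth disagreement after reprojecting view A into view B.

    depth_a, depth_b: (H, W) z-depth maps along the optical axis.
    K: (3, 3) pinhole intrinsics, assumed shared by both views.
    T_a2b: (4, 4) rigid transform from camera-A to camera-B coordinates.
    Returns an (H, W) array; NaN marks pixels with no valid match in B.
    """
    h, w = depth_a.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project every pixel of A to a 3D point in A's camera frame.
    pts_a = (pix @ np.linalg.inv(K).T) * depth_a.reshape(-1, 1)
    # Move the points into B's camera frame.
    pts_b = pts_a @ T_a2b[:3, :3].T + T_a2b[:3, 3]
    z_b = pts_b[:, 2]

    err = np.full(h * w, np.nan)
    front = z_b > 1e-6                      # keep points in front of camera B
    proj = pts_b[front] @ K.T
    ub = np.rint(proj[:, 0] / proj[:, 2]).astype(int)
    vb = np.rint(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (ub >= 0) & (ub < w) & (vb >= 0) & (vb < h)

    idx = np.flatnonzero(front)[inside]
    observed = depth_b[vb[inside], ub[inside]]
    valid = observed > 0                    # skip holes in B's depth map
    err[idx[valid]] = np.abs(z_b[idx[valid]] - observed[valid]) / observed[valid]
    return err.reshape(h, w)
```

A low error across many view pairs suggests the refined depth is geometrically coherent; a regularizer could penalize the mean of this map, although whether the authors do so is not stated here.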
Stage 2: Image‑Level Open‑Vocabulary Perception
Key frames are sampled and processed by a vision‑language model (VLM) to generate open‑vocabulary categories. A dynamic category memory reuses previously identified labels across frames. SAM3 produces open‑vocabulary instance masks, which are back‑projected into 3D using the refined depth, forming initial 3D object candidates. Mask erosion and mesh‑guided depth filtering remove noisy edges and floating points.
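The back-projection step can be sketched with a standard pinhole camera model. This is a minimal illustration and not the authors' code: `erode_mask`, `backproject_mask`, the 3x3 cross-shaped erosion, and the `cam_to_world` pose convention are assumptions, and the mesh-guided depth filtering is reduced here to discarding non-positive depths.

```python
import numpy as np

def erode_mask(mask, iterations=1):
    """Binary erosion with a 3x3 cross (4-neighborhood), using NumPy only."""
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1, constant_values=False)
        # A pixel survives only if it and its four neighbors are all set.
        m = p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return m

def backproject_mask(mask, depth, K, cam_to_world, erosion_iters=1):
    """Lift the masked pixels of one frame to world-space 3D points.

    mask: (H, W) boolean instance mask.
    depth: (H, W) z-depth map; values <= 0 are treated as invalid.
    K: (3, 3) pinhole intrinsics. cam_to_world: (4, 4) camera pose.
    """
    # Erode first so unreliable boundary pixels are not lifted into 3D.
    m = erode_mask(mask, erosion_iters) & (depth > 0)
    v, u = np.nonzero(m)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
```

Eroding before lifting matters because depth values along mask boundaries often mix foreground and background, which is one source of the floating points the filtering step removes.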
Stage 3: Scene‑Level Refinement
Multi‑view merging based on category and 3D IoU to eliminate duplicate instances.
Ground/gravity alignment to enforce vertical consistency.
Confidence filtering to keep high‑confidence instances.
VLM Agent re‑verification with zoom‑in and re‑segmentation tools.
Fine‑grained caption generation and construction of 3D grounding and spatial QA samples.
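The first step above, merging duplicates by category and 3D IoU, can be sketched as a greedy pass over candidate instances. The sketch assumes axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax), a per-instance `score`, and an illustrative threshold of 0.25; the names `iou_3d` and `merge_instances` are hypothetical, and the actual pipeline may use oriented boxes or a different merging rule.

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    union = np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter
    return inter / union if union > 0 else 0.0

def merge_instances(instances, iou_thresh=0.25):
    """Greedily merge same-category instances whose boxes overlap enough.

    instances: list of dicts with keys 'category', 'box', 'score'.
    Higher-scoring instances are visited first and absorb later duplicates.
    """
    merged = []
    for inst in sorted(instances, key=lambda d: d["score"], reverse=True):
        box = np.asarray(inst["box"], float)
        for kept in merged:
            same_cat = kept["category"] == inst["category"]
            if same_cat and iou_3d(kept["box"], box) >= iou_thresh:
                # Grow the kept box to enclose the duplicate observation.
                kept["box"] = np.concatenate([
                    np.minimum(kept["box"][:3], box[:3]),
                    np.maximum(kept["box"][3:], box[3:]),
                ])
                break
        else:
            merged.append({**inst, "box": box})
    return merged
```

Requiring a category match before comparing boxes keeps overlapping objects of different types, such as a cushion on a sofa, from being collapsed into one instance.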
The output is not merely a reconstructed scene but a set of multimodal supervision signals ready for training spatial‑intelligence models.
Holi‑Spatial‑4M Dataset
Using the pipeline, the authors assembled Holi‑Spatial‑4M, a 4‑million‑scale dataset sourced from ScanNet, ScanNet++ and DL3DV‑10K. Unlike closed‑category datasets, Holi‑Spatial‑4M leverages VLM open‑world knowledge to cover long‑tail, fine‑grained indoor object categories.
Experimental Results
Data quality was evaluated on ScanNet, ScanNet++ and DL3DV‑10K by comparing depth F1, 2D segmentation IoU and 3D detection AP25/AP50 against manually annotated ground truth. Holi‑Spatial achieved superior scores (e.g., on ScanNet++: Depth F1 0.89, 2D IoU 0.64, 3D AP25/AP50 81.06/70.05) and consistently outperformed baselines such as LangSplat, M3‑Spatial, SA2VA and LLaVA‑3D across all three metrics.
Fine‑tuning Qwen3‑VL models with Holi‑Spatial‑4M yielded large gains on spatial QA benchmarks (MMSI‑Bench, MindCube, ViewSpatial, SparBench‑tiny). For the ScanNet++ 3D grounding task, Qwen3‑VL‑8B improved AP50 from 13.50 to 27.98 (+14.48), and AP15/AP25 also rose markedly. The authors attribute these gains to the 1.2 M 3D grounding samples providing strong supervision for cross‑view and occlusion handling.
Key Insight: Automated Data Flywheel
The most notable contribution is turning “spatial data production” into an automated flywheel: raw video → geometry → semantic labels → VLM‑enhanced QA, which can continuously scale as more video becomes available, reducing reliance on costly manual scanning.
Limitations
The system depends on multiple upstream models and per‑scene optimization, leading to high computational cost. Performance may degrade on videos with limited viewpoints, motion blur, severe occlusion, or many dynamic objects. Open‑vocabulary semantic labels can inherit biases from the underlying foundation models, indicating a need for more robust verification and uncertainty estimation.
Conclusion
Holi‑Spatial demonstrates that combining state‑of‑the‑art geometry models, VLMs, segmentation models and 3D optimization can automatically convert raw video into structured, trainable spatial data, suggesting future improvements in spatial‑intelligence models will stem not only from larger models but also from stronger data‑construction pipelines.
