Automating 3D Spatial Data: Holi‑Spatial’s 4M‑Scale Multimodal Dataset (ICML 2026 Oral)

Holi‑Spatial introduces a fully automatic pipeline that transforms raw video streams into high‑quality 3D geometry, depth, masks, 3D boxes, instance descriptions, grounding and spatial QA, producing the 4‑million‑item Holi‑Spatial‑4M dataset and substantially improving VLM spatial reasoning performance.

Machine Heart

Jun 18, 2026

Overview

Holi‑Spatial converts raw video into a complete hierarchy of supervision for training spatial‑intelligence models, covering geometry reconstruction, semantic annotation, 3D grounding and spatial question answering, and uses the pipeline to build the large‑scale Holi‑Spatial‑4M dataset.

Why Spatial Intelligence Lacks Data

Understanding 3D structure and relationships, such as camera motion direction, relative object positions and distances, requires fine‑grained, geometry‑constrained data. Existing manually annotated 3D datasets such as ScanNet and ScanNet++ cannot provide such data at sufficient scale or diversity.

Holi‑Spatial Pipeline

The framework consists of three automated stages.

Stage 1: Geometry Optimization

The system recovers camera intrinsics/extrinsics from video, obtains an initial dense point cloud and depth priors from a spatial foundation model, then refines the scene with 3D Gaussian Splatting and geometric regularization to ensure multi‑view depth consistency.
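As a rough illustration of what "multi‑view depth consistency" means, the sketch below lifts each pixel of one view into 3D, moves it into a second camera's frame, and checks whether the second view's depth map agrees. It is not the authors' regularizer: the function name, the pinhole intrinsics tuple, the relative pose arguments and the 5% tolerance are all assumptions made for illustration, and a real pipeline would vectorize this rather than loop in pure Python.

```python
def depth_agreement(depth_a, depth_b, intrinsics, rotation, translation, rel_tol=0.05):
    """Fraction of view-A pixels whose reprojected depth matches view B.

    depth_a, depth_b: nested lists (rows of depths), same size.
    intrinsics: (fx, fy, cx, cy) of a shared pinhole camera (assumed).
    rotation, translation: pose mapping camera-A coordinates to camera-B.
    rel_tol: hypothetical relative tolerance for calling two depths equal.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(depth_a), len(depth_a[0])
    visible = agree = 0
    for v in range(height):
        for u in range(width):
            z = depth_a[v][u]
            if z <= 0:  # skip invalid depth
                continue
            # Lift the pixel to a 3D point in camera A's frame.
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Transform into camera B's frame: q = R p + t.
            q = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                 for i in range(3)]
            if q[2] <= 0:  # behind camera B
                continue
            # Project into view B and look up the depth observed there.
            ub = round(fx * q[0] / q[2] + cx)
            vb = round(fy * q[1] / q[2] + cy)
            if not (0 <= ub < width and 0 <= vb < height):
                continue
            visible += 1
            if abs(depth_b[vb][ub] - q[2]) <= rel_tol * q[2]:
                agree += 1
    return agree / visible if visible else 0.0
```

A score near 1.0 means the two depth maps describe the same surface; a low score flags views whose geometry still disagrees after optimization.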

Stage 2: Image‑Level Open‑Vocabulary Perception

Key frames are sampled and processed by a vision‑language model (VLM) to generate open‑vocabulary categories. A dynamic category memory reuses previously identified labels across frames. SAM3 produces open‑vocabulary instance masks, which are back‑projected into 3D using the refined depth, forming initial 3D object candidates. Mask erosion and mesh‑guided depth filtering remove noisy edges and floating points.
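The back‑projection and mask‑erosion steps can be sketched in a few lines. This is a minimal pure‑Python illustration under assumed conventions (a pinhole camera with intrinsics fx, fy, cx, cy, a 4‑neighbour erosion, and nested lists for the mask and depth map); the function names are hypothetical and the article does not specify the erosion kernel or filtering thresholds the authors use.

```python
def erode_mask(mask):
    """Keep a pixel only if it and its 4 neighbours are all inside the mask.

    Pixels on the image border are dropped, since a neighbour outside the
    image counts as outside the mask.
    """
    height, width = len(mask), len(mask[0])

    def inside(v, u):
        return 0 <= v < height and 0 <= u < width and bool(mask[v][u])

    return [
        [
            inside(v, u) and inside(v - 1, u) and inside(v + 1, u)
            and inside(v, u - 1) and inside(v, u + 1)
            for u in range(width)
        ]
        for v in range(height)
    ]


def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels with valid depth to camera-frame 3D points."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if keep and z > 0:
                # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Eroding before lifting discards the boundary pixels where a mask most often bleeds onto the background, which is where floating points tend to originate.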

Stage 3: Scene‑Level Refinement

The 3D object candidates from Stage 2 are refined at the scene level in five steps.

Multi‑view merging based on category and 3D IoU to eliminate duplicate instances.

Ground/gravity alignment to enforce vertical consistency.

Confidence filtering to keep high‑confidence instances.

VLM Agent re‑verification with zoom‑in and re‑segmentation tools.

Fine‑grained caption generation and construction of 3D grounding and spatial QA samples.

The output is not merely a reconstructed scene but a set of multimodal supervision signals ready for training spatial‑intelligence models.
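The merging and confidence‑filtering steps of Stage 3 can be illustrated with a small sketch. It assumes axis‑aligned boxes stored as (min_xyz, max_xyz) tuples, a greedy merge in descending score order, and placeholder thresholds of 0.25 IoU and 0.5 confidence; none of these choices are confirmed by the article, and the actual pipeline may use oriented boxes or a different merge rule.

```python
def box_volume(box):
    """Volume of an axis-aligned box (min_xyz, max_xyz); 0 if degenerate."""
    lo, hi = box
    volume = 1.0
    for i in range(3):
        volume *= max(0.0, hi[i] - lo[i])
    return volume


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = [max(a[0][i], b[0][i]) for i in range(3)]
    hi = [min(a[1][i], b[1][i]) for i in range(3)]
    inter = box_volume((lo, hi))
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def merge_candidates(candidates, iou_threshold=0.25, min_score=0.5):
    """Greedily merge same-category candidates, then drop low-score ones.

    Each candidate is a dict with "category", "box" and "score" keys
    (a hypothetical schema used only for this sketch).
    """
    merged = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        for inst in merged:
            same_category = inst["category"] == cand["category"]
            if same_category and iou_3d(inst["box"], cand["box"]) >= iou_threshold:
                # Duplicate view of the same object: grow the kept box
                # to the union of both extents.
                lo = tuple(min(inst["box"][0][i], cand["box"][0][i]) for i in range(3))
                hi = tuple(max(inst["box"][1][i], cand["box"][1][i]) for i in range(3))
                inst["box"] = (lo, hi)
                break
        else:
            merged.append(dict(cand))  # first sighting of this object
    return [inst for inst in merged if inst["score"] >= min_score]
```

Requiring a category match before comparing boxes keeps a table and the chair tucked under it from being fused just because their boxes overlap.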

Holi‑Spatial‑4M Dataset

Using the pipeline, the authors assembled Holi‑Spatial‑4M, a 4‑million‑scale dataset sourced from ScanNet, ScanNet++ and DL3DV‑10K. Unlike closed‑category datasets, Holi‑Spatial‑4M leverages VLM open‑world knowledge to cover long‑tail, fine‑grained indoor object categories.

Experimental Results

Data quality was evaluated on ScanNet, ScanNet++ and DL3DV‑10K by comparing depth F1, 2D segmentation IoU and 3D detection AP25/AP50 against manually annotated ground truth. Holi‑Spatial achieved superior scores (e.g., on ScanNet++: Depth F1 0.89, 2D IoU 0.64, 3D AP25/AP50 81.06/70.05) and consistently outperformed baselines such as LangSplat, M3‑Spatial, SA2VA and LLaVA‑3D across all three metrics.

Fine‑tuning Qwen3‑VL models on Holi‑Spatial‑4M yielded large gains on spatial QA benchmarks (MMSI‑Bench, MindCube, ViewSpatial, SparBench‑tiny). On the ScanNet++ 3D grounding task, Qwen3‑VL‑8B improved AP50 from 13.50 to 27.98 (+14.48), and AP15/AP25 also rose markedly. The authors attribute these gains to the 1.2M 3D grounding samples, which provide strong supervision for cross‑view and occlusion handling.

Key Insight: Automated Data Flywheel

The most notable contribution is turning “spatial data production” into an automated flywheel: raw video → geometry → semantic labels → VLM‑enhanced QA, which can continuously scale as more video becomes available, reducing reliance on costly manual scanning.
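The last hop of the flywheel, from annotated 3D instances to QA pairs, can be pictured with a template‑based sketch. The instance schema, the question wording and the assumption that box coordinates are in meters are all hypothetical; the article does not describe the authors' QA templates or their VLM‑based enhancement.

```python
import math
from itertools import combinations


def box_center(box):
    """Center of an axis-aligned box given as (min_xyz, max_xyz)."""
    lo, hi = box
    return tuple((lo[i] + hi[i]) / 2 for i in range(3))


def make_distance_qa(instances):
    """Build one distance question per pair of captioned 3D instances.

    Each instance is a dict with "caption" and "box" keys (hypothetical
    schema); coordinates are assumed to be metric.
    """
    samples = []
    for a, b in combinations(instances, 2):
        distance = math.dist(box_center(a["box"]), box_center(b["box"]))
        samples.append({
            "question": (
                f"How far apart are the {a['caption']} and the "
                f"{b['caption']}, in meters?"
            ),
            "answer": f"{distance:.2f}",
        })
    return samples
```

Because the answer is computed from geometry rather than written by hand, every new reconstructed scene yields additional QA supervision at no annotation cost, which is the property the flywheel relies on.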

Limitations

The system depends on multiple upstream models and per‑scene optimization, leading to high computational cost. Performance may degrade on videos with limited viewpoints, motion blur, severe occlusion, or many dynamic objects. Open‑vocabulary semantic labels can inherit biases from the underlying foundation models, indicating a need for more robust verification and uncertainty estimation.

Conclusion

Holi‑Spatial demonstrates that combining state‑of‑the‑art geometry models, VLMs, segmentation models and 3D optimization can automatically convert raw video into structured, trainable spatial data. This suggests that future improvements in spatial‑intelligence models will stem not only from larger models but also from stronger data‑construction pipelines.

