Can Internet Videos Replace 3D Annotations? Introducing SceneVerse++ – the Largest Real‑World 3D Scene Dataset

The BIGAI team presents SceneVerse++, a massive real‑world indoor 3D scene dataset built from unlabelled internet videos via an automated pipeline, and demonstrates substantial zero‑shot and fine‑tuned performance gains on 3D detection, spatial VQA, and vision‑language navigation tasks.

Machine Heart

Apr 30, 2026

Background

3D scene understanding is a core foundation for embodied AI, robotics, and AR, yet high‑quality 3D annotations remain scarce because they require costly RGB‑D or LiDAR capture, reconstruction, and manual labeling. Existing datasets (ScanNet, ARKitScenes, MultiScan) have not achieved a true scale breakthrough.

Automated Data Engine

The authors design an end‑to‑end automated engine that ingests raw internet "Room Tour" videos and produces fully annotated 3D scenes. The pipeline consists of:

Shot segmentation & filtering: TransNetV2 detects shots; short clips, black screens, noisy frames, people, and outdoor scenes are discarded.

Disparity‑based keyframe extraction: Keyframes are chosen by disparity to ensure stable triangulation while reducing redundancy.

Dense pixel matching & global bundle adjustment: Dense matching followed by global bundle adjustment (BA) yields robust camera poses and sparse point clouds; for videos longer than 300 frames, a pseudo‑track pixel strategy and image‑similarity weighting mitigate false‑positive matches.

Quality control: Scenes with insufficient spatial coverage or abnormal SfM results are filtered; a quick manual review (~10 s per scene) keeps the cost low.
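The disparity‑based keyframe step can be illustrated with a minimal sketch. It assumes that feature tracks are already available as per‑frame pixel positions of the same features, and it promotes a frame to keyframe once the median displacement from the previous keyframe reaches a threshold. The function names, the input layout, and the 20‑pixel default are hypothetical; the article does not state the actual criterion or threshold used by the authors.

```python
import math
from statistics import median


def median_disparity(pts_a, pts_b):
    """Median pixel displacement between matched points of two frames."""
    return median(
        math.hypot(xb - xa, yb - ya)
        for (xa, ya), (xb, yb) in zip(pts_a, pts_b)
    )


def select_keyframes(tracks, min_disparity=20.0):
    """Pick keyframe indices from per-frame feature tracks.

    tracks[i] is a list of (x, y) pixel positions in frame i, with the same
    feature at the same list position in every frame (an assumption of this
    sketch). min_disparity is a hypothetical threshold in pixels.
    """
    if not tracks:
        return []
    keyframes = [0]  # always keep the first frame
    for i in range(1, len(tracks)):
        # Compare against the most recent keyframe, not the previous frame,
        # so slow camera motion accumulates until there is enough parallax.
        if median_disparity(tracks[keyframes[-1]], tracks[i]) >= min_disparity:
            keyframes.append(i)
    return keyframes
```

Comparing each frame with the last accepted keyframe is what removes redundancy: near‑static stretches of video contribute no keyframes, while every accepted pair has enough baseline for triangulation.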

SceneVerse++ Dataset

Starting from 8 217 internet video clips, the pipeline produces 6 687 real indoor 3D scenes, surpassing ScanNet, ARKitScenes, and MultiScan in scene count, total area, object categories, and object count. Because the source videos are long, the dataset naturally includes multi‑floor, multi‑room, and large‑scale complex layouts absent from traditional room‑level scans.

Core Module 1: Automated 3D Reconstruction & Instance Segmentation

Beyond sparse SfM, the team adds a dense reconstruction + instance‑segmentation pipeline. Dense reconstruction uses Prior Depth Anything to predict metric depth maps, which are fused into a TSDF volume and filtered for noise. Instance segmentation first obtains 2D masks per frame, aggregates them into 3D via view and spatial consistency, and finally generates textual descriptions and ScanNet‑style category labels using DescribeAnything and Qwen‑VL.
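To make the TSDF fusion step concrete, here is a minimal sketch of the standard weighted running‑average update, reduced to the voxels along a single camera ray. It is not the authors' implementation: the function name, the list‑based voxel layout, and the 0.2 m truncation distance are assumptions chosen for illustration.

```python
def tsdf_update(tsdf, weight, voxel_depths, observed_depth, trunc=0.2):
    """Fuse one metric depth observation into the voxels along one ray.

    tsdf, weight: per-voxel lists, updated in place.
    voxel_depths: depth of each voxel centre along the ray, in metres.
    observed_depth: depth predicted for this pixel, in metres.
    trunc: truncation distance in metres (hypothetical value).
    """
    for i, z in enumerate(voxel_depths):
        sdf = observed_depth - z  # positive in front of the surface
        if sdf < -trunc:
            continue  # far behind the surface: occluded, leave untouched
        d = min(1.0, sdf / trunc)  # truncate and normalise to [-1, 1]
        w = weight[i]
        # Running average over all observations that touched this voxel.
        tsdf[i] = (tsdf[i] * w + d) / (w + 1.0)
        weight[i] = w + 1.0
    return tsdf, weight
```

Averaging many noisy depth predictions per voxel is what lets the fused volume smooth out per‑frame depth errors; the surface is then read off where the fused value crosses zero.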

Core Module 2: Structured Spatial VQA Generation

From the geometry and semantics, a scene graph is built where nodes are object instances and edges encode pairwise spatial relations. Using the VLM‑3R template, seven question types are automatically generated (object counting, size, relative/absolute distance, direction, room size, route planning), yielding 632 K VQA samples (391 K multiple‑choice, 241 K fill‑in‑the‑blank) compatible with the VSI‑Bench format.
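The scene‑graph idea can be sketched as follows: object instances become nodes, pairwise centre distances become edge attributes, and question templates read their answers from the graph. The data layout, function names, and question wording below are hypothetical and only cover two of the seven question types; they are not the VLM‑3R templates themselves.

```python
import math
from collections import Counter
from itertools import combinations


def build_scene_graph(objects):
    """objects: list of dicts with 'id', 'label', 'center' (x, y, z in metres).

    Returns (nodes, edges); edges maps an id pair to the centre distance.
    """
    nodes = {o["id"]: o for o in objects}
    edges = {
        (a["id"], b["id"]): math.dist(a["center"], b["center"])
        for a, b in combinations(objects, 2)
    }
    return nodes, edges


def counting_questions(nodes):
    """One fill-in-the-blank counting question per object category."""
    counts = Counter(o["label"] for o in nodes.values())
    return [
        {"type": "object_count",
         "question": f"How many {label}(s) are in the scene?",
         "answer": n}
        for label, n in sorted(counts.items())
    ]


def relative_distance_question(nodes, edges, anchor_id, candidate_ids):
    """Multiple-choice question: which candidate is closest to the anchor?"""
    def dist(i, j):
        return edges[(i, j)] if (i, j) in edges else edges[(j, i)]

    answer = min(candidate_ids, key=lambda c: dist(anchor_id, c))
    return {
        "type": "relative_distance",
        "question": f"Which object is closest to the {nodes[anchor_id]['label']}?",
        "options": list(candidate_ids),
        "answer": answer,
    }
```

Because every answer is computed from the reconstructed geometry rather than written by hand, the number of questions scales with the number of scenes and object pairs, which is how a corpus of this size becomes feasible.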

Core Module 3: VLN Data Generation from Real Videos

Real video trajectories are not directly suitable for navigation because they contain redundant rotations and non‑forward motions. The authors therefore apply a three‑stage pipeline:

Trajectory preprocessing: Remove redundant local rotations and split overly long paths into sub‑paths.

Action encoding: Project SfM camera poses onto the ground plane and discretize forward steps (25/50/75 cm) and rotations (15°/30°/45°), discarding "look‑only" actions.

Instruction generation: A chain‑of‑thought VLM first describes local actions, then composes a natural‑language instruction for the whole segment; three stylistic variants are produced per trajectory.
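The action‑encoding stage can be sketched as below. The sketch assumes poses are given as (x, y, z, yaw) with y pointing up, so dropping y projects them onto the ground plane; each consecutive pose pair is then snapped to the nearest forward bin (25/50/75 cm) and rotation bin (15°/30°/45°). The pose layout, the function names, the half‑bin cut‑off, and the choice to treat any ground‑plane displacement as a forward step are assumptions of this sketch, not details given in the article. Pose changes with neither a significant turn nor a significant ground‑plane move produce no action, which is one possible reading of discarding "look‑only" actions.

```python
import math

FORWARD_BINS_CM = (25, 50, 75)
TURN_BINS_DEG = (15, 30, 45)


def snap(value, bins):
    """Nearest bin, or None when value is below half the smallest bin."""
    if value < bins[0] / 2:
        return None
    return min(bins, key=lambda b: abs(b - value))


def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def encode_actions(poses):
    """poses: list of (x, y, z, yaw_deg) camera poses, y up (assumption).

    Returns a list of (action_name, magnitude) tuples.
    """
    actions = []
    for (x0, _, z0, yaw0), (x1, _, z1, yaw1) in zip(poses, poses[1:]):
        turn = wrap_deg(yaw1 - yaw0)
        turn_bin = snap(abs(turn), TURN_BINS_DEG)
        if turn_bin is not None:
            actions.append(("turn_left" if turn > 0 else "turn_right", turn_bin))
        # Ground-plane displacement only: height changes are ignored.
        step_cm = math.hypot(x1 - x0, z1 - z0) * 100.0
        step_bin = snap(step_cm, FORWARD_BINS_CM)
        if step_bin is not None:
            actions.append(("forward", step_bin))
    return actions
```

Values above the largest bin are clamped to it in this sketch; a fuller version would split a large move into several discrete steps.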

The pipeline yields 9 631 trajectories (average length 12.8 m, 15 steps) covering 7 189 distinct scenes and 21 567 instructions.

Experimental Results

3D Detection & Instance Segmentation: SpatialLM (MLLM‑based) is used for detection and Mask3D for segmentation. Pre‑training on SceneVerse++ improves zero‑shot detection on ScanNet and, after fine‑tuning, raises the two reported detection scores from 38.0 to 58.6 (+20.6) and from 28.7 to 45.4 (+16.7). Mask3D benefits from fine‑tuning (AP25 15.4 → 38.5) but shows limited zero‑shot transfer, highlighting its sensitivity to segment‑level biases.

3D Spatial VQA: On VSI‑Bench, pre‑training on SceneVerse++ raises the average accuracy of Qwen2.5‑VL‑3B by 14.9 points (27.9 → 42.8) and of Qwen2.5‑VL‑7B by 9.8 points (36.6 → 46.4). Gains are especially large for object counting, size, and relative distance questions. Cross‑domain tests on ARKitScenes show comparable or slightly better performance than ScanNet‑based pre‑training, indicating good generalisation.

Vision‑Language Navigation (VLN): Pre‑training on SceneVerse++ raises R2R zero‑shot success rate (SR) from 0.088 to 0.107 and, after fine‑tuning, to 0.228, a 159 % relative increase over the 0.088 baseline. Ablations reveal that both trajectory refinement (TR) and instruction enrichment (IE) are essential; removing IE hurts SR more severely (down to 0.022 zero‑shot). Simply mixing R2R data with SceneVerse++ (R2R+SV++) underperforms the two‑stage pre‑train → fine‑tune strategy, confirming a domain gap.
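The reported gains mix two conventions: the detection and VQA improvements are absolute differences in score, while the 159 % figure for navigation is a relative increase over the baseline success rate. The short check below recomputes both from the numbers quoted above; the helper names are illustrative.

```python
def absolute_gain(before, after):
    """Difference in score, in the same units as the inputs."""
    return round(after - before, 3)


def relative_gain_pct(before, after):
    """Increase expressed as a percentage of the starting value."""
    return round(100.0 * (after - before) / before, 1)


# Detection and VQA: absolute differences.
print(absolute_gain(38.0, 58.6))   # 20.6
print(absolute_gain(28.7, 45.4))   # 16.7
print(absolute_gain(27.9, 42.8))   # 14.9
print(absolute_gain(36.6, 46.4))   # 9.8

# Navigation success rate: relative increase over the 0.088 baseline.
print(relative_gain_pct(0.088, 0.228))  # 159.1
print(relative_gain_pct(0.088, 0.107))  # 21.6
```

In absolute terms the fine‑tuned navigation gain is 0.14 in success rate, so the relative figure should not be compared directly with the point gains on the other tasks.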

Conclusions and Future Directions

The study shows that a carefully engineered automated engine can turn massive unlabelled internet videos into high‑quality, multi‑task 3D scene data. According to the authors, real video priors boost detection, VQA, and navigation performance more than synthetic scans. They stress the importance of scalable models that operate directly on raw modalities, fair evaluation protocols that emphasise zero‑shot testing, and tighter integration of pipeline sub‑modules to avoid error accumulation. Future work should expand video diversity; improve the robustness of SfM, dense reconstruction, and grounding on in‑the‑wild footage; and develop benchmarks that better reflect true 3D spatial intelligence.
