Can OmniVGGT Unlock Multi‑Modal 3D Vision with Any Number of Inputs?
OmniVGGT introduces a flexible, omni-modality-driven transformer that can ingest any number of auxiliary geometric cues, such as depth maps and camera parameters, achieving state-of-the-art performance on diverse 3D tasks while keeping inference speed comparable to its RGB-only predecessor.
Overview
General‑purpose 3D foundation models aim to unify many visual tasks, but most rely solely on RGB images and ignore readily available geometric cues like depth and camera poses. OmniVGGT addresses this gap by accepting any combination of such modalities, enabling more accurate and robust 3D scene understanding.
Method
OmniVGGT Architecture
The model processes a set of images together with optional auxiliary inputs (camera intrinsics/extrinsics, depth maps). All inputs are fed into a shared transformer backbone that predicts 3D point clouds, refined camera parameters, depth maps, and confidence maps in an end‑to‑end fashion.
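To make this input/output contract concrete, here is a minimal usage sketch in PyTorch. The class name `OmniVGGT`, the keyword arguments, and the output keys are illustrative assumptions rather than the released API; any of the auxiliary arguments may be omitted when the corresponding modality is unavailable.

```python
import torch

# Hypothetical usage sketch; names, shapes, and output keys are assumptions.
images = torch.rand(1, 8, 3, 518, 518)        # (batch, views, C, H, W)
intrinsics = torch.rand(1, 8, 3, 3)           # optional per-view intrinsics
extrinsics = torch.rand(1, 8, 4, 4)           # optional per-view poses
depths = torch.rand(1, 8, 1, 518, 518)        # optional (partial) depth maps

# model = OmniVGGT()                           # shared transformer backbone
# preds = model(images,
#               intrinsics=intrinsics,         # any subset may be None
#               extrinsics=extrinsics,
#               depths=depths)
# preds["points"]      -> per-pixel 3D points
# preds["pose"]        -> refined camera parameters
# preds["depth"]       -> predicted depth maps
# preds["confidence"]  -> per-pixel confidence maps
```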
GeoAdapter for Random Multi‑Modal Fusion
The lightweight GeoAdapter consists of a camera adapter and a depth adapter, each normalizing, encoding, and injecting the respective geometric information into the transformer.
Camera Adapter
Given a batch of camera intrinsics and poses, the first camera defines the origin. The average distance of the remaining cameras to this origin serves as a scaling factor that normalizes all poses. The normalized parameters are encoded as a feature vector g = (q, t, f), where q is a rotation quaternion, t a translation vector, and f the field of view. This vector passes through a dedicated camera encoder to produce auxiliary camera tokens; images without camera data receive a zero-vector placeholder instead. The auxiliary tokens are then passed through a zero-initialized convolution layer and added to the original camera tokens, as sketched below.
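The following is a minimal PyTorch sketch of this injection path, assuming pose normalization has already been applied. The layer sizes, the 8-dimensional packing of g = (q, t, f), and the use of a 1×1 convolution as the zero-initialized layer are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CameraAdapter(nn.Module):
    """Sketch of the camera branch of the GeoAdapter (dims are assumptions)."""
    def __init__(self, dim=1024):
        super().__init__()
        # Encode g = (quaternion 4, translation 3, field of view 1) = 8 values.
        self.encoder = nn.Sequential(nn.Linear(8, dim), nn.GELU(), nn.Linear(dim, dim))
        # Zero-initialized 1x1 convolution: the adapter contributes nothing at
        # the start of training, so the RGB-only behavior is preserved.
        self.zero_conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, camera_tokens, g, has_camera):
        # camera_tokens: (B, N, dim); g: (B, N, 8); has_camera: (B, N) bool.
        # Views without camera data fall back to a zero-vector placeholder.
        g = torch.where(has_camera.unsqueeze(-1), g, torch.zeros_like(g))
        aux = self.encoder(g)                                     # (B, N, dim)
        aux = self.zero_conv(aux.transpose(1, 2)).transpose(1, 2)
        return camera_tokens + aux                                # residual injection
```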
Depth Adapter
Depth maps are normalized by the average depth of valid pixels, identified by a mask. The normalized depth is concatenated along the channel dimension and fed into a depth encoder that tokenizes it and aligns the resulting tokens with the spatial tokens.
Images lacking depth receive a dedicated placeholder token, which is directly added to the spatial tokens without extra convolution layers.
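Below is a minimal PyTorch sketch of this depth branch under a few assumptions: per-sample normalization, channel-wise concatenation of the normalized depth with its validity mask, a patch-convolution encoder, and a learned placeholder token. None of these details are confirmed by the source; they only illustrate the described flow.

```python
import torch
import torch.nn as nn

class DepthAdapter(nn.Module):
    """Sketch of the depth branch of the GeoAdapter (details are assumptions)."""
    def __init__(self, dim=1024, patch=14):
        super().__init__()
        # Tokenize the (normalized depth, validity mask) pair into patch tokens
        # laid out like the backbone's spatial tokens.
        self.encoder = nn.Conv2d(2, dim, kernel_size=patch, stride=patch)
        # Learned placeholder token for views without any depth input.
        self.placeholder = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, spatial_tokens, depth, mask):
        # spatial_tokens: (B, N, dim); depth: (B, 1, H, W);
        # mask: (B, 1, H, W) float in {0, 1} marking valid pixels.
        valid = mask.flatten(1).sum(dim=1).clamp(min=1.0)
        mean_depth = (depth * mask).flatten(1).sum(dim=1) / valid
        depth = depth / mean_depth.view(-1, 1, 1, 1).clamp(min=1e-6)
        x = torch.cat([depth * mask, mask], dim=1)                 # (B, 2, H, W)
        tokens = self.encoder(x).flatten(2).transpose(1, 2)        # (B, N, dim)
        # Views with no valid depth use the placeholder token instead,
        # added to the spatial tokens without any extra convolution.
        has_depth = (mask.flatten(1).sum(dim=1) > 0).view(-1, 1, 1)
        tokens = torch.where(has_depth, tokens, self.placeholder.expand_as(tokens))
        return spatial_tokens + tokens
```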
Experiments
Datasets and Metrics
Training uses 19 public datasets covering synthetic and real indoor/outdoor scenes (e.g., ARKitScenes, ScanNet, Waymo). Evaluation metrics include Absolute Relative error (Abs Rel) and inlier ratio for depth, Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) with AUC for pose, and standard 3D reconstruction metrics (Accuracy, Completeness, Normal Consistency).
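For reference, the two depth metrics can be computed as follows. This is a generic NumPy sketch; the inlier threshold of 1.03 is a common convention for multi-view depth benchmarks, not a value confirmed by the source.

```python
import numpy as np

def abs_rel(pred, gt, valid):
    """Absolute Relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred, gt = pred[valid], gt[valid]
    return np.mean(np.abs(pred - gt) / gt)

def inlier_ratio(pred, gt, valid, thresh=1.03):
    """Fraction of valid pixels whose ratio max(pred/gt, gt/pred) is below thresh."""
    pred, gt = pred[valid], gt[valid]
    return np.mean(np.maximum(pred / gt, gt / pred) < thresh)
```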
Results
OmniVGGT consistently outperforms the RGB‑only VGGT baseline across all tasks, even without auxiliary inputs. Adding partial depth or camera information yields monotonic improvements; for example, providing only 30 % of depth data reduces Abs Rel by 69.71 % and improves pose AUC substantially.
Qualitative visualizations show that auxiliary pose cues lead to more accurate camera predictions and realistic geometry, while depth cues enhance fine‑grained details and multi‑view alignment.
On single‑view depth benchmarks (Sintel, NYU‑v2) OmniVGGT surpasses prior art even without extra inputs. Multi‑view depth experiments (ScanNet, ETH3D, DTU, Tanks & Temples) demonstrate superior accuracy and robustness, especially when auxiliary depth supervision is used.
Camera pose estimation on RealEstate10K and Co3Dv2 shows notable gains over VGGT and other baselines, with up to 30× faster inference than the comparable Pow3R model.
For 3D reconstruction on the 7‑Scenes benchmark, OmniVGGT matches VGGT with RGB only and dramatically exceeds it when depth or camera data are available, achieving a 65.4 % reduction in reconstruction error.
Conclusion
OmniVGGT presents a unified, feed‑forward transformer capable of handling arbitrary numbers of geometric modalities. It delivers state‑of‑the‑art results on depth estimation, pose estimation, and 3D reconstruction while maintaining high efficiency, demonstrating that flexible multi‑modal integration is a key driver for next‑generation 3D foundation models.