Monocular Open‑Vocabulary Occupancy Prediction Sets New SOTA for Indoor 3D Scenes (CVPR 2026 Oral)

The paper introduces LegoOcc, a monocular open‑vocabulary occupancy framework that unifies geometry and semantics through language‑embedded Gaussians. It combines Poisson‑based aggregation with progressive temperature decay, more than doubles the previous best mIoU on Occ‑ScanNet, and runs at 22.47 FPS, which makes it well suited to embodied robots.

Machine Heart

In embodied perception, agents must understand both fine‑grained geometry and open‑vocabulary semantics of indoor environments, but most existing occupancy models are limited to a closed set of categories defined during training.

LegoOcc Overview

LegoOcc, a CVPR 2026 Oral paper, is presented as the first monocular open‑vocabulary 3D occupancy predictor. It represents scenes with language‑embedded Gaussians (LE‑Gaussians), each carrying position, scale, covariance, opacity, and a semantic embedding aligned to language space. This allows arbitrary text queries without extra semantic supervision.

Key Technical Components

(1) Language‑embedded 3D Gaussians from a single image: a feed‑forward network predicts a set of Gaussians that jointly encode geometry and a language‑aligned semantic vector, merging the geometry and semantics branches into a single representation.

(2) Poisson‑based Gaussian‑to‑Occupancy conversion (G2O): each Gaussian’s contribution to a voxel is treated as a Poisson event intensity; the probability of occupancy is the chance of at least one event. This formulation is more stable under weak binary occupancy supervision than Bernoulli‑style aggregation.

(3) Progressive Temperature Decay: a temperature‑scaled sigmoid controls opacity during training, gradually annealing from high to low temperature. This reduces feature mixing along rays in dense indoor scenes while preserving differentiable gradients, improving per‑Gaussian language alignment.
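
As a rough picture of the representation in component (1), the sketch below stores one language‑embedded Gaussian and evaluates its opacity‑weighted falloff at a 3D point. The class name, field names, and the axis‑aligned simplification (a per‑axis scale instead of a full covariance) are illustrative assumptions, not the paper's implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LEGaussian:
    """One language-embedded Gaussian (hypothetical field names)."""
    mean: Tuple[float, float, float]    # 3D position of the Gaussian centre
    scale: Tuple[float, float, float]   # per-axis standard deviation (axis-aligned assumption)
    opacity: float                      # scalar in [0, 1]
    embedding: List[float]              # language-aligned semantic vector

    def density(self, point: Tuple[float, float, float]) -> float:
        """Unnormalised Gaussian falloff at `point`, weighted by opacity."""
        sq = sum(((p - m) / s) ** 2
                 for p, m, s in zip(point, self.mean, self.scale))
        return self.opacity * math.exp(-0.5 * sq)


g = LEGaussian(mean=(0.0, 0.0, 1.0), scale=(0.2, 0.2, 0.1),
               opacity=0.8, embedding=[0.6, 0.8])
print(g.density((0.0, 0.0, 1.0)))   # at the centre the falloff is 1, leaving the opacity
print(g.density((0.2, 0.0, 1.0)))   # one standard deviation away along x
```

Because geometry and the semantic vector live in the same object, a single set of such Gaussians can serve both occupancy prediction and text queries.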

Experimental Validation

On the Occ‑ScanNet benchmark, LegoOcc achieves 21.05 mIoU and 59.50 IoU under the open‑vocabulary setting, more than a two‑fold increase in mIoU over the previous best method LOcc and surpassing all closed‑vocabulary baselines.
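
For readers unfamiliar with the two metrics, the sketch below follows the convention common in occupancy benchmarks: IoU scores occupied‑versus‑empty geometry, and mIoU averages per‑class IoU over semantic classes. The label layout (0 for empty) and the choice to skip classes absent from both prediction and ground truth are assumptions; Occ‑ScanNet's exact protocol may differ.

```python
def geometry_iou(pred, gt, empty=0):
    """Binary IoU: a voxel counts as occupied if its label is not `empty`."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else 0.0


def mean_iou(pred, gt, classes):
    """Average per-class IoU over classes that appear in pred or gt."""
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened voxel grid: 0 = empty, 1 and 2 = two semantic classes.
gt = [0, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 2, 1]

print(geometry_iou(pred, gt))            # occupied-vs-empty overlap
print(mean_iou(pred, gt, classes=[1, 2]))  # mean of the per-class overlaps
```

In this toy grid the geometry score is higher than the semantic one, since a voxel can be correctly marked occupied while carrying the wrong class.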

**Ablation of aggregation**: using GaussianFormer2‑style aggregation yields 0.00 mIoU/0.00 IoU; Bernoulli aggregation improves to 17.25 mIoU/46.65 IoU; Poisson aggregation further raises performance to 21.05 mIoU/59.50 IoU, demonstrating its stability under weak supervision.
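
The Poisson and Bernoulli rules compared in this ablation can be written in a few lines. This is a minimal sketch of the textbook formulas, assuming independent per‑Gaussian contributions; the function names and sample values are hypothetical, and how the paper computes each Gaussian's intensity is not specified here.

```python
import math


def poisson_occupancy(intensities):
    """P(at least one event) for independent Poisson contributions.

    The summed intensity is the rate of the combined process,
    so P(N >= 1) = 1 - exp(-sum(lambda_i)).
    """
    return 1.0 - math.exp(-sum(intensities))


def bernoulli_occupancy(probs):
    """Bernoulli-style alternative: 1 - prod(1 - p_i), each p_i in [0, 1]."""
    empty = 1.0
    for p in probs:
        empty *= 1.0 - p
    return 1.0 - empty


contribs = [0.05, 0.30, 1.20]         # hypothetical per-Gaussian intensities at one voxel
print(poisson_occupancy(contribs))    # probability that the voxel is occupied
print(poisson_occupancy([]))          # no contributing Gaussian, so the voxel is empty
print(bernoulli_occupancy([0.5, 0.5]))
```

One visible difference: Poisson intensities only need to be non‑negative, while the Bernoulli form requires every contribution to be a valid probability in [0, 1].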

**Ablation of temperature strategy**: fixing a high temperature during both training and testing keeps geometry IoU decent but drops mIoU; training at a high temperature and switching to a low one only at test time causes a train‑test mismatch; using a low temperature from the start hampers optimization. Progressive temperature decay yields the best balance of stability and semantic discrimination.
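
A minimal sketch of a temperature‑scaled sigmoid with a high‑to‑low schedule follows. The exponential schedule, the start and end temperatures, and the function names are assumptions chosen for illustration; the source does not state the schedule the authors use.

```python
import math


def tempered_sigmoid(logit, temperature):
    """sigmoid(logit / T): a large T flattens the curve, a small T sharpens it."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def temperature_at(step, total_steps, t_start=5.0, t_end=0.5):
    """Exponential interpolation from t_start down to t_end (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


# Same opacity logit, evaluated at the start, middle, and end of training.
for step in (0, 500, 1000):
    t = temperature_at(step, total_steps=1000)
    print(step, round(t, 3), round(tempered_sigmoid(2.0, t), 3))
```

With a fixed positive logit, the opacity moves from a soft value near 0.5 toward 1 as the temperature falls, so early training sees smooth gradients and late training sees near‑binary opacities.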

Inference on a single RTX 4090 runs at 22.47 FPS, noticeably faster than competing methods, highlighting suitability for real‑time robot platforms.

Qualitative Results

Visualizations compare closed‑vocabulary predictions (e.g., only “chair”, “table”) with open‑vocabulary outputs that respond to queries such as “shoes”, “paper towel”, or “remote control”, producing heatmaps for previously unseen objects.
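
A text query of this kind can be pictured as a similarity lookup: each occupied voxel carries a language‑aligned feature, and the heatmap is its cosine similarity to the query embedding. The sketch below assumes that setup; the toy 3‑dimensional vectors stand in for real text‑encoder outputs, and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query_heatmap(voxel_features, text_embedding):
    """Map each voxel index to its similarity with the query embedding."""
    return {idx: cosine(f, text_embedding) for idx, f in voxel_features.items()}


# Toy features keyed by voxel index (x, y, z); values are made up.
voxel_features = {
    (4, 2, 1): [0.9, 0.1, 0.0],
    (4, 3, 1): [0.1, 0.9, 0.1],
    (7, 0, 2): [0.0, 0.2, 0.9],
}
remote_query = [1.0, 0.0, 0.0]   # hypothetical embedding of "remote control"

heat = query_heatmap(voxel_features, remote_query)
best = max(heat, key=heat.get)   # voxel most similar to the query
print(best, round(heat[best], 3))
```

Nothing in the lookup depends on a fixed label set, which is why a new phrase only requires a new query embedding rather than retraining.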

Future Outlook

The authors envision home robots that can locate items from a single natural‑language command, such as “find the remote on the coffee table”, without having been explicitly trained on that object class, thanks to LegoOcc’s open‑vocabulary capability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
