
Deterministic Video Depth (DVD): Open‑Source Framework Achieves Zero‑Shot SOTA

The DVD framework converts a pretrained video diffusion model into a deterministic, single‑pass video depth estimator, eliminating random sampling artifacts, preserving geometric and semantic priors, and reaching zero‑shot state‑of‑the‑art performance with 163× less training data.

HyperAI Super Neural

Jun 17, 2026


Video depth estimation is a fundamental yet challenging task in 3D vision, with applications in autonomous driving, robotics, AR/VR, digital twins, and video content generation. Generative diffusion models offer strong semantic understanding, but their random sampling can cause geometric hallucinations, scale drift, and temporal instability. Traditional discriminative models, by contrast, need massive labeled datasets and struggle to generalize.

A team at the Hong Kong University of Science and Technology (Guangzhou) introduced DVD (Deterministic Video Depth Estimation), presented as the first method that converts a pretrained video diffusion model into a deterministic depth estimator using a single forward pass. This design removes the need for iterative sampling, which speeds up inference, and eliminates the artifacts caused by random sampling, so the outputs stay temporally and structurally consistent.
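The contrast between iterative sampling and a single deterministic pass can be illustrated with a toy sketch. The code below is not DVD's architecture: `toy_network`, `sampled_depth`, and `single_pass_depth` are hypothetical names, and one float stands in for a whole video latent. It only shows why a sampler that starts from random noise varies from run to run, while a one‑pass mapping from the input frame always returns the same value.

```python
import random


def toy_network(latent, frame_feature, t):
    """Hypothetical stand-in for a trained network.

    Moves the latent toward a frame-dependent target (2 * frame_feature).
    At t == 0 it lands exactly on the target.
    """
    target = 2.0 * frame_feature
    return latent + (target - latent) / (t + 1)


def sampled_depth(frame_feature, steps=4, seed=None):
    """Iterative sampling: start from Gaussian noise, refine over several steps."""
    rng = random.Random(seed)
    latent = rng.gauss(0.0, 1.0)  # random starting point
    for t in range(steps, 0, -1):
        latent = toy_network(latent, frame_feature, t)
        latent += 0.1 * rng.gauss(0.0, 1.0)  # noise re-injected at every step
    return latent


def single_pass_depth(frame_feature):
    """Deterministic variant: one network call on the frame itself, no noise."""
    return toy_network(frame_feature, frame_feature, t=0)


if __name__ == "__main__":
    frame = 0.5
    # Different seeds give different answers for the same frame.
    print("sampled:", [sampled_depth(frame, seed=s) for s in range(3)])
    # The single pass returns the same value on every call.
    print("single pass:", [single_pass_depth(frame) for _ in range(3)])
```

In the toy, the run‑to‑run spread of `sampled_depth` plays the role of the flicker and scale drift attributed to stochastic sampling, and `single_pass_depth` also needs one network call instead of several.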

DVD retains the rich geometric and semantic priors of the underlying diffusion model through a structure‑anchor mechanism and a Latent Manifold Rectification (LMR) technique. Together, these components keep the global scene stable while accurately restoring object edges, high‑frequency textures, and motion details, which substantially improves the structural fidelity of the depth maps.

Benchmark evaluations on multiple public datasets show that DVD achieves zero‑shot state‑of‑the‑art performance while being trained on only 367,000 video frames. Leading discriminative approaches require around 60 million frames, so DVD uses approximately 163× less training data (60,000,000 / 367,000 ≈ 163). This result points to the strong geometric reasoning capability of generative priors and offers a low‑cost, high‑precision route for video 3D perception.
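The 163× figure follows directly from the two frame counts quoted above. A quick check, using only those two numbers (the variable names are illustrative):

```python
dvd_frames = 367_000          # frames used to train DVD, as reported
baseline_frames = 60_000_000  # approximate frames for leading discriminative methods

ratio = baseline_frames / dvd_frames
fraction = dvd_frames / baseline_frames

print(f"reduction factor: {ratio:.1f}x")       # reduction factor: 163.5x
print(f"share of baseline data: {fraction:.2%}")  # share of baseline data: 0.61%
```

Because the 60 million figure is itself approximate, the ratio is best read as "roughly 163×" rather than an exact value.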

To facilitate rapid experimentation, HyperAI provides an online notebook that deploys DVD with one click. Users can run the tutorial at https://go.hyper.ai/w8kUO, clone the repository (https://github.com/EnVision-Research/DVD), select an NVIDIA RTX 5090 GPU and a PyTorch environment, and follow the step‑by‑step instructions to launch the Jupyter workspace, execute the notebook, and view the demo results.


