End-to-End 3D Spatial Video Generation via Monocular Depth Estimation, Novel View Synthesis, and MV‑HEVC Encoding
This article presents a comprehensive AI‑driven pipeline that converts 2D video into immersive 3D spatial video. The pipeline combines monocular depth estimation, depth‑warping novel view synthesis, a multi‑branch inpainting module, the large‑scale StereoV1K dataset, and efficient MV‑HEVC compression; the work was accepted at ICME 2025 and is deployed in JD.Vision services.
With the rapid growth of social media, streaming platforms, and XR devices, demand for immersive 3D spatial video has surged, especially in short‑form, live, and cinematic content. However, production is hindered by scarce professional 3D cameras, high costs, and complex workflows.
JD Retail Content R&D proposes an AI‑based method that transforms existing 2D video assets into 3D spatial video, dramatically reducing the cost of supplying 3D content. The approach, accepted at ICME 2025, comprises three core modules: monocular depth estimation, novel view synthesis (including disparity computation, warping, and inpainting), and MV‑HEVC encoding.
Depth estimation draws on the field's evolution from traditional stereo matching to large‑model and generative techniques: the pipeline uses a DINO‑v2 backbone with a DPT head, plus a multi‑frame memory bank, to produce temporally stable relative depth maps suitable for video.
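To illustrate the role of the memory bank, here is a toy sketch of how a multi‑frame history can stabilise per‑frame depth predictions. The actual DINO‑v2 + DPT network is replaced by a stub: `update` receives an already‑predicted depth map, and the blending weight `alpha`, the bank size, and the simple exponential‑style averaging are all illustrative assumptions, not the paper's mechanism.

```python
from collections import deque
import numpy as np

class DepthMemoryBank:
    """Toy multi-frame memory bank: keeps the last N per-frame depth
    predictions and blends the current one with their mean, damping
    frame-to-frame flicker in video depth."""

    def __init__(self, size=4, alpha=0.6):
        self.frames = deque(maxlen=size)  # rolling history of depth maps
        self.alpha = alpha                # weight on the current frame

    def update(self, depth_pred):
        if self.frames:
            history = np.mean(np.stack(self.frames), axis=0)
            depth_pred = self.alpha * depth_pred + (1 - self.alpha) * history
        self.frames.append(depth_pred)
        return depth_pred
```

A real system would store backbone features rather than final depth maps and use learned attention over the bank, but the data flow (predict, blend with history, append) is the same shape.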
For novel view synthesis, the pipeline computes disparity maps from the estimated depth, warps the left‑eye frames to generate right‑eye views, and fills occlusions using a multi‑branch inpainting module (Poly‑base, DL‑base, DE‑base) followed by a hierarchical mask‑fusion network.
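The depth‑to‑disparity‑to‑warp step can be sketched in a few lines of NumPy. This is a minimal forward‑warp, assuming a rectified setup where disparity = baseline × focal / depth and the right‑eye view shifts pixels leftward; the per‑row loop and far‑to‑near ordering are a simple illustration, not the production implementation, and the returned hole mask stands in for the regions the multi‑branch inpainting module must fill.

```python
import numpy as np

def warp_left_to_right(left, depth, baseline, focal):
    """Forward-warp a left-eye (grayscale) image into a synthetic
    right-eye view. Returns the warped image plus a hole mask of
    disoccluded pixels left for an inpainting stage."""
    h, w = depth.shape
    disparity = baseline * focal / depth  # pixels, larger = nearer
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    xs = np.arange(w)
    for y in range(h):
        # Target column in the right view for each source pixel.
        xt = np.round(xs - disparity[y]).astype(int)
        valid = (xt >= 0) & (xt < w)
        # Splat far-to-near so nearer pixels (larger disparity)
        # overwrite farther ones at the same target column.
        order = np.argsort(disparity[y][valid])
        src = xs[valid][order]
        dst = xt[valid][order]
        right[y, dst] = left[y, src]
        filled[y, dst] = True
    return right, ~filled
```

For color frames the same mapping is applied per channel; the hole mask is exactly what the Poly‑base/DL‑base/DE‑base branches and the mask‑fusion network would then inpaint.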
A new high‑quality, large‑scale dataset named StereoV1K was created, containing 1,000 real‑world stereo videos (1180×1180 resolution, >500 k frames) to serve as a benchmark for the field.
The final 3D video is encoded with MV‑HEVC, which exploits inter‑view prediction to achieve up to 33 % BD‑Rate reduction and 31 % speed improvement over SBS‑HEVC. Compatibility with Apple devices was ensured by reverse‑engineering the AVFoundation container and defining custom metadata (vexu, hfov, blin, dadj).
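The 33 % figure refers to the Bjøntegaard delta rate (BD‑Rate), the standard way to compare two rate–distortion curves. As a hedged sketch of how such a number is computed (not JD's evaluation code), the classic method fits log‑bitrate as a cubic polynomial of PSNR for each codec and averages the gap over the overlapping quality range:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate: average bitrate difference (%) of the
    test codec vs. the anchor over the overlapping PSNR interval.
    Negative values mean the test codec needs fewer bits."""
    # Fit log10(rate) as a cubic function of PSNR for each curve.
    pa = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log10(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the shared quality range.
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_diff = ((np.polyval(it, hi) - np.polyval(it, lo))
                - (np.polyval(ia, hi) - np.polyval(ia, lo))) / (hi - lo)
    return (10 ** avg_diff - 1) * 100.0
```

So a reported "33 % BD‑Rate reduction" means MV‑HEVC's inter‑view prediction delivers the same quality as side‑by‑side HEVC at roughly two‑thirds the bitrate on average.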
Deployed in JD.Vision, the system converts 2D product shorts, promotional clips, and live streams into stereoscopic video for Vision Pro, Pico, Quest, and AI glasses, delivering immersive experiences while maintaining real‑time performance.
Future work will focus on AIGC‑driven 3D/4D generation, editable world models, and further optimization of speed and quality.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.