End-to-End 3D Spatial Video Generation via Monocular Depth Estimation, Novel View Synthesis, and MV‑HEVC Encoding
This article presents a comprehensive AI‑driven pipeline that converts 2D video into immersive 3D spatial video. The pipeline combines monocular depth estimation, depth‑warping novel view synthesis, a multi‑branch inpainting module, the large‑scale StereoV1K dataset, and efficient MV‑HEVC compression; the work was accepted at ICME 2025 and is deployed in JD.Vision services.
With the rapid growth of social media, streaming platforms, and XR devices, demand for immersive 3D spatial video has surged, especially in short‑form, live, and cinematic content. However, production is hindered by scarce professional 3D cameras, high costs, and complex workflows.
JD Retail Content R&D proposes an AI‑based method that transforms existing 2D video assets into 3D spatial video, dramatically reducing the cost of supplying 3D content. The approach, accepted at ICME 2025, comprises three core modules: monocular depth estimation, novel view synthesis (including disparity computation, warping, and inpainting), and MV‑HEVC encoding.
Depth estimation draws on the field's evolution from traditional stereo matching to large‑model and generative techniques: the pipeline uses a DINO‑v2 backbone with a DPT head, plus a multi‑frame memory bank, to produce temporally stable relative depth maps suitable for video.
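To illustrate the role of the memory bank, here is a toy sketch of how a multi‑frame history can stabilise per‑frame depth predictions. The actual DINO‑v2 + DPT network is replaced by a stub: `update` receives an already‑predicted depth map, and the blending weight `alpha`, the bank size, and the simple exponential‑style averaging are all illustrative assumptions, not the paper's mechanism.

```python
from collections import deque
import numpy as np

class DepthMemoryBank:
    """Toy multi-frame memory bank: keeps the last N per-frame depth
    predictions and blends the current one with their mean, damping
    frame-to-frame flicker in video depth."""

    def __init__(self, size=4, alpha=0.6):
        self.frames = deque(maxlen=size)  # rolling history of depth maps
        self.alpha = alpha                # weight on the current frame

    def update(self, depth_pred):
        if self.frames:
            history = np.mean(np.stack(self.frames), axis=0)
            depth_pred = self.alpha * depth_pred + (1 - self.alpha) * history
        self.frames.append(depth_pred)
        return depth_pred
```

A real system would store backbone features rather than final depth maps and use learned attention over the bank, but the data flow (predict, blend with history, append) is the same shape.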
For novel view synthesis, the pipeline computes disparity maps from the estimated depth, warps the left‑eye frames to generate right‑eye views, and fills occlusions using a multi‑branch inpainting module (Poly‑base, DL‑base, DE‑base) followed by a hierarchical mask‑fusion network.
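The depth‑to‑disparity‑to‑warp step can be sketched in a few lines of NumPy. This is a minimal forward‑warp, assuming a rectified setup where disparity = baseline × focal / depth and the right‑eye view shifts pixels leftward; the per‑row loop and far‑to‑near ordering are a simple illustration, not the production implementation, and the returned hole mask stands in for the regions the multi‑branch inpainting module must fill.

```python
import numpy as np

def warp_left_to_right(left, depth, baseline, focal):
    """Forward-warp a left-eye (grayscale) image into a synthetic
    right-eye view. Returns the warped image plus a hole mask of
    disoccluded pixels left for an inpainting stage."""
    h, w = depth.shape
    disparity = baseline * focal / depth  # pixels, larger = nearer
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    xs = np.arange(w)
    for y in range(h):
        # Target column in the right view for each source pixel.
        xt = np.round(xs - disparity[y]).astype(int)
        valid = (xt >= 0) & (xt < w)
        # Splat far-to-near so nearer pixels (larger disparity)
        # overwrite farther ones at the same target column.
        order = np.argsort(disparity[y][valid])
        src = xs[valid][order]
        dst = xt[valid][order]
        right[y, dst] = left[y, src]
        filled[y, dst] = True
    return right, ~filled
```

For color frames the same mapping is applied per channel; the hole mask is exactly what the Poly‑base/DL‑base/DE‑base branches and the mask‑fusion network would then inpaint.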
A new high‑quality, large‑scale dataset named StereoV1K was created, containing 1,000 real‑world stereo videos (1180×1180 resolution, >500 k frames) to serve as a benchmark for the field.
The final 3D video is encoded with MV‑HEVC, which exploits inter‑view prediction to achieve up to 33 % BD‑Rate reduction and 31 % speed improvement over SBS‑HEVC. Compatibility with Apple devices was ensured by reverse‑engineering the AVFoundation container and defining custom metadata (vexu, hfov, blin, dadj).
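The 33 % figure refers to the Bjøntegaard delta rate (BD‑Rate), the standard way to compare two rate–distortion curves. As a hedged sketch of how such a number is computed (not JD's evaluation code), the classic method fits log‑bitrate as a cubic polynomial of PSNR for each codec and averages the gap over the overlapping quality range:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate: average bitrate difference (%) of the
    test codec vs. the anchor over the overlapping PSNR interval.
    Negative values mean the test codec needs fewer bits."""
    # Fit log10(rate) as a cubic function of PSNR for each curve.
    pa = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log10(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the shared quality range.
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_diff = ((np.polyval(it, hi) - np.polyval(it, lo))
                - (np.polyval(ia, hi) - np.polyval(ia, lo))) / (hi - lo)
    return (10 ** avg_diff - 1) * 100.0
```

So a reported "33 % BD‑Rate reduction" means MV‑HEVC's inter‑view prediction delivers the same quality as side‑by‑side HEVC at roughly two‑thirds the bitrate on average.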
Deployed in JD.Vision, the system converts 2D product shorts, promotional clips, and live streams into stereoscopic video for Vision Pro, Pico, Quest, and AI glasses, delivering immersive experiences while maintaining real‑time performance.
Future work will focus on AIGC‑driven 3D/4D generation, editable world models, and further optimization of speed and quality.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.