End-to-End 3D Spatial Video Generation via Monocular Depth Estimation, Novel View Synthesis, and MV-HEVC Encoding
Leveraging AI-driven monocular depth estimation, novel view synthesis, and MV‑HEVC encoding, the JD Retail Content R&D team presents an end‑to‑end pipeline that converts 2D video assets into high‑quality immersive 3D spatial videos, introduces the large‑scale StereoV1K dataset, and demonstrates superior performance over existing methods.
In recent years, the rapid growth of social media, streaming platforms, and XR devices has created strong demand for immersive 3D spatial video, especially in short video, live streaming, and film. Despite rising consumer demand, the supply side is bottlenecked by scarce professional 3D capture equipment, complex production workflows, and high costs.
The JD Retail Content R&D team proposes an innovative solution that leverages 3D vision and AIGC techniques to continuously transform existing 2D video resources into 3D spatial video, dramatically reducing production costs and increasing coverage. The method has been accepted by the flagship multimedia conference ICME 2025 and deployed in JD.Vision video channels.
ICME 2025 (the IEEE International Conference on Multimedia and Expo) covers topics including 3D multimedia, AR/VR, immersive media, and computer vision. The team’s accepted paper, titled “SpatialMe: Stereo Video Conversion Using Depth‑Warping and Blend‑Inpainting,” describes a pipeline built on depth estimation and image generation, and introduces a new 3D video dataset for benchmarking.
3D spatial video generation is a novel view synthesis task: given a target camera pose, render the corresponding left‑eye and right‑eye images. State‑of‑the‑art approaches to view synthesis include NeRF, Gaussian Splatting, and diffusion models. Unlike generic view synthesis, 3D video must produce stereoscopic pairs with correct binocular parallax.
The proposed end‑to‑end system consists of three core modules: monocular depth estimation, novel view synthesis (including disparity computation, warping, and hole‑filling), and MV‑HEVC encoding. The overall architecture is illustrated in Figure 2.
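Conceptually, the three modules chain into a simple sequential pipeline. The sketch below is illustrative only: the callables `estimate_depth`, `synthesize_right_view`, and `encode_mv_hevc` are hypothetical placeholders standing in for the modules detailed in the following sections, not the team’s actual APIs.

```python
import numpy as np
from typing import Callable, List

Frame = np.ndarray  # H x W x 3 image

def convert_to_spatial_video(
    frames: List[Frame],
    estimate_depth: Callable[[Frame], np.ndarray],                # module 1
    synthesize_right_view: Callable[[Frame, np.ndarray], Frame],  # module 2
    encode_mv_hevc: Callable[[List[Frame], List[Frame]], bytes],  # module 3
) -> bytes:
    """Chain the three modules: depth -> view synthesis -> MV-HEVC encoding."""
    left, right = [], []
    for frame in frames:
        depth = estimate_depth(frame)                      # per-frame depth map
        right.append(synthesize_right_view(frame, depth))  # warped + inpainted
        left.append(frame)                                 # original = left eye
    return encode_mv_hevc(left, right)                     # stereo bitstream
```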
Monocular depth estimation, a fundamental computer‑vision problem, predicts per‑pixel distance from a single image. Compared with hardware‑based sensing (ToF, LiDAR) and stereo matching, learning‑based monocular approaches are far more cost‑effective but technically harder. The team adopts a DINOv2 backbone with a DPT head, adds a multi‑frame memory bank and attention mechanisms for temporal consistency, and trains on a large pseudo‑labeled dataset using supervised fine‑tuning and distillation, producing detailed and temporally stable depth maps.
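A minimal sketch of this backbone‑plus‑head architecture is shown below, assuming the publicly released DINOv2 ViT‑S/14 weights from torch.hub. The prediction head here is a deliberately simplified stand‑in for the team’s DPT head, and the memory bank and attention mechanisms are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoDepth(nn.Module):
    """DINOv2 ViT-S/14 backbone with a simplified DPT-style depth head."""

    def __init__(self):
        super().__init__()
        # Official DINOv2 hub entry point; downloads pretrained weights.
        self.backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        # Fuse four intermediate feature maps (ViT-S embed dim = 384) into depth.
        self.head = nn.Sequential(
            nn.Conv2d(4 * 384, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),
        )

    def forward(self, x):
        # x: (B, 3, H, W), ImageNet-normalized, H and W multiples of 14.
        h, w = x.shape[-2:]
        # Grab the last four transformer blocks, reshaped to 2D feature maps.
        feats = self.backbone.get_intermediate_layers(x, n=4, reshape=True)
        depth = self.head(torch.cat(feats, dim=1))
        # Upsample the coarse prediction back to the input resolution.
        return F.interpolate(depth, size=(h, w), mode='bilinear',
                             align_corners=False)
```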
For novel view synthesis, the pipeline computes disparity maps from the estimated depth, warps the input monocular video to generate the right‑eye view, and fills occluded regions using a multi‑branch InPaint module. The InPaint module combines three strategies: polygon‑based interpolation, deep‑learning‑based neural repair, and disparity‑extension, followed by a hierarchical mask‑fusion network that merges the outputs into a final high‑quality stereoscopic frame.
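The warping step follows the classic stereo relation disparity = focal × baseline / depth. The sketch below illustrates a naive forward warp with z‑buffering: pixels that receive no source pixel form the occlusion holes that the InPaint module must fill. The function name and the per‑pixel loop are illustrative, not the paper’s implementation.

```python
import numpy as np

def warp_to_right_view(left: np.ndarray, depth: np.ndarray,
                       focal_px: float, baseline_m: float):
    """Forward-warp the left image into the right-eye view.

    disparity = focal * baseline / depth (in pixels); for the right eye,
    scene points shift left, so x' = x - disparity. Returns the warped
    image and a hole mask of occluded pixels for inpainting.
    """
    h, w = depth.shape
    disparity = focal_px * baseline_m / np.clip(depth, 1e-6, None)
    right = np.zeros_like(left)
    zbuf = np.full((h, w), np.inf)      # nearest-surface depth per target pixel
    hole = np.ones((h, w), dtype=bool)  # True where nothing lands (disocclusion)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.round(xs - disparity).astype(int)
    valid = (xt >= 0) & (xt < w)
    for y, x, x2 in zip(ys[valid], xs[valid], xt[valid]):
        if depth[y, x] < zbuf[y, x2]:   # z-buffer: nearer surface wins
            zbuf[y, x2] = depth[y, x]
            right[y, x2] = left[y, x]
            hole[y, x2] = False
    return right, hole
```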
A new high‑quality, large‑scale dataset, StereoV1K, is introduced: 1,000 real‑world stereoscopic videos captured with Canon EOS R7 cameras fitted with the RF‑S7.8mm F4 STM DUAL lens. Each video is 1180×1180 pixels, 20 seconds long at 50 fps, totaling over 500,000 frames, and the dataset serves as a benchmark for the field.
Extensive quantitative and qualitative evaluations show the method achieves state‑of‑the‑art performance, lowering LPIPS by more than 28% relative to prior work and reducing visual artifacts such as blurring and edge stretching.
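For reference, LPIPS scores of this kind are commonly computed with the public lpips package. A minimal evaluation sketch, assuming the synthesized and ground‑truth right views are tensors normalized to [-1, 1]:

```python
import torch
import lpips  # pip install lpips

# AlexNet backbone is the common default for LPIPS evaluation.
loss_fn = lpips.LPIPS(net='alex')

def lpips_score(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """pred/gt: (N, 3, H, W) tensors scaled to [-1, 1]; lower is better."""
    with torch.no_grad():
        return loss_fn(pred, gt).mean().item()
```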
Given that 3D video doubles the data volume of traditional 2D video, efficient compression is crucial. Two encoding schemes are compared: SBS‑HEVC (side‑by‑side concatenation) and MV‑HEVC (multi‑view extension). MV‑HEVC leverages inter‑view prediction to achieve higher compression efficiency. The team extends a standard HEVC encoder with MV‑HEVC support, attaining a 33.28 % BD‑Rate reduction and a 31.62 % speedup over SBS‑HEVC.
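BD‑Rate figures like the one above are conventionally computed with the Bjøntegaard metric: fit log‑bitrate as a cubic polynomial in PSNR for each codec, integrate both fits over the overlapping quality range, and convert the average log‑rate difference to a percentage. A standard sketch of that calculation (not the team’s measurement code):

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjontegaard delta rate (%): average bitrate change of the test codec
    vs. the reference at equal PSNR. Negative means the test codec
    (e.g. MV-HEVC) needs fewer bits than the reference (e.g. SBS-HEVC).
    Inputs are matched lists of (bitrate, PSNR) points, typically four each."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)    # log-rate as cubic in PSNR
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))    # overlapping PSNR range
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100
```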
To ensure compatibility with Apple devices, the team reverse‑engineered Apple’s custom MV‑HEVC container, defining MOV/MP4 structures with new hvc1, hvcC, lhvC atoms and custom vexu/hfov metadata describing baseline, disparity adjustment, and horizontal field‑of‑view.
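Since vexu and hfov are ordinary ISO BMFF boxes (a 4‑byte big‑endian size followed by a 4‑byte type), they can be located with a generic atom walker. The sketch below only prints the box tree for inspection; it does not decode Apple’s proprietary payloads, and it skips the size == 0 ("to end of file") case.

```python
import struct

def walk_atoms(buf: bytes, offset: int = 0, end: int = None, depth: int = 0):
    """Walk ISO BMFF (MP4/MOV) boxes and print the box tree, so atoms such
    as vexu or hfov can be located. size == 1 means a 64-bit size follows."""
    end = len(buf) if end is None else end
    # Container boxes whose payload is itself a sequence of boxes.
    containers = {b'moov', b'trak', b'mdia', b'minf', b'stbl'}
    while offset + 8 <= end:
        size, box_type = struct.unpack_from('>I4s', buf, offset)
        header = 8
        if size == 1:  # 64-bit largesize
            size, = struct.unpack_from('>Q', buf, offset + 8)
            header = 16
        if size < header:  # malformed or size == 0; stop this level
            break
        print('  ' * depth + box_type.decode('latin-1'), size)
        if box_type in containers:
            walk_atoms(buf, offset + header, offset + size, depth + 1)
        offset += size

# usage: walk_atoms(open('spatial_video.mov', 'rb').read())
```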
In production, model size is reduced by using a ViT‑S encoder and a lightweight transformer‑based InPaint module, balancing speed against quality. Deployed on JD.Vision, the pipeline converts 2D product videos, promotional clips, and live events into immersive 3D experiences on Vision Pro, Pico, Quest, and AI glasses.
Future work envisions broader AIGC‑driven 3D/4D generation, including editable 3D models (e.g., Trellis) and 4D Gaussian Splatting for dynamic scenes, as well as comprehensive world models that capture temporal and spatial structure for interactive simulation and prediction.