
AI‑Driven 3D Spatial Video Generation from Monocular 2D Content with MV‑HEVC Encoding

This work presents an end‑to‑end AI pipeline that transforms existing monocular 2D videos into immersive 3D spatial streams by combining DINO‑v2‑based depth estimation, multi‑branch view synthesis, and MV‑HEVC encoding. The pipeline achieves a 33.28 % BD‑Rate reduction and a 31.62 % encoding speed‑up over side‑by‑side HEVC while delivering state‑of‑the‑art visual quality at production‑ready speeds; it is validated on the new StereoV1K benchmark and deployed in JD.Vision's e‑commerce video channels.

JD Retail Technology

In recent years, the rapid growth of social media, streaming platforms, and XR devices has created a strong demand for immersive 3D spatial video, especially in short‑video, live‑streaming, and film scenarios. While consumer interest is rising, the supply side suffers from a shortage of professional 3D capture equipment, high production costs, and stringent quality requirements.

To address this bottleneck, we propose an end‑to‑end pipeline that converts existing 2D video assets into 3D spatial video using AI‑generated content (AIGC). The method combines monocular depth estimation, novel view synthesis, and MV‑HEVC (Multi‑View HEVC) encoding. Our work has been accepted by the IEEE International Conference on Multimedia and Expo (ICME 2025) and is already deployed in JD.Vision video channels.

The pipeline consists of three core stages:

1. Monocular Depth Estimation – We adopt a DINO‑v2 backbone with a DPT head, enhanced by a memory bank and attention mechanisms to improve temporal stability. The model is fine‑tuned on a large pseudo‑label dataset generated from short‑video data using supervised fine‑tuning (SFT) and distillation techniques.
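As a rough illustration of the temporal‑stability idea (not the paper's actual architecture), the sketch below shows how per‑frame features can be cross‑attended against a FIFO memory bank of recent frames and blended back in. The class and function names, the bank capacity, and the 50/50 blend weight are all hypothetical choices for this example.

```python
import numpy as np

def attend_memory(query, memory):
    """Scaled dot-product cross-attention of current-frame feature tokens
    (query: (N, D)) over a bank of past-frame tokens (memory: (M, D))."""
    scores = query @ memory.T / np.sqrt(query.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ memory  # attention-pooled memory features

class DepthMemoryBank:
    """Toy FIFO memory bank holding features from the last `capacity` frames."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.frames = []

    def push(self, feats):
        self.frames.append(feats)
        if len(self.frames) > self.capacity:
            self.frames.pop(0)

    def stabilize(self, feats):
        """Blend current features with memory to damp frame-to-frame flicker."""
        if not self.frames:
            return feats
        memory = np.concatenate(self.frames, axis=0)
        return 0.5 * feats + 0.5 * attend_memory(feats, memory)
```

In a real system the blend would be learned and the bank would store backbone features rather than raw tokens; the point here is only the shape of the mechanism: attend over past frames, then fuse.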

2. Novel View Synthesis – Depth‑based warping generates a right‑eye view from the left‑eye frame. Missing regions are filled by a multi‑branch InPaint module that integrates polygon‑based (Poly‑base), deep‑learning‑based (DL‑base), and disparity‑extension (DE‑base) strategies. A hierarchical mask‑fusion network combines the three branches to produce high‑quality, temporally consistent results.
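The depth‑based warping step can be sketched as follows: convert depth to stereo disparity (d = f·B/Z for a pinhole rig), forward‑splat left‑view pixels to their right‑view positions, and record the disoccluded holes that the InPaint module must later fill. The focal length and baseline values here are illustrative, not the paper's settings.

```python
import numpy as np

def warp_left_to_right(left, depth, focal=500.0, baseline=0.06):
    """Forward-warp a left view to a right-eye view via depth-derived disparity.

    left:  (H, W, 3) float image
    depth: (H, W) metric depth
    Returns (right, hole_mask); hole_mask marks disoccluded pixels that a
    subsequent inpainting stage must fill.
    """
    H, W = depth.shape
    # Pinhole stereo geometry: disparity in pixels is f * B / Z.
    disparity = focal * baseline / np.maximum(depth, 1e-6)
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    # Splat far-to-near so nearer pixels overwrite farther ones (occlusion).
    ys, xs = np.unravel_index(np.argsort(-depth, axis=None), depth.shape)
    xt = np.round(xs - disparity[ys, xs]).astype(int)  # right eye: shift left
    valid = (xt >= 0) & (xt < W)
    right[ys[valid], xt[valid]] = left[ys[valid], xs[valid]]
    filled[ys[valid], xt[valid]] = True
    return right, ~filled
```

Production warpers typically add sub‑pixel splatting and per‑row z‑buffering, but the hole mask produced here is exactly the input the multi‑branch inpainting stage consumes.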

3. MV‑HEVC Encoding – Compared with the conventional side‑by‑side (SBS) HEVC, MV‑HEVC exploits inter‑view prediction to reduce redundancy. Our implementation adds MV‑HEVC support to a standard HEVC encoder, achieving a 33.28 % BD‑Rate reduction and a 31.62 % speed increase over SBS‑HEVC.
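The BD‑Rate metric quoted above is the standard Bjøntegaard delta: fit each rate–distortion curve (log‑rate vs. PSNR), integrate both fits over the overlapping quality range, and report the average bitrate change at equal quality. A minimal sketch of that computation:

```python
import numpy as np

def bd_rate(rates_anchor, psnr_anchor, rates_test, psnr_test):
    """Bjontegaard delta-rate: average % bitrate change of the test codec
    relative to the anchor at equal quality. Negative means bitrate savings
    (e.g. MV-HEVC vs. side-by-side HEVC)."""
    lr_a, lr_t = np.log10(rates_anchor), np.log10(rates_test)
    # Cubic fit of log-rate as a function of PSNR for each RD curve.
    pa = np.polyfit(psnr_anchor, lr_a, 3)
    pt = np.polyfit(psnr_test, lr_t, 3)
    # Overlapping quality interval of the two curves.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both polynomial fits over that interval.
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_log_diff = (it - ia) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100.0
```

A test codec whose curve uses half the bitrate at every quality point yields −50 %, which is the sense in which the reported −33.28 % BD‑Rate should be read.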

To support research and evaluation, we created the StereoV1K dataset, the first large‑scale, high‑resolution (1180×1180) real‑world stereo video benchmark containing 1,000 videos (over 500 k frames). StereoV1K serves as a new baseline for depth estimation, view synthesis, and compression studies.

We also investigated the container format required for Apple devices. By reverse‑engineering MOV files produced by AVFoundation, we identified the required codec boxes (hvc1, hvcC, lhvC) and the Apple‑specific metadata boxes (vexu, hfov, blin, dadj) needed to ensure compatibility with Vision Pro, iPhone, and other XR headsets.
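MOV/MP4 files are sequences of ISO BMFF boxes, each headed by a 32‑bit big‑endian size and a 4‑byte type, which is what makes this kind of reverse engineering tractable. A minimal walker for locating boxes such as vexu in a file (the function name is ours; the box layout follows the ISO BMFF spec):

```python
import struct

def parse_boxes(data, offset=0, end=None):
    """Walk sibling ISO BMFF (MOV/MP4) boxes in `data`.

    Each box starts with a 32-bit big-endian size and a 4-byte ASCII type.
    size == 1 signals a 64-bit extended size after the type; size == 0 means
    the box runs to the end of the file. Returns [(type, offset, size), ...].
    """
    end = len(data) if end is None else end
    boxes = []
    while offset + 8 <= end:
        size, btype = struct.unpack_from(">I4s", data, offset)
        if size == 1:    # 64-bit 'largesize' follows the type field
            size = struct.unpack_from(">Q", data, offset + 8)[0]
        elif size == 0:  # box extends to the end of the file
            size = end - offset
        boxes.append((btype.decode("ascii"), offset, size))
        offset += size
    return boxes
```

Container boxes (moov, trak, etc.) hold child boxes in their payload, so the same function can be recursed into a box's body to reach nested metadata like hvcC or vexu.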

Extensive experiments show that our approach surpasses state‑of‑the‑art methods, reducing LPIPS by 28 % while improving visual quality and maintaining real‑time inference speeds suitable for large‑scale production. The solution has been applied to JD's e‑commerce video catalog, converting product shorts, promotional clips, and live events into immersive 3D experiences.

Future work will explore AIGC‑driven 3D/4D generation, world‑model construction, and further optimization of model size and inference latency to meet the constraints of lightweight AR glasses.

Tags: Computer Vision · AIGC · AI Generation · Depth Estimation · 3D Video · Multiview Encoding
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
