How AI Turns 2D Videos into Immersive 3D Spatial Content at Scale

Leveraging 3D vision and AIGC, JD Retail’s R&D team converts abundant 2D video assets into high‑quality stereoscopic 3D spatial videos through a pipeline of monocular depth estimation, novel view synthesis, multi‑branch inpainting, and MV‑HEVC encoding, with the work accepted at ICME 2025 and accompanied by a new StereoV1K dataset.

JD Cloud Developers

In recent years, the rapid growth of social media, streaming platforms, and XR devices has driven a surge in demand for immersive 3D spatial video, especially in short‑form, live‑stream, and film domains. While consumer demand is rising, the supply side is bottlenecked by scarce professional 3D capture hardware and by high production complexity and cost.

JD Retail’s content product R&D team proposes an innovative method that leverages 3D vision and AIGC to continuously transform existing 2D video resources into 3D spatial video, dramatically reducing production cost and increasing coverage. The approach has been accepted by the flagship multimedia conference ICME 2025 and deployed in JD.Vision video channels.

Research accepted by ICME 2025

Technical Solution

3D spatial video generation is a novel view synthesis task that renders images for a target pose given a source view. State‑of‑the‑art solutions include NeRF, Gaussian Splatting, and Diffusion models. Unlike generic view synthesis, 3D spatial video must provide a left‑eye and a right‑eye view with a fixed pose offset, requiring algorithms to generate the right‑eye frame from a single left‑eye input.

The end‑to‑end pipeline consists of three core modules: monocular depth estimation, novel view synthesis (including disparity computation, warping, and hole‑filling), and MV‑HEVC encoding. The overall architecture is illustrated in Figure 2.

Figure 2: 3D spatial video generation architecture
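To make the module boundaries concrete, here is a minimal Python sketch of how the three stages could be chained; the function names and stub bodies are illustrative placeholders rather than the team's actual implementation.

```python
import numpy as np

# Hypothetical stage interfaces; the names and stub bodies are placeholders,
# not the production API described in this article.
def estimate_depth(frame: np.ndarray) -> np.ndarray:
    """Stage 1: monocular depth estimation (stubbed)."""
    return np.ones(frame.shape[:2], dtype=np.float32)

def synthesize_right_view(frame: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stage 2: disparity computation, warping, and hole filling (stubbed)."""
    return frame.copy()

def encode_mvhevc(left_frames: list, right_frames: list) -> bytes:
    """Stage 3: joint MV-HEVC encoding of the two views (stubbed)."""
    return b""

def convert_2d_to_spatial(left_frames: list) -> bytes:
    """Chain the three modules over a monocular (left-eye) frame sequence."""
    depths = [estimate_depth(f) for f in left_frames]
    rights = [synthesize_right_view(f, d) for f, d in zip(left_frames, depths)]
    return encode_mvhevc(left_frames, rights)
```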

1.1 Monocular Depth Estimation

Depth estimation is a fundamental computer‑vision problem that infers scene geometry from images or video. It underpins AR/VR, robot navigation, and autonomous driving. Traditional approaches rely on hardware (TOF, LiDAR) or stereo matching; monocular methods are cheaper but more challenging. Recent advances have moved from classic algorithms to deep learning and, finally, to large‑model or generative techniques.

Our solution adopts DINOv2 as the backbone, integrates a DPT head, and introduces a multi‑frame memory bank with attention to improve temporal stability and accuracy. The architecture is shown in Figure 4.

Figure 4: Monocular depth estimation architecture
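For a structural picture of that design, the PyTorch sketch below combines a ViT‑style patch embedding, a memory bank of past‑frame tokens fused via cross‑attention, and a simple decoding head. The production model uses a pretrained DINOv2 backbone and a full DPT head; every layer and size here is a simplified stand‑in.

```python
import torch
import torch.nn as nn

class TemporalDepthNet(nn.Module):
    """Minimal sketch: ViT-style backbone stand-in, cross-attention over a
    memory bank of past-frame tokens, and a simple depth head."""

    def __init__(self, dim: int = 384, memory_size: int = 4):
        super().__init__()
        # Placeholder for a DINOv2 backbone; in practice a pretrained ViT
        # producing patch tokens of dimension `dim`.
        self.backbone = nn.Conv2d(3, dim, kernel_size=14, stride=14)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Placeholder for a DPT-style head decoding tokens to a depth map.
        self.head = nn.Sequential(nn.Conv2d(dim, 1, kernel_size=1), nn.Sigmoid())
        self.memory_size = memory_size
        self.memory: list = []

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(frame)                    # B x dim x h x w patch features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # B x (h*w) x dim
        if self.memory:                                # attend to cached past-frame tokens
            mem = torch.cat(self.memory, dim=1)
            fused, _ = self.cross_attn(tokens, mem, mem)
            tokens = tokens + fused
        self.memory.append(tokens.detach())
        self.memory = self.memory[-self.memory_size:]  # keep a fixed-size memory bank
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(feat)                         # B x 1 x h x w relative depth

# Example: net = TemporalDepthNet(); depth = net(torch.rand(1, 3, 224, 224))
```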

1.2 Novel View Synthesis

Novel view synthesis aims to generate new viewpoints from limited source views, a key technology for VR, AR, film VFX, and games. Although NeRF, Gaussian Splatting, and diffusion models have made progress, they face challenges such as scene‑specific modeling and temporal inconsistency. For our use case, we only need a fixed‑pose right‑eye image, so we adopt a depth‑warp followed by an InPaint hole‑filling strategy.

After obtaining depth, we compute a disparity map, warp the left‑eye video to the target right‑eye pose, and fill missing regions with a custom InPaint framework, yielding the final right‑eye view (Figure 6).

Figure 6: End‑to‑end novel view synthesis architecture
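The warp step can be illustrated with a small NumPy sketch: convert relative depth to a horizontal disparity, forward‑splat left‑eye pixels into the right view with a z‑buffer, and record the disoccluded holes handed to the inpainting stage. The `max_disp` scaling stands in for the baseline and focal‑length factor, which the article does not specify.

```python
import numpy as np

def warp_left_to_right(left: np.ndarray, depth: np.ndarray, max_disp: float = 30.0):
    """Depth-based forward warping sketch: returns the warped right-eye view
    and a boolean hole mask marking disoccluded pixels for inpainting."""
    h, w = depth.shape
    # Nearer pixels (larger inverse depth) shift more between the two eyes.
    inv = 1.0 / np.clip(depth, 1e-6, None)
    disp = max_disp * (inv - inv.min()) / (inv.max() - inv.min() + 1e-6)

    right = np.zeros_like(left)
    zbuf = np.full((h, w), -np.inf)                 # keep the nearest contributor per pixel
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip((xs - disp).round().astype(int), 0, w - 1)  # right eye shifts content left
    for y, x_src, x_dst in zip(ys.ravel(), xs.ravel(), xt.ravel()):
        if inv[y, x_src] > zbuf[y, x_dst]:
            zbuf[y, x_dst] = inv[y, x_src]
            right[y, x_dst] = left[y, x_src]
    holes = ~np.isfinite(zbuf)                      # never-written pixels need inpainting
    return right, holes
```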

1.2.1 Multi‑branch InPaint Module

To achieve high‑quality and consistent filling, we employ three branches: a traditional polygon‑based interpolator (stable but prone to edge artifacts), a deep neural inpainting network (better edge quality but less stable), and a disparity‑extension strategy (which reduces stretching and foreground leakage). Fusing the complementary strengths of these branches yields superior results (Figure 7).

Figure 7: Disparity extension strategy effectiveness
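The article does not spell out the fusion rule, so the sketch below simply takes a confidence‑weighted blend of the three branch outputs inside the hole mask and keeps warped pixels elsewhere; it illustrates the shape of the problem rather than the production logic.

```python
import numpy as np

def fuse_inpaint_branches(candidates, confidences, warped, holes):
    """Hypothetical fusion: `candidates` is a list of three HxWx3 branch
    outputs, `confidences` a list of three HxW maps, `warped` the warped
    right view, and `holes` the boolean disocclusion mask."""
    conf = np.stack(confidences).astype(np.float32)             # 3 x H x W
    weights = conf / (conf.sum(axis=0, keepdims=True) + 1e-6)   # per-pixel branch weights
    blended = np.einsum("khw,khwc->hwc",
                        weights, np.stack(candidates).astype(np.float32))
    out = warped.astype(np.float32).copy()
    out[holes] = blended[holes]                                  # fill only disoccluded pixels
    return out
```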

1.2.2 StereoV1K Dataset

Existing public 3D video datasets suffer from low resolution, limited scene variety, and poor realism. To address this, we built StereoV1K, the first high‑quality real‑world stereoscopic video benchmark. Using Canon RF‑S7.8mm F4 STM DUAL lenses mounted on EOS R7 cameras, we captured 1,000 videos (1180×1180, ~20 s each, 50 fps), totaling over 500,000 frames. StereoV1K will serve as a standard benchmark for the community (Figure 9).

Figure 9: StereoV1K dataset comparison

1.3 MV‑HEVC Encoding

Generated stereoscopic videos double the data volume of 2D video, making efficient compression essential. Traditional SBS‑HEVC simply concatenates left and right frames, resulting in low compression efficiency. MV‑HEVC encodes multiple views in a single bitstream, exploiting inter‑view redundancy for higher compression. Our MV‑HEVC extension reduces BD‑Rate by 33.28 % and speeds up encoding by 31.62 % compared to SBS‑HEVC (Figures 10‑11).

Figure 10: SBS‑HEVC vs MV‑HEVC comparison
Figure 11: RD performance of SBS‑HEVC and MV‑HEVC
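BD‑Rate is the standard Bjøntegaard metric behind the 33.28 % figure: fit log‑bitrate as a polynomial of PSNR for each codec, average the gap over the overlapping quality range, and convert to a percentage. The sketch below implements that calculation on made‑up RD points, not the article's measurements.

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjontegaard delta rate between two RD curves (e.g. SBS-HEVC as the
    reference and MV-HEVC as the test). Negative values mean the test codec
    needs that much less bitrate for equal quality."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    # Average the log-rate gap over the overlapping quality range.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Placeholder RD points (bitrate in kbps, PSNR in dB), purely for illustration:
print(bd_rate([1000, 2000, 4000, 8000], [34, 37, 40, 43],
              [ 700, 1400, 2800, 5600], [34, 37, 40, 43]))   # about -30 %
```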

1.4 Applications and Deployment

To bring the technology to production, we slimmed the depth model down to a ViT‑S backbone and adapted it with supervised fine‑tuning, and replaced the InPaint backbone with a lightweight transformer trained on StereoV1K. Tests on large‑scale video streams show the generated 3D content meets business needs, though occasional artifacts remain; future work focuses on further speed‑quality trade‑offs.

3D spatial videos are now viewable on Vision Pro, Pico, Quest, and AI glasses. JD.Vision leverages the pipeline to convert 2D product shorts, promos, and launch videos into immersive stereoscopic experiences, enhancing user engagement across e‑commerce and advertising scenarios.

Future Outlook

2.1 AIGC 3D/4D

Since 2024, AIGC for 3D/4D has accelerated, with approaches such as Google’s CAT3D, InstantMesh, and Trellis pushing multi‑view diffusion and structured 3D representations. Editable 3D generation and controllable synthesis are emerging research directions (Figure 13).

Figure 13: AIGC 3D/4D generation examples

2.2 World Models

World models aim to capture spatio‑temporal structure with dense semantic representation and editability, enabling realistic scene reconstruction, creation, and prediction. Recent efforts such as World Labs and Meta's Orion AI glasses illustrate the potential of such models for immersive AR/VR experiences (Figure 15).

Figure 15: World Labs and Meta Orion AI glasses

References

Zhang J, Jia Q, Liu Y, et al. SpatialMe: Stereo Video Conversion Using Depth‑Warping and Blend‑Inpainting. arXiv preprint arXiv:2412.11512, 2024.

Yang S P, et al. Optical MEMS devices for compact 3D surface imaging cameras. Micro and Nano Systems Letters, 2019, 7: 1‑9.

Bhat S F, Birkl R, Wofk D, et al. ZoeDepth: Zero‑shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.

Yang L, Kang B, Huang Z, et al. Depth Anything V2. arXiv preprint arXiv:2406.09414, 2024.

Teed Z, Deng J. RAFT: Recurrent All‑Pairs Field Transforms for Optical Flow. ECCV 2020.

Zhang K, Fu J, Liu D. Flow‑guided transformer for video inpainting. ECCV 2022.

Zhou S, Li C, Chan K C K, Loy C C. ProPainter: Improving propagation and transformer for video inpainting. ICCV 2023.

Han Y, Wang R, Yang J. Single‑view view synthesis in the wild with learned adaptive multiplane images. ACM SIGGRAPH 2022.

Wang L, Frisvad J R, Jensen M B, et al. StereoDiffusion: Training‑free stereo image generation using latent diffusion models. CVPR 2024.

Lv Z, Long Y, Huang C, et al. SpatialDreamer: Self‑supervised stereo video synthesis from monocular input. arXiv preprint arXiv:2411.11934, 2024.

Mildenhall B, Srinivasan P P, Tancik M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis. CACM 2021.

Kerbl B, Kopanas G, Leimkühler T, et al. 3D Gaussian Splatting for real‑time radiance field rendering. ACM TOG 2023.

Gao R, Holynski A, Henzler P, et al. CAT3D: Create anything in 3D with multi‑view diffusion models. arXiv preprint arXiv:2405.10314, 2024.

Xu J, Cheng W, Gao Y, et al. InstantMesh: Efficient 3D mesh generation from a single image with sparse‑view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.

Xu Z, Xu Y, Yu Z, et al. Representing long volumetric video with temporal Gaussian hierarchy. ACM TOG 2024.

Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Xiang J, et al. Structured 3D latents for scalable and versatile 3D generation. arXiv preprint arXiv:2412.01506, 2024.

Wu G, Yi T, Fang J, et al. 4D Gaussian splatting for real‑time dynamic scene rendering. CVPR 2024.

Shih M L, Su S Y, Kopf J, Huang J B. 3D photography using context‑aware layered depth inpainting. CVPR 2020.

Tags: computer vision, AIGC, depth estimation, 3D video, MV-HEVC, novel view synthesis, StereoV1K
Written by JD Cloud Developers

JD Cloud Developers is JD Technology Group's platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product and technology updates, industry content, and tech event news, embracing technology and partnering with developers to envision the future.
