How AI Turns 2D Videos into Immersive 3D Spatial Content at Scale

Leveraging 3D vision and AIGC, JD Retail’s R&D team converts abundant 2D video assets into high‑quality stereoscopic 3D spatial videos through a pipeline of monocular depth estimation, novel view synthesis, multi‑branch inpainting, and MV‑HEVC encoding, with the work accepted at ICME 2025 and accompanied by a new StereoV1K dataset.

JD Cloud Developers

In recent years, the rapid growth of social media, streaming platforms, and XR devices has driven a surge in demand for immersive 3D spatial video, especially in short‑form, live‑stream, and film domains. While consumer demand is rising, the supply side is bottlenecked by scarce professional 3D capture hardware and by high production complexity and cost.

JD Retail’s content product R&D team proposes an innovative method that leverages 3D vision and AIGC to continuously transform existing 2D video resources into 3D spatial video, dramatically reducing production cost and increasing coverage. The approach has been accepted by the flagship multimedia conference ICME 2025 and deployed in JD.Vision video channels.

Research accepted by ICME 2025

Technical Solution

3D spatial video generation is a novel view synthesis task that renders images for a target pose given a source view. State‑of‑the‑art solutions include NeRF, Gaussian Splatting, and Diffusion models. Unlike generic view synthesis, 3D spatial video must provide a left‑eye and a right‑eye view with a fixed pose offset, requiring algorithms to generate the right‑eye frame from a single left‑eye input.

The end‑to‑end pipeline consists of three core modules: monocular depth estimation, novel view synthesis (including disparity computation, warping, and hole‑filling), and MV‑HEVC encoding. The overall architecture is illustrated in Figure 2.

Figure 2: 3D spatial video generation architecture
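To make the module boundaries concrete, here is a minimal Python sketch of how the three stages could be chained; the function names and stub bodies are illustrative placeholders rather than the team's actual implementation.

```python
import numpy as np

# Hypothetical stage interfaces; the names and stub bodies are placeholders,
# not the production API described in this article.
def estimate_depth(frame: np.ndarray) -> np.ndarray:
    """Stage 1: monocular depth estimation (stubbed)."""
    return np.ones(frame.shape[:2], dtype=np.float32)

def synthesize_right_view(frame: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stage 2: disparity computation, warping, and hole filling (stubbed)."""
    return frame.copy()

def encode_mvhevc(left_frames: list, right_frames: list) -> bytes:
    """Stage 3: joint MV-HEVC encoding of the two views (stubbed)."""
    return b""

def convert_2d_to_spatial(left_frames: list) -> bytes:
    """Chain the three modules over a monocular (left-eye) frame sequence."""
    depths = [estimate_depth(f) for f in left_frames]
    rights = [synthesize_right_view(f, d) for f, d in zip(left_frames, depths)]
    return encode_mvhevc(left_frames, rights)
```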

1.1 Monocular Depth Estimation

Depth estimation is a fundamental computer‑vision problem that infers scene geometry from images or video. It underpins AR/VR, robot navigation, and autonomous driving. Traditional approaches rely on hardware (TOF, LiDAR) or stereo matching; monocular methods are cheaper but more challenging. Recent advances have moved from classic algorithms to deep learning and, finally, to large‑model or generative techniques.

Our solution adopts DINOv2 as the backbone, integrates a DPT head, and introduces a multi‑frame memory bank with attention to improve temporal stability and accuracy. The architecture is shown in Figure 4.

Figure 4: Monocular depth estimation architecture
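For a structural picture of that design, the PyTorch sketch below combines a ViT‑style patch embedding, a memory bank of past‑frame tokens fused via cross‑attention, and a simple decoding head. The production model uses a pretrained DINOv2 backbone and a full DPT head; every layer and size here is a simplified stand‑in.

```python
import torch
import torch.nn as nn

class TemporalDepthNet(nn.Module):
    """Minimal sketch: ViT-style backbone stand-in, cross-attention over a
    memory bank of past-frame tokens, and a simple depth head."""

    def __init__(self, dim: int = 384, memory_size: int = 4):
        super().__init__()
        # Placeholder for a DINOv2 backbone; in practice a pretrained ViT
        # producing patch tokens of dimension `dim`.
        self.backbone = nn.Conv2d(3, dim, kernel_size=14, stride=14)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Placeholder for a DPT-style head decoding tokens to a depth map.
        self.head = nn.Sequential(nn.Conv2d(dim, 1, kernel_size=1), nn.Sigmoid())
        self.memory_size = memory_size
        self.memory: list = []

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(frame)                    # B x dim x h x w patch features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # B x (h*w) x dim
        if self.memory:                                # attend to cached past-frame tokens
            mem = torch.cat(self.memory, dim=1)
            fused, _ = self.cross_attn(tokens, mem, mem)
            tokens = tokens + fused
        self.memory.append(tokens.detach())
        self.memory = self.memory[-self.memory_size:]  # keep a fixed-size memory bank
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(feat)                         # B x 1 x h x w relative depth

# Example: net = TemporalDepthNet(); depth = net(torch.rand(1, 3, 224, 224))
```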

1.2 Novel View Synthesis

Novel view synthesis aims to generate new viewpoints from limited source views, a key technology for VR, AR, film VFX, and games. Although NeRF, Gaussian Splatting, and diffusion models have made progress, they face challenges such as scene‑specific modeling and temporal inconsistency. For our use case, we only need a fixed‑pose right‑eye image, so we adopt a depth‑warp followed by an InPaint hole‑filling strategy.

After obtaining depth, we compute a disparity map, warp the left‑eye video to the target right‑eye pose, and fill missing regions with a custom InPaint framework, yielding the final right‑eye view (Figure 6).

Figure 6: End‑to‑end novel view synthesis architecture
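The warp step can be illustrated with a small NumPy sketch: convert relative depth to a horizontal disparity, forward‑splat left‑eye pixels into the right view with a z‑buffer, and record the disoccluded holes handed to the inpainting stage. The `max_disp` scaling stands in for the baseline and focal‑length factor, which the article does not specify.

```python
import numpy as np

def warp_left_to_right(left: np.ndarray, depth: np.ndarray, max_disp: float = 30.0):
    """Depth-based forward warping sketch: returns the warped right-eye view
    and a boolean hole mask marking disoccluded pixels for inpainting."""
    h, w = depth.shape
    # Nearer pixels (larger inverse depth) shift more between the two eyes.
    inv = 1.0 / np.clip(depth, 1e-6, None)
    disp = max_disp * (inv - inv.min()) / (inv.max() - inv.min() + 1e-6)

    right = np.zeros_like(left)
    zbuf = np.full((h, w), -np.inf)                 # keep the nearest contributor per pixel
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip((xs - disp).round().astype(int), 0, w - 1)  # right eye shifts content left
    for y, x_src, x_dst in zip(ys.ravel(), xs.ravel(), xt.ravel()):
        if inv[y, x_src] > zbuf[y, x_dst]:
            zbuf[y, x_dst] = inv[y, x_src]
            right[y, x_dst] = left[y, x_src]
    holes = ~np.isfinite(zbuf)                      # never-written pixels need inpainting
    return right, holes
```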

1.2.1 Multi‑branch InPaint Module

To achieve high‑quality and consistent filling, we employ three branches: a traditional polygon‑based interpolator (stable but prone to edge artifacts), a deep neural inpainting network (better edge quality but less stable), and a disparity‑extension strategy (which reduces stretching and foreground leakage). Fusing the complementary strengths of these branches yields superior results (Figure 7).

Figure 7: Disparity extension strategy effectiveness
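The article does not spell out the fusion rule, so the sketch below simply takes a confidence‑weighted blend of the three branch outputs inside the hole mask and keeps warped pixels elsewhere; it illustrates the shape of the problem rather than the production logic.

```python
import numpy as np

def fuse_inpaint_branches(candidates, confidences, warped, holes):
    """Hypothetical fusion: `candidates` is a list of three HxWx3 branch
    outputs, `confidences` a list of three HxW maps, `warped` the warped
    right view, and `holes` the boolean disocclusion mask."""
    conf = np.stack(confidences).astype(np.float32)             # 3 x H x W
    weights = conf / (conf.sum(axis=0, keepdims=True) + 1e-6)   # per-pixel branch weights
    blended = np.einsum("khw,khwc->hwc",
                        weights, np.stack(candidates).astype(np.float32))
    out = warped.astype(np.float32).copy()
    out[holes] = blended[holes]                                  # fill only disoccluded pixels
    return out
```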

1.2.2 StereoV1K Dataset

Existing public 3D video datasets suffer from low resolution, limited scene variety, and poor realism. To address this, we built StereoV1K, the first high‑quality real‑world stereoscopic video benchmark. Using Canon RF‑S7.8mm F4 STM DUAL lenses mounted on EOS R7 cameras, we captured 1,000 videos (1180×1180, ~20 s each, 50 fps), totaling over 500,000 frames. StereoV1K will serve as a standard benchmark for the community (Figure 9).

Figure 9: StereoV1K dataset comparison

1.3 MV‑HEVC Encoding

Generated stereoscopic videos double the data volume of 2D video, making efficient compression essential. Traditional SBS‑HEVC simply concatenates left and right frames, resulting in low compression efficiency. MV‑HEVC encodes multiple views in a single bitstream, exploiting inter‑view redundancy for higher compression. Our MV‑HEVC extension reduces BD‑Rate by 33.28 % and speeds up encoding by 31.62 % compared to SBS‑HEVC (Figures 10‑11).

Figure 10: SBS‑HEVC vs MV‑HEVC comparison
Figure 11: RD performance of SBS‑HEVC and MV‑HEVC
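BD‑Rate is the standard Bjøntegaard metric behind the 33.28 % figure: fit log‑bitrate as a polynomial of PSNR for each codec, average the gap over the overlapping quality range, and convert to a percentage. The sketch below implements that calculation on made‑up RD points, not the article's measurements.

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjontegaard delta rate between two RD curves (e.g. SBS-HEVC as the
    reference and MV-HEVC as the test). Negative values mean the test codec
    needs that much less bitrate for equal quality."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    # Average the log-rate gap over the overlapping quality range.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0

# Placeholder RD points (bitrate in kbps, PSNR in dB), purely for illustration:
print(bd_rate([1000, 2000, 4000, 8000], [34, 37, 40, 43],
              [ 700, 1400, 2800, 5600], [34, 37, 40, 43]))   # about -30 %
```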

1.4 Applications and Deployment

To bring the technology to production, we slimmed the depth model down to a ViT‑S backbone and adapted it with supervised fine‑tuning, and replaced the InPaint backbone with a lightweight transformer trained on StereoV1K. Tests on large‑scale video streams show the generated 3D content meets business needs, though occasional artifacts remain; future work focuses on further speed‑quality trade‑offs.

3D spatial videos are now viewable on Vision Pro, Pico, Quest, and AI glasses. JD.Vision leverages the pipeline to convert 2D product shorts, promos, and launch videos into immersive stereoscopic experiences, enhancing user engagement across e‑commerce and advertising scenarios.

Future Outlook

2.1 AIGC 3D/4D

Since 2024, AIGC for 3D/4D has accelerated, with approaches such as Google’s CAT3D, InstantMesh, and Trellis pushing multi‑view diffusion and structured 3D representations. Editable 3D generation and controllable synthesis are emerging research directions (Figure 13).

Figure 13: AIGC 3D/4D generation examples

2.2 World Models

World models aim to capture spatio‑temporal structure with dense semantic representation and editability, enabling realistic scene reconstruction, creation, and prediction. Recent efforts such as World Labs and Meta's Orion AI glasses illustrate the potential of such models for immersive AR/VR experiences (Figure 15).

Figure 15: World Labs and Meta Orion AI glasses

References

Zhang J, Jia Q, Liu Y, et al. SpatialMe: Stereo Video Conversion Using Depth‑Warping and Blend‑Inpainting. arXiv preprint arXiv:2412.11512, 2024.

Yang S P, et al. Optical MEMS devices for compact 3D surface imaging cameras. Micro and Nano Systems Letters, 2019, 7: 1‑9.

Bhat S F, Birkl R, Wofk D, et al. ZoeDepth: Zero‑shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.

Yang L, Kang B, Huang Z, et al. Depth Anything V2. arXiv preprint arXiv:2406.09414, 2024.

Teed Z, Deng J. RAFT: Recurrent All‑Pairs Field Transforms for Optical Flow. ECCV 2020.

Zhang K, Fu J, Liu D. Flow‑guided transformer for video inpainting. ECCV 2022.

Zhou S, Li C, Chan K C K, Loy C C. ProPainter: Improving propagation and transformer for video inpainting. ICCV 2023.

Han Y, Wang R, Yang J. Single‑view view synthesis in the wild with learned adaptive multiplane images. ACM SIGGRAPH 2022.

Wang L, Frisvad J R, Jensen M B, et al. StereoDiffusion: Training‑free stereo image generation using latent diffusion models. CVPR 2024.

Lv Z, Long Y, Huang C, et al. SpatialDreamer: Self‑supervised stereo video synthesis from monocular input. arXiv preprint arXiv:2411.11934, 2024.

Mildenhall B, Srinivasan P P, Tancik M, et al. NeRF: Representing scenes as neural radiance fields for view synthesis. CACM 2021.

Kerbl B, Kopanas G, Leimkühler T, et al. 3D Gaussian Splatting for real‑time radiance field rendering. ACM TOG 2023.

Gao R, Holynski A, Henzler P, et al. CAT3D: Create anything in 3D with multi‑view diffusion models. arXiv preprint arXiv:2405.10314, 2024.

Xu J, Cheng W, Gao Y, et al. InstantMesh: Efficient 3D mesh generation from a single image with sparse‑view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.

Xu Z, Xu Y, Yu Z, et al. Representing long volumetric video with temporal Gaussian hierarchy. ACM TOG 2024.

Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Xiang J, et al. Structured 3D latents for scalable and versatile 3D generation. arXiv preprint arXiv:2412.01506, 2024.

Wu G, Yi T, Fang J, et al. 4D Gaussian splatting for real‑time dynamic scene rendering. CVPR 2024.

Shih M L, Su S Y, Kopf J, Huang J B. 3D photography using context‑aware layered depth inpainting. CVPR 2020.

Tags: computer vision, AIGC, depth estimation, 3D video, MV-HEVC, novel view synthesis, StereoV1K
Written by JD Cloud Developers

JD Cloud Developers is JD Technology Group's platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product and technology updates, industry content, and tech event news, embracing technology and partnering with developers to envision the future.
