Deep Learning‑Based 2D‑to‑3D Conversion for VR Content
iQIYI’s deep‑learning pipeline converts single‑view images into high‑quality stereo pairs for VR. Trained on side‑by‑side 3D movies, it combines a Monodepth‑based encoder‑decoder, a CVAE that encodes camera parameters, ConvLSTM for temporal consistency, and disparity‑guided inpainting to fill occlusion holes. The result is stable, continuous depth maps, validated through extensive human 3D‑effect assessments.
With the arrival of the 5G era, VR applications have exploded, and high‑quality 3D content is in great demand. iQIYI, a leading Chinese internet video platform, is researching 2D‑to‑3D conversion technology to build a richer VR 3D ecosystem.
Compared with 2D, good 3D content can reproduce realistic depth relationships, providing a superior viewing experience. The main technical challenges are the high cost of conversion, difficulty in modeling real‑world disparity across diverse scenes, and the lack of high‑quality 3D datasets.
Challenges
Dataset quality: many stereo pairs in 3D media do not follow true disparity, and camera parameters cause inconsistent disparity across similar scenes.
Inter‑frame jitter: ensuring temporal continuity and accuracy of disparity predictions for diverse scenes.
Evaluation metrics: 3D quality is often judged subjectively by humans.
To address these issues, a deep‑learning approach is adopted. Large amounts of side‑by‑side 3D movies are used to learn real disparity and train a model that converts a single‑view (monocular) image into a stereo pair.
Model Prototype
The prototype follows binocular vision principles. The disparity‑depth relationship (Equation 1) links disparity d to depth Z through the camera baseline b and focal length f:

d = (b · f) / Z    (1)

A learned mapping function F (Equation 2) predicts disparity directly from the left view I_left:

d̂ = F(I_left)    (2)

Given a single left view and known camera parameters b and f, depth can therefore be inferred from the predicted disparity.
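The disparity‑depth relation can be inverted to recover depth from a predicted disparity map. A minimal sketch (the baseline and focal‑length values are illustrative, not iQIYI’s actual rig parameters):

```python
import numpy as np

def disparity_to_depth(disparity, baseline, focal_length):
    """Invert the stereo relation d = b*f / Z to recover depth Z = b*f / d.

    `baseline` (metres) and `focal_length` (pixels) are the camera
    parameters b and f from Equation 1; the values below are illustrative.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    # Zero disparity corresponds to points at infinity.
    return np.where(disparity > 0,
                    baseline * focal_length / np.maximum(disparity, 1e-6),
                    np.inf)

# A pixel with 8 px disparity under b = 0.1 m, f = 800 px lies 10 m away.
depth = disparity_to_depth([8.0], baseline=0.1, focal_length=800.0)
```

This inversion is also how the final disparity maps can be turned into the relative depth maps mentioned later for 3D‑poster applications.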
Two dataset issues stand out: 3D media contain many stereo views that violate true disparity relationships, and, because of differing camera parameters, disparity for similar scenes is inconsistent across different 3D sources. Monodepth is selected as the baseline because it fully exploits binocular information during training while requiring only a monocular image at inference time. The architecture (Figure 2) shows a dual‑branch encoder that processes left‑eye images and a disparity‑guided decoder.
Model Evolution
Camera‑parameter problem : Training on mixed 3D movie datasets leads to instability because different movies use different camera rigs. A Conditional Variational Auto‑Encoder (CVAE) is introduced to encode camera parameters as posterior information and inject them into disparity prediction via AdaIN, following a “dual‑stage” training strategy.
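The AdaIN injection step can be sketched as follows: per‑channel statistics of a decoder feature map are replaced by a scale and shift that, in the full model, a small network would predict from the CVAE’s camera‑parameter latent. The shapes and the stand‑in gamma/beta values here are illustrative assumptions, not the production configuration:

```python
import numpy as np

def adain(content, gamma, beta, eps=1e-5):
    """Adaptive instance normalization (Huang & Belongie, 2017).

    Normalizes each channel of `content` (C, H, W) to zero mean / unit
    std, then rescales with (gamma, beta). In the described pipeline,
    gamma and beta would come from the camera-parameter latent.
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]

feat = np.random.randn(4, 8, 8)              # decoder feature map (C, H, W)
gamma = np.ones(4) * 2.0                     # stand-in for MLP(latent) scale
beta = np.zeros(4)                           # stand-in for MLP(latent) shift
out = adain(feat, gamma, beta)
```

Because the statistics are recomputed per image, the same decoder can adapt its disparity scale to whatever camera rig the latent encodes.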
Jitter problem : Temporal jitter in consecutive frames is mitigated by a ConvLSTM‑based module. However, excessive ConvLSTM layers increase training complexity, so a balanced design is adopted.
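The core of the ConvLSTM idea is that the gates are convolutions over the current input and previous hidden state, so the memory cell carries spatial disparity state across frames. A minimal single‑channel step, with naive convolutions and random weights purely for illustration:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padded 2-D convolution, single channel, for illustration."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, weights):
    """One ConvLSTM step: input/forget/output gates and the candidate are
    all convolutions over the frame x and previous hidden state h."""
    wxi, whi, wxf, whf, wxo, who, wxg, whg = weights
    i = sigmoid(conv2d_same(x, wxi) + conv2d_same(h, whi))  # input gate
    f = sigmoid(conv2d_same(x, wxf) + conv2d_same(h, whf))  # forget gate
    o = sigmoid(conv2d_same(x, wxo) + conv2d_same(h, who))  # output gate
    g = np.tanh(conv2d_same(x, wxg) + conv2d_same(h, whg))  # candidate
    c = f * c + i * g            # memory carries disparity state over time
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 3)) * 0.1 for _ in range(8)]
h = c = np.zeros((8, 8))
for _ in range(4):               # four consecutive frames
    h, c = convlstm_step(rng.standard_normal((8, 8)), h, c, weights)
```

Each extra ConvLSTM layer multiplies this recurrence depth, which is why the article notes a balance against training complexity.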
Hole‑filling problem : New viewpoints expose previously occluded regions, creating holes. An image‑inpainting module guided by the predicted disparity map is used to fill these gaps.
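Why holes appear can be seen from a forward warp: shifting each left‑view pixel by its disparity leaves some right‑view pixels with no source, and that mask is exactly what the inpainting module must fill. A minimal single‑channel sketch (nearest‑pixel splatting, no blending):

```python
import numpy as np

def warp_with_holes(left, disparity):
    """Forward-warp the left view into the right view using per-pixel
    disparity (x_right = x_left - d). Target pixels that no source pixel
    maps to are flagged as holes for the inpainting module to fill."""
    h, w = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xr = x - int(round(disparity[y, x]))   # shift by disparity
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True
    holes = ~filled   # occluded regions exposed by the new viewpoint
    return right, holes

left = np.arange(16.0).reshape(4, 4)
disp = np.full((4, 4), 1.0)        # uniform 1 px disparity, for illustration
right, holes = warp_with_holes(left, disp)
```

With uniform disparity the holes form a band at the image border; with real depth maps they cluster along depth discontinuities, which is why guiding the inpainting with the predicted disparity helps.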
The final system predicts stable, continuous disparity maps, which can also be converted to relative depth maps for applications such as 3D posters (Figures 5‑6).
Extensive human evaluation is performed to assess the 3D effect of the generated content.
References
[1] Xie J, Girshick R, Farhadi A. Deep3D: Fully automatic 2D‑to‑3D video conversion with deep convolutional neural networks. ECCV 2016.
[2] Garg R et al. Unsupervised CNN for single view depth estimation: Geometry to the rescue. ECCV 2016.
[3] Godard C et al. Unsupervised monocular depth estimation with left‑right consistency. CVPR 2017.
[4] Zhou T et al. Unsupervised learning of depth and ego‑motion from video. CVPR 2017.
[5] Huang X, Belongie S. Arbitrary style transfer in real‑time with adaptive instance normalization. ICCV 2017.
[6] Zhu JY et al. Toward multimodal image‑to‑image translation. NeurIPS 2017.
[7] Zhang H et al. Exploiting temporal consistency for real‑time video depth estimation. ICCV 2019.
[8] Tananaev D et al. Temporally consistent depth estimation in videos with recurrent architectures. ECCV 2018.
[9] Lin J et al. TSM: Temporal shift module for efficient video understanding. ICCV 2019.
[10] Wang TC et al. Video‑to‑video synthesis. arXiv 2018.
[11] Yu J et al. Free‑form image inpainting with gated convolution. ICCV 2019.
iQIYI Technical Product Team