Artificial Intelligence · 16 min read

Real-time Single-image 3D Photo Generation on Mobile Devices Using Deep Learning

The article presents a mobile‑first solution that converts a single RGB photograph into an interactive 3D photo by combining learning‑based monocular depth estimation, multi‑task image‑and‑depth restoration, face‑specific refinement, and a custom KwaiNN inference engine to achieve real‑time rendering across a wide range of smartphones without requiring depth sensors.

Kuaishou Tech

Kuaishou Y‑tech introduces a method to transform a single 2D RGB image into a dynamic 3D photo in real time on mobile devices, leveraging deep‑learning‑based depth estimation and image restoration to perceive spatial context, and integrating the proprietary KwaiNN inference engine with the SKwai 3D effects engine for on‑device rendering.

Traditional 3D photo creation relied on manual editing tools (AE VoluMax, Photoshop) or required depth sensors (iPhone). Recent works from Adobe, Facebook, and Snapchat introduced 3D Photo algorithms but still faced challenges such as accurate depth prediction, occlusion handling, and high computational cost.

The proposed pipeline addresses two core challenges: (1) universal scene depth estimation that delivers high‑quality depth maps for both indoor and outdoor scenes, and (2) robust image‑and‑depth inpainting for large occluded regions. The solution uses a multi‑task U‑Shaped network with skip connections, a Global Context Block (GCB) for channel recalibration, and a Spatial Attention Block (SAB) for local weighting. The network jointly predicts depth, surface normals, and portrait segmentation, trained on a near‑10‑million indoor‑outdoor dataset.
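The article names the attention blocks but not their internals. As a rough illustration of what channel recalibration (GCB) and spatial weighting (SAB) typically mean, here is a minimal numpy sketch; the function names, the single gating weight matrix `w`, and the centering heuristic in the spatial gate are all assumptions, not the paper's design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_context_block(feat, w):
    """Channel recalibration sketch: squeeze spatial dims into a
    per-channel descriptor, then gate each channel in (0, 1)."""
    # feat: (C, H, W); w: (C, C) hypothetical 1x1-conv weights
    desc = feat.mean(axis=(1, 2))           # global average pool -> (C,)
    gate = sigmoid(w @ desc)                # per-channel gate
    return feat * gate[:, None, None]       # recalibrated features

def spatial_attention_block(feat):
    """Spatial weighting sketch: pool over channels, gate each pixel."""
    pooled = feat.mean(axis=0)              # (H, W) channel-average map
    gate = sigmoid(pooled - pooled.mean())  # centered per-pixel gate
    return feat * gate[None, :, :]
```

In a real network the gates would be produced by small learned sub-networks; this sketch only shows the squeeze-then-gate pattern both blocks share.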

For portrait images, a dedicated face‑keypoint detector feeds a custom 3D face reconstruction module, producing fine‑grained facial depth that is fused with the scene depth to maintain scale consistency. The pipeline also includes a portrait segmentation stage, followed by a two‑stage image‑inpainting model that first restores coarse structures at low resolution and then refines details at full resolution, with Poisson diffusion applied to depth maps.
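The article does not specify how scale consistency between facial and scene depth is enforced. One common approach, shown here purely as a hedged sketch (function name and blending strategy are assumptions), is to fit a least‑squares scale and shift that maps the reconstructed face depth onto the scene depth over the face region:

```python
import numpy as np

def fuse_face_depth(scene_depth, face_depth, face_mask):
    """Align face depth to scene depth with a least-squares scale and
    shift over the masked face region, then splice it in (sketch)."""
    s = face_depth[face_mask]               # reconstructed face depths
    t = scene_depth[face_mask]              # scene depths at same pixels
    A = np.stack([s, np.ones_like(s)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, t, rcond=None)  # solve t ~ a*s + b
    fused = scene_depth.copy()
    fused[face_mask] = a * face_depth[face_mask] + b
    return fused
```

A production pipeline would likely also feather the mask boundary to avoid depth seams; that step is omitted here.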

Model compression techniques (quantization, distillation) reduce the inference footprint to as low as 100 MB, enabling real‑time execution across a wide range of CPUs, GPUs, and NPUs (Apple Bionic, Qualcomm Snapdragon, Huawei HiAI, MediaTek APU). The KwaiNN engine provides optimized operators (Metal, OpenCL, NEON) and supports mixed‑precision (float32, float16, uint8) for CNN and RNN architectures.
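To make the uint8 path concrete, here is a generic affine post‑training quantization sketch of the kind such engines implement; this is not KwaiNN's actual scheme (which is not published), just the standard scale‑and‑zero‑point idea:

```python
import numpy as np

def quantize_uint8(w):
    """Affine uint8 quantization: map [min, max] onto [0, 255]."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate float weights from uint8 codes."""
    return q.astype(np.float32) * scale + lo
```

The round‑trip error is bounded by half the quantization step, which is why 8‑bit inference can stay close to float accuracy when weight ranges are well behaved.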

After depth and background reconstruction, a 3D mesh is generated and rendered by the SKwai engine, supporting various camera trajectories (zoom, rotation, gyroscope‑controlled) and visual effects such as particles, halos, rain, and atmospheric fog. The system has been deployed in Kuaishou’s main app, Kuaishou Short Video, and Snack Video, achieving broad device coverage and notable engagement uplift.
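The mesh‑generation step can be illustrated with a standard pinhole back‑projection: each pixel is lifted into 3D using its depth and the camera intrinsics, and neighbouring pixels are triangulated. This is a generic sketch, not the SKwai engine's implementation; intrinsics `fx, fy, cx, cy` are assumed inputs:

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    """Back-project a depth map into 3D vertices (pinhole model) and
    triangulate each pixel quad into two faces (sketch)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx               # lift pixels to camera space
    y = (v - cy) * depth / fy
    verts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)    # vertex index per pixel
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1),
                            np.stack([b, d, c], 1)])
    return verts, faces
```

Rendering this mesh from a moving virtual camera (zoom, rotation, or gyroscope‑driven) is what produces the parallax effect the article describes.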

References include recent works on monocular depth estimation, taskonomy, and existing 3D photo solutions from Adobe, Facebook, and Snapchat.

Tags: mobile AI, AR, image restoration, monocular depth estimation, 3D Photo, KwaiNN
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
