How Kuaishou’s Y‑Tech Achieved Real‑Time 3D Photo Rendering on Any Smartphone
The article details Kuaishou Y‑Tech's end‑to‑end solution for converting a single RGB image into an interactive 3D photo on mobile devices, covering depth estimation, image inpainting, the custom KwaiNN inference engine, and real‑time 3D rendering techniques that run on all smartphone models without depth sensors.
3D Photo Overview
The Kuaishou Y‑Tech team proposes a method to transform a single RGB image into a dynamic 3D photo in real time on mobile devices, leveraging learning‑based depth estimation and image‑inpainting together with their proprietary KwaiNN inference engine and SKwai 3D effects engine.
Algorithm Framework Overview
Generating a 3D photo requires accurate scene depth, occlusion handling, and efficient mobile execution. The main challenges are (1) universal scene depth estimation that preserves facial detail and overall scene geometry, (2) high‑quality image and depth inpainting for large occluded regions, and (3) real‑time performance across diverse phone hardware.
General scene depth estimation: produce high‑quality depth maps for indoor and outdoor scenes, balancing facial fidelity and scene scale.
Universal image repair: recover missing regions of arbitrary size with high visual fidelity.
Reconstruction and rendering: rebuild the scene, design camera trajectories, and render new views.
Mobile‑side real‑time operation: ensure all modules run efficiently on the device.
Core Steps
Predict portrait segmentation and monocular depth using custom models; refine facial depth with a dedicated 3D face reconstruction pipeline and fuse it with the scene depth.
Apply portrait‑aware image inpainting to synthesize background content for occluded areas, then use Poisson diffusion to fill missing depth.
Reconstruct foreground and background meshes, generate continuous virtual camera paths, and render new views with the 3D graphics engine.
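The three core steps above can be sketched as a single pipeline. Every name below (`estimate_depth`, `inpaint_background`, `make_3d_photo`) is a hypothetical stand‑in for illustration, not Kuaishou's published API, and the stub bodies only mimic the data flow:

```python
# Illustrative sketch of the 3D-photo pipeline; all names and stub
# behaviours here are hypothetical, not Kuaishou's implementation.

def estimate_depth(image):
    # Stand-in for monocular depth estimation + face-depth fusion:
    # returns a constant depth map with the same shape as the image.
    return [[1.0 for _ in row] for row in image]

def inpaint_background(image, portrait_mask):
    # Stand-in for portrait-aware inpainting: pixels hidden by the
    # portrait are filled with a neutral value so the background
    # layer is complete.
    return [[0.5 if portrait_mask[y][x] else image[y][x]
             for x in range(len(image[0]))]
            for y in range(len(image))]

def make_3d_photo(image, portrait_mask):
    # Step 1: depth; Step 2: inpainting; Step 3: the layered scene
    # (foreground/background + depth) handed to the renderer.
    depth = estimate_depth(image)
    background = inpaint_background(image, portrait_mask)
    return {"foreground": image, "background": background, "depth": depth}
```

The key structural point is that the renderer never sees a single flat image: it receives separate foreground and background layers, each with depth, which is what makes parallax possible.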
Monocular Depth Estimation
A U‑shaped encoder‑decoder network with skip connections extracts semantic and spatial features. Global context blocks (GCB) recalibrate channel features, and a spatial attention block (SAB) modulates local region weights. Multi‑task training jointly learns depth, surface normals, and portrait segmentation, improving both scene and facial depth accuracy.
Image and Depth Repair
Portrait segmentation isolates the subject, after which a custom inpainting model restores occluded background pixels. Poisson diffusion then propagates depth values into the repaired region, yielding separate foreground/background layers with consistent depth maps.
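Poisson diffusion of depth can be pictured as solving Laplace's equation over the hole, with the known depths as boundary conditions. A minimal Jacobi‑iteration version (a simplified sketch, not the production solver) looks like:

```python
def fill_depth_poisson(depth, hole_mask, iters=200):
    """Fill masked depth values by repeatedly averaging 4-neighbours,
    i.e. diffusing known depth into the hole until it converges.
    depth: H x W floats; hole_mask: H x W bools (True = missing)."""
    h, w = len(depth), len(depth[0])
    d = [row[:] for row in depth]
    for _ in range(iters):
        nxt = [row[:] for row in d]
        for y in range(h):
            for x in range(w):
                if hole_mask[y][x]:
                    nbrs = [d[ny][nx]
                            for ny, nx in ((y - 1, x), (y + 1, x),
                                           (y, x - 1), (y, x + 1))
                            if 0 <= ny < h and 0 <= nx < w]
                    nxt[y][x] = sum(nbrs) / len(nbrs)
        d = nxt
    return d
```

Known pixels are never touched, so the filled region blends smoothly into the surrounding depth, which is exactly what the repaired background layer needs.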
3D Scene Reconstruction and Rendering
Using the fused depth and repaired background, the system performs adaptive foreground‑background mesh reconstruction. The reconstructed data is fed to the proprietary SKwai 3D effects engine, enabling smooth camera motions, gyroscope‑controlled view changes, and optional visual effects such as particles, rain, and atmospheric fog.
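A "continuous virtual camera path" can be as simple as a small orbit around the capture viewpoint; the parallax between the foreground and background meshes does the rest. The concrete radius, frame count, and path shape below are illustrative assumptions, since the engine's actual trajectories are not published:

```python
import math

def camera_orbit(radius=0.05, frames=60):
    """Generate a small circular virtual-camera path around the original
    viewpoint (one (x, y, z) offset per rendered frame). Illustrative
    values: the real engine's trajectories are not published."""
    path = []
    for i in range(frames):
        t = 2.0 * math.pi * i / frames
        # Small lateral/vertical offsets; depth (z) stays at capture.
        path.append((radius * math.cos(t), radius * math.sin(t), 0.0))
    return path
```

Gyroscope control replaces the precomputed path with live offsets derived from device orientation, but the renderer consumes them the same way.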
Mobile KwaiNN Inference Engine
KwaiNN, the upgraded successor to YCNN, is a mobile‑first AI inference engine that supports CPUs, GPUs (Mali, Adreno, Apple, NVIDIA), and NPUs (Apple Bionic, Huawei HiAI, Qualcomm SNPE, MediaTek APU). It runs CNN/RNN models in float32, float16, and uint8 precision, ships hardware‑specific operators (Metal, OpenCL, NEON), and provides a full toolchain for PyTorch/TFLite conversion, quantization, and architecture search, delivering a roughly 10% performance advantage over competing engines.
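The uint8 path is worth unpacking, since it is what makes on‑device inference cheap. A common approach, affine quantization, maps floats to 0–255 with a scale and offset; the sketch below shows that general idea, not KwaiNN's actual (unpublished) scheme:

```python
def quantize_uint8(values):
    """Affine quantization of a flat list of floats to 0..255.
    Generic sketch of the technique, not KwaiNN's actual scheme."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant tensor: any scale round-trips correctly
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_uint8(q, scale, zero):
    # Recover approximate floats; error is bounded by scale / 2.
    return [x * scale + zero for x in q]
```

Storing weights and activations as one byte instead of four cuts memory traffic roughly 4x, which is usually the bottleneck on mobile SoCs; the scale/offset pair per tensor keeps the approximation error small.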
Conclusion
The presented 3D Photo pipeline combines high‑quality monocular depth estimation, robust image/depth inpainting, and an optimized KwaiNN inference stack to deliver the first real‑time mobile 3D photo experience that works on virtually all smartphones without requiring dedicated depth sensors.