
How Kuaishou’s Y‑Tech Advances Monocular Depth Estimation for Mobile AR

This article reviews Kuaishou Y‑Tech’s ECCV 2020 paper on monocular depth estimation, covering its encoder‑decoder network with Global Context Blocks (GCB) and Spatial Attention Blocks (SAB), the new HC‑Depth dataset, specialized loss functions and edge‑aware training, and its superior performance on NYUv2, TUM, and real‑world mobile AR applications.


Overview

Kuaishou Y‑Tech presents a research paper (ECCV 2020) that proposes a high‑quality monocular depth estimation method, enabling 3D scene understanding on mobile devices. The method powers new experiences such as 3D Photo and mixed reality without requiring special hardware.

Challenges in Monocular Depth Estimation

Estimating depth from a single image faces difficulties such as poor lighting, moving subjects, sky regions, false edges, and camera motion. Existing methods treat depth prediction as pixel‑wise classification or regression, ignoring global structural relationships, which leads to layout errors and blurred edges.

Network Architecture

The proposed model follows an encoder‑decoder (U‑shape) design with skip connections. It introduces two novel modules:

Global Context Block (GCB): recalibrates channel features by embedding global semantic context.

Spatial Attention Block (SAB): a spatial attention mechanism that guides feature selection at multiple scales.

Low‑resolution SAB features provide global layout cues, while high‑resolution SAB features emphasize fine details. The fused multi‑scale features are up‑sampled to produce the final depth map.
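The GCB’s channel recalibration can be illustrated with a minimal sketch: a global descriptor is pooled from the whole feature map and used to gate each channel. This is an approximation of the idea in the style of squeeze‑and‑excitation; the single‑matrix gating parameterization `w_gate` is a hypothetical simplification, not the paper’s exact block.

```python
import numpy as np

def global_context_recalibrate(feat, w_gate):
    """Recalibrate channel features with a global context descriptor.

    feat:   (C, H, W) feature map
    w_gate: (C, C) learned gating weights (hypothetical parameterization)
    Each channel is rescaled by a sigmoid gate computed from the
    globally pooled context vector, so channels that matter for the
    overall scene layout are emphasized.
    """
    c, h, w = feat.shape
    context = feat.reshape(c, -1).mean(axis=1)      # global average pooling -> (C,)
    gate = 1.0 / (1.0 + np.exp(-w_gate @ context))  # sigmoid gating -> (C,)
    return feat * gate[:, None, None]               # channel-wise rescaling

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w = rng.standard_normal((8, 8)) * 0.1
y = global_context_recalibrate(x, w)
```

Because the gate lies in (0, 1), the block can only attenuate channels relative to their input; in a trained network the surrounding convolutions compensate for this scaling.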

Spatial Attention Block Details

SAB uses a 1×1 convolution to squeeze concatenated features, aggregates spatial context, and generates an attention map that encodes depth information for every pixel. The attention map is multiplied element‑wise with low‑level features before fusion, allowing the network to re‑calibrate GCB‑enhanced semantic features with spatially aware weights.
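The SAB fusion step above can be sketched as follows. A 1×1 convolution is equivalent to a per‑pixel linear map over channels, so it is written here as a matrix product; the single‑output‑channel squeeze (`w_squeeze`) is an illustrative simplification, not the paper’s exact layer configuration.

```python
import numpy as np

def spatial_attention_fuse(high_feats, low_feats, w_squeeze):
    """Sketch of the SAB step: squeeze concatenated features with a 1x1
    conv, form a per-pixel attention map, and re-weight low-level features.

    high_feats: (C1, H, W) GCB-enhanced high-level features
    low_feats:  (C2, H, W) low-level features to be re-weighted
    w_squeeze:  (1, C1 + C2) weights of the 1x1 "squeeze" convolution
    """
    cat = np.concatenate([high_feats, low_feats], axis=0)    # channel concat
    c, h, w = cat.shape
    logits = (w_squeeze @ cat.reshape(c, -1)).reshape(h, w)  # 1x1 conv = per-pixel linear map
    attn = 1.0 / (1.0 + np.exp(-logits))                     # attention map in (0, 1)
    return attn[None, :, :] * low_feats, attn                # spatially re-weighted features

rng = np.random.default_rng(1)
high = rng.standard_normal((3, 4, 4))
low = rng.standard_normal((2, 4, 4))
w = rng.standard_normal((1, 5)) * 0.1
fused, attn = spatial_attention_fuse(high, low, w)
```

The attention map is shared across the low‑level channels, which is what lets the block inject spatially aware weights without a per‑channel attention tensor.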

Training Losses

The loss function combines four components:

BerHu (reverse Huber) loss

Scale‑invariant gradient loss

Normal loss

Global Focal Relative Loss (GFRL) – a novel relative loss that incorporates focal loss weighting to emphasize hard pixel pairs.

GFRL samples one pixel from each 16×16 block and compares it with all other pixels in the same image. The weighting factor reduces the influence of easy pairs and focuses training on incorrectly ordered depth relationships.
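The pairwise focal weighting can be sketched as below. For brevity the sample uses a shared grid of one anchor per 16×16 block rather than fully independent per‑block sampling, and `gamma` and `tau` are illustrative hyperparameters, not the paper’s values.

```python
import numpy as np

def gfrl(pred, gt, block=16, gamma=2.0, tau=0.02, seed=0):
    """Sketch of a focal-weighted relative (ordinal) depth loss.

    Anchors are sampled on a per-block grid and compared pairwise.
    Pairs whose predicted depth ordering already matches ground truth
    get a small focal weight (1 - p)^gamma, so training concentrates
    on incorrectly ordered pairs.
    """
    rng = np.random.default_rng(seed)
    h, w = pred.shape
    ys = [rng.integers(i, min(i + block, h)) for i in range(0, h, block)]
    xs = [rng.integers(j, min(j + block, w)) for j in range(0, w, block)]
    pts = [(y, x) for y in ys for x in xs]          # one anchor per block cell
    p = np.array([pred[y, x] for y, x in pts])
    g = np.array([gt[y, x] for y, x in pts])
    dg = g[:, None] - g[None, :]
    # ordinal labels: +1 / -1, or 0 for nearly equal ground-truth depths
    sign = np.where(np.abs(dg) < tau, 0.0, np.sign(dg))
    diff = p[:, None] - p[None, :]
    prob = 1.0 / (1.0 + np.exp(-sign * diff))       # prob. the predicted order is correct
    focal = (1.0 - prob) ** gamma                   # easy pairs -> small weight
    loss = -(focal * np.log(prob + 1e-12))[np.abs(sign) > 0]
    return loss.mean()
```

A quick sanity check of the intended behavior: predicting the ground‑truth ramp exactly yields a lower loss than predicting its inversion, because inverted pairs are mis‑ordered and keep a large focal weight.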

Edge‑Aware Consistency

To improve depth discontinuities, the method applies an edge‑aware strategy: Canny edges are extracted from the predicted depth map, dilated to form boundary regions, and a higher training weight is assigned to these regions, encouraging sharper depth boundaries.
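A minimal sketch of this weighting step, with two stand‑ins: a gradient‑magnitude threshold replaces the Canny detector (to avoid an OpenCV dependency), and `thresh`, `dilate`, and `edge_w` are illustrative values, not the paper’s.

```python
import numpy as np

def edge_aware_weights(depth, thresh=0.1, dilate=2, edge_w=4.0):
    """Build a per-pixel training weight map from depth discontinuities.

    Edges are detected on the depth map, dilated into boundary regions,
    and those regions get weight `edge_w` while the rest get 1.0.
    """
    gy, gx = np.gradient(depth)
    edges = (np.hypot(gx, gy) > thresh).astype(float)   # edge mask (Canny stand-in)
    # naive dilation: max-filter the mask with a (2*dilate+1)^2 window
    dil = np.zeros_like(edges)
    for dy in range(-dilate, dilate + 1):
        for dx in range(-dilate, dilate + 1):
            shifted = np.roll(np.roll(edges, dy, axis=0), dx, axis=1)
            dil = np.maximum(dil, shifted)
    return np.where(dil > 0, edge_w, 1.0)

# a synthetic depth step: left half near, right half far
d = np.zeros((16, 16))
d[:, 8:] = 1.0
wmap = edge_aware_weights(d)
```

Multiplying the per‑pixel loss by `wmap` is what pushes the network to commit to sharp transitions instead of averaging across a boundary.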

Multi‑Dataset Incremental Training

The authors train on multiple datasets using an incremental mixing strategy. The model first converges on a dataset whose distribution resembles the target domain; harder datasets are then added one by one, with a balanced sampler keeping batch composition equitable. This accelerates convergence and improves generalization.
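The balanced sampler can be sketched as follows. The dataset names and batch size are illustrative; the point is that each batch draws equally from every active dataset, so a newly added, harder dataset is not drowned out by larger ones.

```python
import random
from collections import Counter

def balanced_batches(datasets, batch_size, seed=0):
    """Yield batches that draw equally from each active dataset.

    datasets: mapping from dataset name to a list of sample ids.
    Each batch contains batch_size // len(datasets) samples per dataset,
    drawn with replacement, then shuffled.
    """
    rng = random.Random(seed)
    per_set = max(1, batch_size // len(datasets))
    while True:
        batch = []
        for name, samples in datasets.items():
            batch.extend((name, rng.choice(samples)) for _ in range(per_set))
        rng.shuffle(batch)                 # avoid dataset-ordered batches
        yield batch[:batch_size]

# hypothetical dataset pools of very different sizes
sets = {"nyu": list(range(100)), "tum": list(range(50)), "hc": list(range(20))}
first = next(balanced_batches(sets, batch_size=6))
```

With naive uniform sampling over the union, the 20‑sample pool above would appear in under 12% of draws; the balanced sampler holds it at one third.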

Results and Comparisons

On the NYUv2 benchmark, the proposed method outperforms state‑of‑the‑art approaches in both quantitative metrics and visual quality. Similar superiority is observed on the TUM dataset (unseen scenes) and on the newly collected HC‑Depth dataset, which contains six challenging scene categories.

Real‑World Applications at Kuaishou

The depth estimation technology powers several mobile features:

Mixed Reality (MR): combines monocular depth with SLAM/VIO to enable real‑time occlusion, virtual lighting, and physical collisions on phones.

3D Photo: generates immersive 3D effects from a single image using dense reconstruction, portrait segmentation, and background inpainting.

Depth‑of‑Field Blur: uses depth maps and portrait segmentation to simulate large‑aperture bokeh on mobile cameras.

All models run on‑device via the Y‑Tech YCNN inference engine, ensuring broad device compatibility.

Y‑Tech Team Introduction

Y‑Tech is Kuaishou’s AI research group focusing on computer vision, graphics, machine learning, and AR/VR. The team operates in Beijing, Shenzhen, Hangzhou, Seattle, and Palo Alto, and welcomes collaborations via [email protected].

Tags: computer vision, deep learning, attention mechanism, mobile AR, monocular depth estimation
Written by Kuaishou Large Model (Official Kuaishou Account)
