
Real-time Monocular Human Depth Estimation and Segmentation on Embedded Systems (HDES-Net)

The paper presents HDES‑Net, a lightweight network for real‑time monocular human depth estimation and segmentation on embedded platforms. Built on a MobileNetV1 backbone with an ASPP module and depth‑wise separable convolutions, it achieves high accuracy on the CAD‑60 and EPFL‑RGBD datasets while running at 199.93 FPS on a Tesla P40 and 17.23 FPS on a Jetson Nano (114.16 FPS after TensorRT optimization).

JD Retail Technology

Recently, a paper from JD Retail Technology and the Shared Technology Department was accepted by the prestigious IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021). The work, titled Real-time Monocular Human Depth Estimation and Segmentation on Embedded Systems, introduces a novel and simple network architecture called HDES‑Net for simultaneous human segmentation and depth estimation.

The proposed network targets applications such as scene reconstruction, augmented reality, and object detection, where fast and accurate depth estimation is challenging. By using a monocular camera, the method keeps cost and power consumption low.

The network consists of three main parts:

(1) Encoder: To balance accuracy and speed, MobileNetV1 is used as the backbone instead of heavier VGG‑16 or ResNet‑50. An Atrous Spatial Pyramid Pooling (ASPP) module with multiple receptive fields is added after MobileNetV1 to capture multi‑scale features without excessive computation.
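The encoder design above can be sketched as a set of parallel dilated 3×3 convolutions fused by a 1×1 convolution. This is a minimal illustrative sketch, not the paper's exact module: the dilation rates (1, 5, 5, 9) follow the refined configuration described later in the article, while the channel sizes and input resolution here are assumptions.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal ASPP sketch: parallel 3x3 dilated convolutions over the
    backbone feature map, concatenated and fused with a 1x1 convolution.
    Channel sizes are illustrative, not the paper's exact values."""
    def __init__(self, in_ch=1024, branch_ch=96, rates=(1, 5, 5, 9)):
        super().__init__()
        # padding = dilation keeps the spatial size unchanged for a 3x3 kernel
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(branch_ch * len(rates), branch_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

# A hypothetical backbone output: batch 1, 1024 channels, 15x20 spatial map
x = torch.randn(1, 1024, 15, 20)
y = ASPP()(x)
print(y.shape)  # torch.Size([1, 96, 15, 20])
```

Each branch sees the same feature map at a different effective receptive field, so the fused output mixes local and broad context without the cost of larger dense kernels.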

(2) Decoder: The decoder contains two up‑sampling layers (8× and 2×) and several depth‑wise separable convolutions that reduce the channel count to 96. Two parallel branches predict human depth and segmentation, both employing depth‑wise separable and standard convolutions followed by up‑sampling. Segmentation features are fused into the depth branch to improve depth accuracy. The losses are cross‑entropy for segmentation and smooth L1 for depth.
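A quick way to see why depth‑wise separable convolutions suit the decoder is to count weights. The sketch below compares a standard 3×3 convolution with its depth‑wise separable equivalent at the paper's 96‑channel width; the formulas are standard, the specific setting is illustrative.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    """Depth-wise separable: one k x k filter per input channel
    (depth-wise), then a 1x1 point-wise convolution mixing channels."""
    return k * k * c_in + c_in * c_out

# Illustrative decoder-like setting: 3x3 kernels at 96 channels.
std = conv_params(3, 96, 96)
ds = ds_conv_params(3, 96, 96)
print(std, ds, round(std / ds, 1))  # 82944 10080 8.2
```

At this width the separable form uses roughly 8× fewer weights (and proportionally fewer multiply‑adds), which is where much of the embedded‑platform speedup comes from.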

(3) Optimization and Acceleration: The ASPP module is refined by keeping dilation rates 1 and 9 and replacing the other two rates with 5. Global average pooling is removed to better handle varying human sizes. Finally, TensorRT SDK is used for further acceleration.
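The effect of the refined dilation rates can be checked with the standard receptive‑field formula for a single dilated convolution; the arithmetic below simply evaluates it for the rates (1, 5, 5, 9) mentioned above.

```python
def dilated_rf(kernel=3, rate=1):
    """Effective receptive field of one dilated convolution:
    kernel + (kernel - 1) * (rate - 1)."""
    return kernel + (kernel - 1) * (rate - 1)

# The refined ASPP rates: 1 and 9 are kept, the middle two become 5.
print([dilated_rf(rate=r) for r in (1, 5, 5, 9)])  # [3, 11, 11, 19]
```

The branches thus cover 3×3, 11×11, and 19×19 effective windows, a spread matched to the range of human sizes an indoor camera typically sees, rather than relying on global average pooling.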

Experiments compare HDES‑Net with several representative methods on CAD‑60, CAD‑120, and EPFL‑RGBD datasets. The results show superior performance on CAD‑60 and EPFL‑RGBD, achieving real‑time human segmentation and depth estimation with a single monocular camera. The method performs slightly worse on CAD‑120 due to its broader depth range, which is outside the primary focus on indoor pedestrians.

Performance benchmarks indicate that the network reaches 199.93 FPS on a Tesla P40 GPU and 17.23 FPS on a Jetson Nano GPU. With additional TensorRT optimizations, the Jetson Nano can achieve 114.16 FPS, fully satisfying real‑time requirements.

Qualitative results demonstrate precise human segmentation and accurate depth estimation across various scenes.

Beyond the paper, the authors discuss the broader potential of depth estimation in AR applications, citing Google ARCore’s Depth API, games like Five Nights at Freddy’s AR, Snap’s depth‑based filters, remote‑video AR annotations, interior design, autonomous driving, 3D reconstruction, and more.

Recruitment Notice: The Visual Algorithm team, focused on AR perception algorithms, is hiring computer‑vision and deep‑learning engineers and researchers. Interested candidates can send their resumes to [email protected].

Tags: Depth Estimation, Human Segmentation, Embedded AI, HDES-Net, Real-time Vision
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
