Technical Overview of DiDi's AR Indoor Navigation System
DiDi's AR indoor navigation system addresses GPS unreliability in large indoor venues by combining SfM-based 3‑D reconstruction, robust visual localization seeded with magnetometer and GNSS priors, and sensor fusion that pairs pedestrian dead‑reckoning with deep‑learning heading estimation. Deployed across dozens of airports and malls, it cuts passenger pick‑up time by up to 25 %.
In large indoor venues such as airports, malls and train stations, GPS signals are unstable, the areas are vast, and the routes are complex, making it difficult for passengers to locate the pick‑up point after placing a ride‑hailing order. To address this, DiDi developed an AR‑based real‑world navigation product that combines 3‑D reconstruction, visual positioning and augmented‑reality techniques.
Application Background
User research showed that passengers often spend extra time finding the boarding point in indoor environments where GPS is inaccurate. DiDi first introduced an “image‑and‑text” guide, then explored a more intuitive AR solution, resulting in the DiDi AR navigation product.
Problem Analysis
The challenges include (1) building a map of the indoor scene, (2) determining the user’s position, and (3) providing an intuitive guidance method. Indoor GPS is unreliable; Wi‑Fi and cellular positioning suffer from large errors; the environment is large, repetitive and dynamic, making traditional outdoor navigation pipelines unsuitable.
Technical Challenges
Three core technical challenges were identified:
3‑D reconstruction for large indoor spaces.
Robust visual localization in repetitive, dynamic environments.
Accurate sensor‑based position estimation despite inertial drift.
Key Solutions
1. Vision‑based 3‑D Reconstruction
We adopt Structure‑from‑Motion (SfM) to recover scene geometry from multiple images or video streams. The pipeline includes data acquisition, feature extraction, data association, and bundle‑adjustment optimization. For large venues, we propose a block‑wise reconstruction method that builds an association graph, partitions it via a graph‑cut model, and merges blocks with pose‑graph optimization, achieving a 70 % efficiency gain and producing one of the largest indoor 3‑D models (e.g., Zhengzhou Airport).
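The partition step above can be illustrated with a minimal pure‑Python sketch. The real pipeline solves a graph‑cut model over the image association graph and merges blocks with pose‑graph optimization; here a greedy BFS split stands in for the graph cut, and the function name and parameters are illustrative, not DiDi's actual API.

```python
from collections import defaultdict, deque

def partition_association_graph(edges, num_blocks):
    """Split an image co-visibility graph into roughly equal blocks.

    `edges` is a list of (image_a, image_b) co-visibility pairs.
    A greedy BFS grows each block to a target size, which is a
    simplified stand-in for the graph-cut partition described above.
    """
    graph = defaultdict(set)
    nodes = set()
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
        nodes.update((a, b))

    target = max(1, len(nodes) // num_blocks)
    blocks, assigned = [], set()
    for seed in sorted(nodes):
        if seed in assigned:
            continue
        block, queue = [], deque([seed])
        while queue and len(block) < target:
            n = queue.popleft()
            if n in assigned:
                continue
            assigned.add(n)
            block.append(n)
            # expand along co-visibility edges to keep blocks connected
            queue.extend(graph[n] - assigned)
        blocks.append(block)
    return blocks
```

Each resulting block would then be reconstructed independently and stitched back together by optimizing relative poses between overlapping blocks.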
2. Visual Localization
We rely on camera‑based visual positioning instead of GNSS, Wi‑Fi or Bluetooth. The pipeline extracts image features, retrieves top‑N candidate images from the 3‑D model, matches 2‑D to 3‑D points, and solves pose with RANSAC + PnP. To reduce mismatches caused by repetitive signs, we incorporate coarse priors from magnetometer and GNSS, perform clustering, and re‑rank candidates with weighted scores.
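The re-ranking idea can be sketched as a weighted blend of the retrieval similarity and agreement with the coarse heading prior. This is a minimal illustration, assuming a single magnetometer/GNSS-derived heading per candidate; the weight value and signature are hypothetical.

```python
def rerank_candidates(candidates, prior_heading_deg, prior_weight=0.3):
    """Re-rank retrieved database images for visual localization.

    `candidates`: list of (image_id, similarity in [0, 1], heading_deg).
    Blends visual similarity with how well each candidate's capture
    heading agrees with the coarse magnetometer/GNSS prior, which
    helps demote look-alike candidates (e.g. repeated signage) that
    face the wrong direction.
    """
    def heading_agreement(h):
        # fold the angular difference into [0, 180], map to [0, 1]
        diff = abs((h - prior_heading_deg + 180.0) % 360.0 - 180.0)
        return 1.0 - diff / 180.0

    scored = [
        (img, (1 - prior_weight) * sim + prior_weight * heading_agreement(h))
        for img, sim, h in candidates
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

With a prior of 0°, a slightly less similar candidate facing the right way can outrank a near-duplicate facing the opposite direction, which is exactly the repetitive-sign failure mode the priors are meant to suppress.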
3. Sensor‑based Position Estimation
After visual pose is obtained, we fuse inertial sensor data (accelerometer, gyroscope, magnetometer) with a pedestrian dead‑reckoning (PDR) framework. We improve step detection using gait‑intensity thresholds, estimate stride length with statistical features, and refine heading with a deep learning model (LSTM + ResNet) that outputs a heading‑confidence estimate. A gradient‑boosted decision‑tree classifier distinguishes walking, idle and device‑shaking states to adapt the PDR parameters.
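The step-detection and stride-length pieces of the PDR framework can be sketched as follows. This assumes a preprocessed acceleration-magnitude signal and uses a simple peak detector plus a Weinberg-style amplitude model; the threshold, gap, and constant `k` are illustrative placeholders, not DiDi's tuned parameters, and the deep-learning heading model is omitted.

```python
def detect_steps(accel_mag, threshold=1.5, min_gap=10):
    """Detect steps as local peaks of acceleration magnitude (in g)
    that exceed a gait-intensity threshold and are at least
    `min_gap` samples apart (debouncing double-counted peaks)."""
    steps, last = [], -min_gap
    for i in range(1, len(accel_mag) - 1):
        is_peak = (accel_mag[i] > accel_mag[i - 1]
                   and accel_mag[i] >= accel_mag[i + 1])
        if is_peak and accel_mag[i] > threshold and i - last >= min_gap:
            steps.append(i)
            last = i
    return steps

def stride_length(peak_to_valley, k=0.35):
    """Weinberg-style stride estimate: length proportional to the
    fourth root of the peak-to-valley acceleration amplitude.
    The constant k is user-specific and hypothetical here."""
    return k * peak_to_valley ** 0.25
```

Each detected step advances the PDR position by `stride_length` along the current heading estimate; the activity classifier mentioned above would gate this update so that idle or device-shaking windows do not accumulate spurious displacement.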
Summary
The AR navigation system has been deployed in more than 24 airports, malls and stations (e.g., Zhengzhou, Shenzhen, Tokyo). Field tests show a reduction of up to 25 % in the time needed to reach the pick‑up point, demonstrating the effectiveness of combining AI‑driven visual SLAM, sensor fusion and AR rendering for indoor ride‑hailing scenarios.
Didi Tech
Official Didi technology account