Deep Reinforcement Learning for Route Planning in DiDi Ride‑Hailing
DiDi’s route engine, which handles over 40 billion routing requests daily, explores replacing static graph algorithms with a deep‑reinforcement‑learning system that first learns intersection decisions via behavior‑cloning LSTM models and then refines them through self‑play Q‑learning, using beam‑search decoding to produce near‑optimal, low‑deviation routes for ride‑hailing.
DiDi's route engine processes over 40 billion routing requests daily, making high‑quality path planning critical for driver and passenger experience.
The road network is modeled as links with unique IDs and attributes; Beijing alone contains about 2 million links and orders may involve dozens of intersections, making optimal routing under strict time constraints challenging.
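To make the link-based model concrete, here is a minimal sketch of what a link record might look like. The field names and the example values are illustrative assumptions, not DiDi's actual schema:

```python
from dataclasses import dataclass

# Hypothetical link schema; field names are illustrative, not DiDi's format.
@dataclass(frozen=True)
class Link:
    link_id: int        # unique identifier
    length_m: float     # link length in meters
    speed_kmh: float    # current (dynamic) speed estimate
    successors: tuple   # link_ids reachable at the downstream intersection

    def travel_time_s(self) -> float:
        """Estimated traversal time under the current speed."""
        return self.length_m / (self.speed_kmh / 3.6)

# A route is then a sequence of link IDs; at each intersection the planner
# chooses one of the current link's successors.
link = Link(link_id=42, length_m=500.0, speed_kmh=36.0, successors=(43, 44))
print(round(link.travel_time_s(), 1))  # 500 m at 10 m/s -> 50.0
```

With roughly 2 million such links in Beijing alone and dozens of successor choices per trip, the decision space per order is large even before traffic dynamics are considered.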
Unlike navigation that optimizes a single objective such as distance or travel time, ride‑hailing routing must balance travel time, distance, price, and safety, offering passengers multiple route options and allowing drivers to follow platform‑specified routes.
Traditional static‑weight graph algorithms (e.g., Dijkstra) are insufficient for dynamic traffic conditions. DiDi currently uses a two‑stage pipeline: a graph‑based coarse ranking followed by a machine‑learning‑based re‑ranking.
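The two-stage idea can be sketched in a few lines. The candidate routes, features, and re-ranking weights below are invented placeholders standing in for the graph search and the learned ranker:

```python
# Minimal sketch of a coarse-rank + re-rank pipeline; all values are
# illustrative placeholders, not DiDi's production model.
candidates = [
    {"route": ["A", "B", "D"], "static_cost": 10.0, "eta_min": 14.0, "price": 30.0},
    {"route": ["A", "C", "D"], "static_cost": 11.0, "eta_min": 12.0, "price": 32.0},
]

# Stage 1: coarse ranking by static graph cost (Dijkstra-style weights).
coarse = sorted(candidates, key=lambda c: c["static_cost"])

# Stage 2: ML-style re-ranking with richer features
# (a fixed linear score stands in for the learned model).
def rerank_score(c):
    return 1.0 * c["eta_min"] + 0.1 * c["price"]

best = min(coarse, key=rerank_score)
print(best["route"])  # the re-ranker can overturn the coarse order
```

Here the re-ranker promotes the route with the higher static cost because its dynamic ETA is better, which is exactly the behavior a static-weight Dijkstra cannot express.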
Leveraging massive trajectory data, DiDi explores a deep‑reinforcement‑learning (DRL) approach to generate routes directly.
First, a behavior‑cloning model treats each intersection decision as a classification problem, using expert (historical) trajectories as positive samples. An LSTM network predicts the probability of each candidate successor link, achieving over 98% decision accuracy at individual intersections.
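The classification framing can be illustrated with a toy behavior-cloning step. A one-layer softmax model stands in for the LSTM here, and the features and "expert" labels are synthetic:

```python
import numpy as np

# Toy behavior cloning: each intersection decision is a classification over
# candidate successor links. Features and expert labels are synthetic.
rng = np.random.default_rng(0)
n_feat, n_links = 4, 3
X = rng.normal(size=(200, n_feat))           # features of the current state
W_true = rng.normal(size=(n_feat, n_links))
y = (X @ W_true).argmax(axis=1)              # "expert" next-link choices

W = np.zeros((n_feat, n_links))
for _ in range(500):                         # cross-entropy gradient descent
    logits = X @ W
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(y)), y] -= 1.0           # softmax gradient: P - one_hot(y)
    W -= 0.1 * X.T @ P / len(y)

acc = ((X @ W).argmax(axis=1) == y).mean()   # agreement with expert choices
print(acc > 0.8)
```

In production the per-decision accuracy comes from an LSTM that also conditions on the trajectory so far, which this stateless sketch omits.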
To avoid locally optimal but globally sub‑optimal routes, a beam‑search decoder maintains a set of top‑k candidate paths, similar to techniques used in neural machine translation.
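Beam search over next-link probabilities can be sketched directly; the tiny probability table below is invented for illustration and chosen so that the greedy path is not the best one:

```python
import heapq
import math

# Hypothetical next-link probabilities at each state (illustrative only).
probs = {
    "S": {"A": 0.60, "B": 0.40},
    "A": {"D": 0.50, "E": 0.50},
    "B": {"D": 0.95, "E": 0.05},
}

def beam_search(start, steps, k):
    beams = [(0.0, [start])]                  # (sum of log-probs, path)
    for _ in range(steps):
        expanded = []
        for logp, path in beams:
            for nxt, p in probs.get(path[-1], {}).items():
                expanded.append((logp + math.log(p), path + [nxt]))
        beams = heapq.nlargest(k, expanded, key=lambda b: b[0])  # keep top-k
    return beams

best = beam_search("S", steps=2, k=2)[0][1]
print(best)  # ['S', 'B', 'D']
```

Greedy decoding would commit to A (probability 0.60) and end with at most 0.30 total probability, while the beam keeps B alive and finds the S→B→D path with probability 0.38, which is the "locally optimal but globally sub-optimal" trap the decoder avoids.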
Behavior cloning alone suffers from distribution shift: when the agent makes a wrong decision, it may enter unseen states and accumulate errors, leading to “one‑step‑wrong‑all‑wrong” trajectories.
DiDi therefore augments the model with reinforcement learning. After pre‑training with behavior cloning, the agent generates self‑play trajectories; matching the user’s action yields a +1 reward, mismatches yield 0. This Q‑learning‑style reward avoids adversarial training and reduces computational cost.
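The match-based reward and a tabular Q-learning update can be sketched as follows; the states, the recorded user decisions, and the hyperparameters are all illustrative:

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5        # discount and learning rate (illustrative)
Q = defaultdict(float)

# Historical user decisions at each state (illustrative).
expert_action = {"s0": "left", "s1": "straight"}

def reward(state, action):
    # +1 when the self-play action matches the user's action, else 0.
    return 1.0 if expert_action.get(state) == action else 0.0

def q_update(s, a, s_next, next_actions):
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    target = reward(s, a) + GAMMA * best_next
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

q_update("s1", "straight", "end", [])                # matches the user: reward 1
q_update("s0", "left", "s1", ["straight", "right"])  # bootstraps from s1
print(round(Q[("s0", "left")], 3))  # 0.725
```

Because the reward is a simple binary match against logged trajectories, no discriminator network is needed, which is what keeps this cheaper than adversarial imitation-learning setups.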
The iterative self‑play pipeline continuously refines the policy network, improving trajectory overlap with real trips and accelerating recovery from off‑policy deviations.
Overall, the DRL‑based route generation reduces deviation (off‑route) rates and demonstrates the feasibility of end‑to‑end AI‑driven routing for large‑scale ride‑hailing platforms.
Didi Tech
Official Didi technology account