Deep Reinforcement Learning for Route Planning in DiDi Ride‑Hailing
DiDi’s route engine, which handles over 40 billion routing requests daily, explores replacing static graph algorithms with a deep‑reinforcement‑learning system that first learns intersection decisions via behavior‑cloning LSTM models and then refines them through self‑play Q‑learning, using beam‑search decoding to produce near‑optimal, low‑deviation routes for ride‑hailing.
DiDi's route engine processes over 40 billion routing requests daily, making high‑quality path planning critical for driver and passenger experience.
The road network is modeled as links with unique IDs and attributes; Beijing alone contains about 2 million links and orders may involve dozens of intersections, making optimal routing under strict time constraints challenging.
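To make the link-based model concrete, here is a minimal sketch of what a link record might look like. The field names and the example values are illustrative assumptions, not DiDi's actual schema:

```python
from dataclasses import dataclass

# Hypothetical link schema; field names are illustrative, not DiDi's format.
@dataclass(frozen=True)
class Link:
    link_id: int        # unique identifier
    length_m: float     # link length in meters
    speed_kmh: float    # current (dynamic) speed estimate
    successors: tuple   # link_ids reachable at the downstream intersection

    def travel_time_s(self) -> float:
        """Estimated traversal time under the current speed."""
        return self.length_m / (self.speed_kmh / 3.6)

# A route is then a sequence of link IDs; at each intersection the planner
# chooses one of the current link's successors.
link = Link(link_id=42, length_m=500.0, speed_kmh=36.0, successors=(43, 44))
print(round(link.travel_time_s(), 1))  # 500 m at 10 m/s -> 50.0
```

With roughly 2 million such links in Beijing alone and dozens of successor choices per trip, the decision space per order is large even before traffic dynamics are considered.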
Unlike navigation that optimizes a single objective such as distance or travel time, ride‑hailing routing must balance travel time, distance, price, and safety, offering passengers multiple route options and allowing drivers to follow platform‑specified routes.
Traditional static‑weight graph algorithms (e.g., Dijkstra) are insufficient for dynamic traffic conditions. DiDi currently uses a two‑stage pipeline: a graph‑based coarse ranking followed by a machine‑learning‑based re‑ranking.
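The two-stage idea can be sketched in a few lines. The candidate routes, features, and re-ranking weights below are invented placeholders standing in for the graph search and the learned ranker:

```python
# Minimal sketch of a coarse-rank + re-rank pipeline; all values are
# illustrative placeholders, not DiDi's production model.
candidates = [
    {"route": ["A", "B", "D"], "static_cost": 10.0, "eta_min": 14.0, "price": 30.0},
    {"route": ["A", "C", "D"], "static_cost": 11.0, "eta_min": 12.0, "price": 32.0},
]

# Stage 1: coarse ranking by static graph cost (Dijkstra-style weights).
coarse = sorted(candidates, key=lambda c: c["static_cost"])

# Stage 2: ML-style re-ranking with richer features
# (a fixed linear score stands in for the learned model).
def rerank_score(c):
    return 1.0 * c["eta_min"] + 0.1 * c["price"]

best = min(coarse, key=rerank_score)
print(best["route"])  # the re-ranker can overturn the coarse order
```

Here the re-ranker promotes the route with the higher static cost because its dynamic ETA is better, which is exactly the behavior a static-weight Dijkstra cannot express.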
Leveraging massive trajectory data, DiDi explores a deep‑reinforcement‑learning (DRL) approach to generate routes directly.
First, a behavior‑cloning model treats each intersection decision as a classification problem, using expert (historical) trajectories as positive samples. An LSTM network predicts the probability of each candidate successor link, achieving over 98% decision accuracy at individual intersections.
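The classification framing can be illustrated with a toy behavior-cloning step. A one-layer softmax model stands in for the LSTM here, and the features and "expert" labels are synthetic:

```python
import numpy as np

# Toy behavior cloning: each intersection decision is a classification over
# candidate successor links. Features and expert labels are synthetic.
rng = np.random.default_rng(0)
n_feat, n_links = 4, 3
X = rng.normal(size=(200, n_feat))           # features of the current state
W_true = rng.normal(size=(n_feat, n_links))
y = (X @ W_true).argmax(axis=1)              # "expert" next-link choices

W = np.zeros((n_feat, n_links))
for _ in range(500):                         # cross-entropy gradient descent
    logits = X @ W
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(y)), y] -= 1.0           # softmax gradient: P - one_hot(y)
    W -= 0.1 * X.T @ P / len(y)

acc = ((X @ W).argmax(axis=1) == y).mean()   # agreement with expert choices
print(acc > 0.8)
```

In production the per-decision accuracy comes from an LSTM that also conditions on the trajectory so far, which this stateless sketch omits.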
To avoid locally optimal but globally sub‑optimal routes, a beam‑search decoder maintains a set of top‑k candidate paths, similar to techniques used in neural machine translation.
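Beam search over next-link probabilities can be sketched directly; the tiny probability table below is invented for illustration and chosen so that the greedy path is not the best one:

```python
import heapq
import math

# Hypothetical next-link probabilities at each state (illustrative only).
probs = {
    "S": {"A": 0.60, "B": 0.40},
    "A": {"D": 0.50, "E": 0.50},
    "B": {"D": 0.95, "E": 0.05},
}

def beam_search(start, steps, k):
    beams = [(0.0, [start])]                  # (sum of log-probs, path)
    for _ in range(steps):
        expanded = []
        for logp, path in beams:
            for nxt, p in probs.get(path[-1], {}).items():
                expanded.append((logp + math.log(p), path + [nxt]))
        beams = heapq.nlargest(k, expanded, key=lambda b: b[0])  # keep top-k
    return beams

best = beam_search("S", steps=2, k=2)[0][1]
print(best)  # ['S', 'B', 'D']
```

Greedy decoding would commit to A (probability 0.60) and end with at most 0.30 total probability, while the beam keeps B alive and finds the S→B→D path with probability 0.38, which is the "locally optimal but globally sub-optimal" trap the decoder avoids.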
Behavior cloning alone suffers from distribution shift: when the agent makes a wrong decision, it may enter unseen states and accumulate errors, leading to “one‑step‑wrong‑all‑wrong” trajectories.
DiDi therefore augments the model with reinforcement learning. After pre‑training with behavior cloning, the agent generates self‑play trajectories; matching the user’s action yields a +1 reward, mismatches yield 0. This Q‑learning‑style reward avoids adversarial training and reduces computational cost.
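The match-based reward and a tabular Q-learning update can be sketched as follows; the states, the recorded user decisions, and the hyperparameters are all illustrative:

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5        # discount and learning rate (illustrative)
Q = defaultdict(float)

# Historical user decisions at each state (illustrative).
expert_action = {"s0": "left", "s1": "straight"}

def reward(state, action):
    # +1 when the self-play action matches the user's action, else 0.
    return 1.0 if expert_action.get(state) == action else 0.0

def q_update(s, a, s_next, next_actions):
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    target = reward(s, a) + GAMMA * best_next
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

q_update("s1", "straight", "end", [])                # matches the user: reward 1
q_update("s0", "left", "s1", ["straight", "right"])  # bootstraps from s1
print(round(Q[("s0", "left")], 3))  # 0.725
```

Because the reward is a simple binary match against logged trajectories, no discriminator network is needed, which is what keeps this cheaper than adversarial imitation-learning setups.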
The iterative self‑play pipeline continuously refines the policy network, improving trajectory overlap with real trips and accelerating recovery from off‑policy deviations.
Overall, the DRL‑based route generation reduces deviation (off‑route) rates and demonstrates the feasibility of end‑to‑end AI‑driven routing for large‑scale ride‑hailing platforms.
Didi Tech
Official Didi technology account