
Design and Optimization of Didi's Spatial‑Temporal Supply‑Demand System

Didi’s redesigned Spatial‑Temporal Supply‑Demand System replaces a single‑Redis bottleneck with a multi‑cluster routing layer, semantic sharding, multi‑level caching, and delayed queues. The result: better horizontal scalability, fault isolation, ~30% lower peak latency, higher cache hit rates, fewer query nodes, and faster, code‑free feature configuration.

Didi Tech

Background

The Spatial‑Temporal Supply‑Demand System (SDS) was built to support Didi’s ride‑hailing business by calculating and storing massive supply‑demand features at various spatial (grid, district, city) and temporal (instant, minute, hour) granularities. These features feed real‑time algorithm models and are also persisted for offline training.
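To make the spatial and temporal granularities concrete, a feature value can be identified by a composite key. The layout, field names, and encoding below are illustrative assumptions, not Didi's actual schema:

```go
package main

import "fmt"

// Spatial and temporal granularities named in the text.
type SpatialGran string
type TemporalGran string

const (
	Grid     SpatialGran  = "grid"
	District SpatialGran  = "district"
	City     SpatialGran  = "city"
	Instant  TemporalGran = "instant"
	Minute   TemporalGran = "minute"
	Hour     TemporalGran = "hour"
)

// FeatureKey identifies one supply-demand feature value at a given
// spatial-temporal granularity. The key schema is hypothetical.
type FeatureKey struct {
	Feature   string // feature name, e.g. "order_cnt"
	Spatial   SpatialGran
	SpatialID string // grid / district / city identifier
	Temporal  TemporalGran
	Bucket    int64 // time-bucket index at the chosen granularity
}

// RedisKey flattens the key into a storage key string.
func (k FeatureKey) RedisKey() string {
	return fmt.Sprintf("%s:%s:%s:%s:%d",
		k.Feature, k.Spatial, k.SpatialID, k.Temporal, k.Bucket)
}

func main() {
	k := FeatureKey{Feature: "order_cnt", Spatial: Grid,
		SpatialID: "g_1024", Temporal: Minute, Bucket: 28275301}
	fmt.Println(k.RedisKey()) // order_cnt:grid:g_1024:minute:28275301
}
```

A single logical feature thus fans out into many such keys, one per spatial‑temporal cell, which is what makes the fan‑out QPS figures below so large.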

System Framework Evolution

2.1 Limitations of the legacy framework

Single‑Redis cluster limits horizontal scaling; expansion benefits diminish as the cluster grows.

Lack of failover – a cluster‑level outage leads to long service recovery times.

Performance bottlenecks: query QPS exceeds 5 million and fan‑out QPS reaches 8 million against a 15 ms SLA; p99 latency exceeds the SLA during peaks.

R&D efficiency suffers because complex feature semantics require custom code, extending development cycles.

2.2 Advantages of the new framework

Storage layer now supports multiple Redis clusters via a routing layer, improving horizontal scalability and fault isolation.

Multi‑level caching, feature‑compute separation, and delayed‑queue replacement for scheduled tasks reduce load spikes and improve latency.

Component‑oriented feature production enables full‑process configuration, dramatically shortening iteration time.

System Design Considerations

3.1 Storage Governance

The new architecture splits the original Redis cluster into several smaller clusters and introduces a routing layer that maps feature keys to target clusters. This design achieves:

Better horizontal scalability of the storage layer.

Higher availability – failures are isolated to individual clusters, and hot‑updates enable seamless failover.

Data‑sharding strategies considered:

Hash‑based sharding: balances data size but can cause massive fan‑out when a feature is queried across many spatial‑temporal dimensions.

Semantic‑based sharding: groups features by their semantic tag, reducing fan‑out and providing “fast‑slow” isolation, at the cost of higher configuration overhead.
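The two strategies can be combined in the routing layer: route by semantic tag when one is configured, and fall back to hashing otherwise. A minimal Go sketch (tag names and cluster addresses are made up for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Router maps a feature key to one of several Redis clusters.
// Semantic routing is tried first; unknown tags fall back to hashing.
type Router struct {
	semantic map[string]string // semantic tag -> cluster address
	clusters []string          // all clusters, for the hash fallback
}

// Route returns the target cluster for a key and its semantic tag.
func (r *Router) Route(key, tag string) string {
	if c, ok := r.semantic[tag]; ok {
		// Semantic sharding: all features sharing a tag stay together,
		// so one logical query touches one cluster (low fan-out).
		return c
	}
	// Hash sharding: balances data size across clusters.
	h := fnv.New32a()
	h.Write([]byte(key))
	return r.clusters[int(h.Sum32())%len(r.clusters)]
}

func main() {
	r := &Router{
		semantic: map[string]string{
			"supply": "redis-supply:6379",
			"demand": "redis-demand:6379",
		},
		clusters: []string{"redis-0:6379", "redis-1:6379"},
	}
	fmt.Println(r.Route("order_cnt:grid:g_1:minute:1", "demand"))
}
```

Keeping the semantic map hot‑reloadable is what allows the seamless failover mentioned above: repointing a tag to a standby cluster is a configuration change, not a deploy.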

Implementation details (data‑parser configuration example):

{
  "data_parser": "json",
  "parser_conf": [
    {"field": "order_id", "jpath": "info.order_id", "type": "int"},
    {"field": "city_id", "jpath": "info.city.city_name", "type": "string"}
  ]
}

Another snippet shows rule‑engine configuration:

{
  "rule_engine": "default",
  "rule_engine_conf": "city_id == 'abc'"
}
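The rule engine evaluates such expressions against the parsed fields. A deliberately minimal sketch that handles only the equality form shown above (a real rule engine would support a richer grammar):

```go
package main

import (
	"fmt"
	"strings"
)

// evalRule evaluates a "field == 'value'" expression against a record,
// mimicking the rule_engine_conf snippet. Equality only; no boolean
// operators or numeric comparisons.
func evalRule(expr string, record map[string]string) (bool, error) {
	parts := strings.SplitN(expr, "==", 2)
	if len(parts) != 2 {
		return false, fmt.Errorf("unsupported expression: %q", expr)
	}
	field := strings.TrimSpace(parts[0])
	value := strings.Trim(strings.TrimSpace(parts[1]), "'")
	return record[field] == value, nil
}

func main() {
	record := map[string]string{"city_id": "abc", "order_id": "42"}
	ok, _ := evalRule("city_id == 'abc'", record)
	fmt.Println(ok) // true: the record passes the rule
}
```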

3.2 Performance Optimization

Local cache on query nodes stores static features, cutting Redis request volume.

Pre‑aggregation of high‑QPS features reduces fan‑out to Redis.

Delayed queues smooth periodic tasks, eliminating request spikes.

Key results after optimization:

Feature query node count reduced by 20%.

Static‑feature cache hit rate increased by 20%.

Redis p99 latency dropped ~30% during peaks.

3.3 Development Efficiency – Configuration Capability Upgrade

The legacy system required extensive custom code for complex feature definitions. The new approach abstracts the feature production pipeline into reusable components (data parser, rule engine, etc.) and orchestrates them via declarative configuration, enabling:

Component‑level horizontal scaling.

Full‑process configuration without code changes.

Summary

Architecture decisions must align with business maturity. Early‑stage low‑traffic systems can rely on a single storage cluster, while mature, high‑traffic services benefit from multi‑cluster routing, component‑based feature production, and extensive performance tuning. Maintaining clean code, avoiding excessive allocations, and applying systematic profiling (e.g., Go pprof) are essential for sustaining performance at scale.
