Big Data 9 min read

Evolution and Architecture of Didi's Real-Time Computing Platform

From early self‑built Storm and Spark Streaming clusters to a unified YARN‑based Spark platform and finally a low‑latency Flink system with extended CEP and StreamSQL capabilities, Didi’s real‑time computing platform evolved through three stages, delivering multi‑tenant isolation, rich SQL processing, and dramatically reduced development costs.

Didi Tech
Didi Tech
Didi Tech
Evolution and Architecture of Didi's Real-Time Computing Platform

Didi Chuxing, as an internet company in the ride‑hailing domain, relies on real‑time online services, generating massive real‑time data and computation scenarios. This article introduces the evolution of Didi’s real‑time computing platform and its architectural practices.

/01/ Real‑time Computing Evolution Didi’s real‑time computing architecture has gone through three stages: (1) business‑side self‑built small clusters using engines such as Storm, JStorm, Spark Streaming, Samza; (2) a centralized large cluster with platformization; (3) SQL‑based processing. The first stage suffered from low resource utilization, weak monitoring, high maintenance cost, and poor technology sharing.

In early 2017 Didi built a unified real‑time cluster and platform, selecting Spark Streaming on YARN for large‑scale data cleaning, and implementing multi‑tenant isolation, authentication, resource isolation, and billing. Two‑level isolation was introduced: CGroup‑based container isolation and physical‑machine isolation via YARN’s FairScheduler with node labels.

Later, to meet low‑latency requirements, Didi introduced Flink with native streaming, offering millisecond‑level latency and rich window functions. Flink powered the largest traffic‑gateway monitoring system and supported use cases such as passenger‑position change notifications and trajectory anomaly detection.

/02/ Real‑time Computing Platform Architecture The platform provides a StreamSQL IDE, monitoring and alerting, diagnostic dashboards, lineage tracking, and multi‑tenant task management, all accessible via a web interface.

/03/ Real‑time Rule Matching Service (CEP) Didi extended Flink CEP with a wait operator, a Groovy/Aviator‑based DSL, multi‑rule support with dynamic updates, and performance optimizations such as SharedBuffer reconstruction, access caching, simplified event‑time handling, and condition context reuse. These enhancements improved CEP performance by several orders of magnitude and are used in personalized operations and real‑time anomaly detection.

/04/ StreamSQL Construction Building on Flink SQL, Didi added extended DDL for internal message queues and storage systems, a rich set of UDFs (string, date, map, spatial), split‑stream syntax, TTL‑based join semantics, and a web‑based StreamSQL IDE with syntax checking, debugging, and diagnostics. StreamSQL is expected to handle about 80 % of Didi’s streaming workload.

/05/ Summary Over the past year, Didi has built a centralized real‑time computing platform, migrated from self‑built clusters to a unified Spark Streaming cluster, adopted Flink for low‑latency streaming, and delivered CEP and StreamSQL services that dramatically lower development cost. Future work includes expanding StreamSQL, exploring batch‑stream convergence, IoT, and real‑time machine learning.

CEPbig dataFlinkStreamingreal-time computingStreamSQLSpark Streaming
Didi Tech
Written by

Didi Tech

Official Didi technology account

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.