
How Didi Built a Scalable Real‑Time Computing Platform with Spark, Flink, and StreamSQL

This article outlines Didi's journey from fragmented, self‑built real‑time clusters to a unified, YARN‑managed platform that leverages Spark Streaming, Flink, and StreamSQL, detailing architectural choices, resource isolation, CEP enhancements, and the resulting impact on latency‑critical services.

dbaplus Community

1. Evolution of Real‑Time Computing

Didi’s ride‑hailing service generates massive volumes of real‑time data, and its real‑time computing architecture has evolved through three stages:

Business units built independent small clusters (using Storm, JStorm, Spark Streaming, Samza).

A centralized large‑scale platform on YARN.

A SQL‑centric stage based on Flink SQL.

The self‑built clusters suffered from low resource utilization, poor monitoring, high maintenance effort, and limited knowledge sharing. In early 2017, Didi consolidated them into a single platform, choosing Spark Streaming (micro‑batch) on YARN and adding multi‑tenant isolation, authentication, and billing.

2. Platform Architecture

To meet stricter stability requirements, Didi introduced a two‑layer isolation model:

Container‑level isolation using Linux CGroup for CPU and memory limits.

Physical‑machine level isolation via YARN node labels. The FairScheduler was extended to support node labels, allowing ordinary jobs to share a label while special jobs run on dedicated machines.
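The label‑based placement rule can be illustrated with a small sketch. This is not YARN or FairScheduler code; the `Node`, `Job`, and `place` names and the label strings are hypothetical, and the only point shown is that a job is placed solely on machines carrying its label, so dedicated machines never receive ordinary jobs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    label: str        # node label, e.g. "shared" or "dedicated" (hypothetical values)
    free_vcores: int


@dataclass
class Job:
    name: str
    label: str        # label this job is allowed to run on
    vcores: int


def place(job: Job, nodes: List[Node]) -> Optional[str]:
    """Return the first node with a matching label and enough free vcores."""
    for node in nodes:
        if node.label == job.label and node.free_vcores >= job.vcores:
            node.free_vcores -= job.vcores
            return node.name
    # No matching node has capacity: the job waits instead of
    # spilling onto machines that carry a different label.
    return None


nodes = [Node("n1", "shared", 8), Node("n2", "dedicated", 8)]
first = place(Job("etl", "shared", 4), nodes)         # lands on the shared node
second = place(Job("billing", "dedicated", 4), nodes)  # lands on the dedicated node
third = place(Job("big", "shared", 6), nodes)          # shared node has only 4 vcores left
```

In this toy model the third job is rejected even though the dedicated machine has spare capacity, which is the isolation property the node labels are meant to provide.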

The platform provides a web‑based StreamSQL IDE, task‑level monitoring & alarm, diagnostic dashboards (traffic, checkpoint, GC, resource usage), lineage tracking, and multi‑tenant task management (submission, start/stop, asset management).

3. Real‑Time Rule Matching (CEP) Service

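Didi needed to detect complex temporal patterns such as “a passenger bubbles for 10 seconds without ordering” (in Didi’s usage, a “bubble” is a fare‑estimate request made before an order is placed). The open‑source Flink CEP library lacked a descriptive rule language and dynamic rule updates, so Didi extended it with: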

A wait operator to express temporal dependencies.

A DSL built on Groovy and Aviator for rule definition.

Support for multiple rules per job and hot‑rule updates.
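To make the "bubble without order" pattern concrete, here is a minimal sketch in plain Python. It is not Didi's wait operator or the Flink CEP API; the event tuple layout and the function name `bubbles_without_order` are assumptions for illustration. Each event's timestamp is used as the clock, and pending entries are scanned on every event purely for simplicity.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (timestamp_seconds, passenger_id, kind)


def bubbles_without_order(
    events: Iterable[Event], wait_seconds: int = 10
) -> List[Tuple[str, int]]:
    """Return (passenger_id, bubble_time) for bubbles with no order inside the wait window.

    Events must arrive sorted by timestamp. An order counts only if it arrives
    strictly less than wait_seconds after the bubble.
    """
    pending: Dict[str, int] = {}           # passenger_id -> time of the open bubble
    matches: List[Tuple[str, int]] = []
    for ts, passenger, kind in events:
        # Fire every pending bubble whose wait window has elapsed by now.
        for pid, start in list(pending.items()):
            if ts - start >= wait_seconds:
                matches.append((pid, start))
                del pending[pid]
        if kind == "bubble":
            pending[passenger] = ts        # a repeated bubble restarts the window
        elif kind == "order":
            pending.pop(passenger, None)   # order arrived in time: cancel the wait
    return matches


events = [
    (0, "p1", "bubble"),
    (3, "p2", "bubble"),
    (5, "p2", "order"),
    (12, "p3", "bubble"),
    (30, "p4", "bubble"),
]
result = bubbles_without_order(events)
```

Bubbles still pending when the input ends are not reported, because no later event has shown that their window elapsed; a real engine would rely on watermarks or timers for that.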

Performance optimizations included:

Rebuilding SharedBuffer using Flink MapState to reduce state interaction.

Adding an access cache (contributed upstream) to delay reference‑count updates.

Simplifying event‑time handling to avoid full key scans on watermark updates.

Reusing conditionContext to avoid repeated partial‑match lookups (contributed upstream).
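The source does not say how the full key scan on watermark updates was avoided. One common technique, shown here only as an assumed illustration, is to index deadlines in a min‑heap so that a watermark advance touches just the keys that are due. The `TimerIndex` class and its method names are hypothetical and are not Flink APIs.

```python
import heapq
from typing import Dict, List, Tuple


class TimerIndex:
    """Tracks one deadline per key so a watermark advance visits only due keys."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, str]] = []  # (deadline, key), smallest first
        self._deadline: Dict[str, int] = {}     # current deadline per key

    def register(self, key: str, deadline: int) -> None:
        # Re-registering a key supersedes its earlier deadline.
        self._deadline[key] = deadline
        heapq.heappush(self._heap, (deadline, key))

    def cancel(self, key: str) -> None:
        # Lazy deletion: the stale heap entry is skipped when it surfaces.
        self._deadline.pop(key, None)

    def advance_watermark(self, watermark: int) -> List[str]:
        due: List[str] = []
        while self._heap and self._heap[0][0] <= watermark:
            deadline, key = heapq.heappop(self._heap)
            # Ignore entries that were cancelled or superseded.
            if self._deadline.get(key) == deadline:
                del self._deadline[key]
                due.append(key)
        return due


timers = TimerIndex()
timers.register("a", 10)
timers.register("b", 20)
timers.register("c", 15)
timers.cancel("c")
timers.register("a", 12)               # supersedes the deadline of 10
at_11 = timers.advance_watermark(11)
at_15 = timers.advance_watermark(15)
at_25 = timers.advance_watermark(25)
```

The work done per watermark is proportional to the number of heap entries that are due, not to the total number of keys held in state.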

These changes yielded speedups of multiple orders of magnitude, and the CEP service now powers personalized ride‑hailing operations and real‑time anomaly detection.

4. StreamSQL Development

Inspired by Hive’s role in batch processing, Didi launched StreamSQL in 2018 to lower the barrier for stream development. Built on Flink SQL, StreamSQL adds:

Extended DDL that integrates internal message queues (e.g., Kafka, Pulsar) and real‑time storage systems, with built‑in parsers for JSON, binlog, and log formats.

Rich UDF library covering string, date, map, and spatial processing.

Split‑stream syntax allowing a single source to emit multiple output streams (implemented by extending Calcite).

TTL‑based join semantics that replace window joins with state TTL, supporting both stream‑stream and stream‑dimension joins.

StreamSQL IDE – a web UI offering syntax checking, debugging, and diagnostics.
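The idea behind TTL‑based joins can be sketched without any streaming framework. The code below is an illustration under stated assumptions, not StreamSQL or Flink code: the record layout, the function name `ttl_join`, and the choice to expire rows lazily on access are all hypothetical. Each row stays joinable for `ttl` time units after it arrives, with no window boundaries involved.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, str, str, str]  # (timestamp, side "L" or "R", join_key, payload)


def ttl_join(records: Iterable[Record], ttl: int) -> List[Tuple[str, str, str]]:
    """Inner-join two streams on join_key, keeping each row in state for ttl time units.

    Records must arrive sorted by timestamp. Output rows are (key, left, right).
    """
    state: Dict[str, Dict[str, List[Tuple[int, str]]]] = {"L": {}, "R": {}}
    out: List[Tuple[str, str, str]] = []
    for ts, side, key, payload in records:
        other = "R" if side == "L" else "L"
        # Lazily drop rows on the other side whose TTL has elapsed.
        live = [(t, p) for t, p in state[other].get(key, []) if ts - t < ttl]
        if live:
            state[other][key] = live
        else:
            state[other].pop(key, None)
        # Emit one joined row per surviving match.
        for _, other_payload in live:
            if side == "L":
                out.append((key, payload, other_payload))
            else:
                out.append((key, other_payload, payload))
        # Remember this row so later rows from the other side can match it.
        state[side].setdefault(key, []).append((ts, payload))
    return out


records = [
    (0, "L", "o1", "order-created"),
    (5, "R", "o1", "driver-assigned"),
    (100, "R", "o1", "late-update"),
    (100, "L", "o2", "order-created"),
]
joined = ttl_join(records, ttl=60)
```

In this sketch, rows for keys that are never accessed again stay in memory; a production engine would also need background cleanup of expired state.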

Combined with the platform’s monitoring, alarm, and task control features, StreamSQL has already reduced development effort dramatically and is expected to handle roughly 80% of Didi’s streaming workload.

5. Conclusion

The need for low‑latency, high‑throughput real‑time computation drove Didi to build a centralized YARN‑managed platform, migrate latency‑critical jobs from Spark Streaming to Flink, create a high‑performance CEP service, and launch StreamSQL. Future work includes extending StreamSQL toward batch‑stream convergence, IoT ingestion, and real‑time machine‑learning pipelines.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
