
Didi's Real-Time Computing Practices with Apache Flink: Architecture, StreamSQL, and Operational Insights

Senior Didi technology expert Liang Li-yin shares how Didi leverages Apache Flink for large‑scale real‑time computing, covering service architecture, StreamSQL advantages, multi‑cluster management, task control, monitoring, meta‑store integration, challenges, and future plans such as high availability, real‑time ML, and unified batch‑stream processing.

DataFunTalk

Apache Flink is a distributed big‑data processing engine capable of stateful computation on both bounded and unbounded streams. Didi has heavily optimized Flink and added features such as extended DDL, built‑in message format parsing, and custom UDX to meet its business needs.

1. Didi Big‑Data Service Architecture – Didi builds a complete big‑data ecosystem that includes offline and real‑time systems (HBase, Elasticsearch, Kafka, etc.). On top of Flink, Didi mainly develops StreamSQL, a SQL‑based service for stream processing.

2. Evolution of Stream Computing at Didi – Before 2017, Didi used various engines (Storm, Spark Streaming, Samza). In 2017 the company consolidated to a large service‑oriented cluster and introduced Flink for low‑latency workloads. By 2018 StreamSQL was created, and by 2019 Flink became the primary engine, handling over 3,000 tasks and processing trillions of events daily.

3. Scale and Scenarios – More than 50 real‑time services run on thousands of nodes, processing over a trillion records per day. Core scenarios include real‑time monitoring, data synchronization, feature extraction for dispatch, and various real‑time business functions such as location updates, anomaly detection, and personalized coupon distribution.

4. Multi‑Cluster Management – Didi adds a routing layer on YARN to present a single logical cluster, uses labels for isolation, and customizes the YARN scheduler (CPU‑based for real‑time, throughput‑oriented for batch). This enables fine‑grained resource control across many physical clusters.
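The routing idea can be illustrated with a small sketch. It is not Didi's implementation: the class names, the label strings, and the "most free vcores" placement rule are all assumptions made for illustration. It only shows how a thin layer can hide several physical clusters behind one submission call while honoring isolation labels.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalCluster:
    name: str
    labels: set[str]      # isolation labels served by this cluster (hypothetical)
    free_vcores: int      # remaining CPU capacity


@dataclass
class ClusterRouter:
    """Presents several physical clusters as one logical submission target."""
    clusters: list[PhysicalCluster] = field(default_factory=list)

    def route(self, label: str, vcores: int) -> str:
        # Keep only clusters that carry the requested label
        # and still have enough CPU for the job.
        candidates = [
            c for c in self.clusters
            if label in c.labels and c.free_vcores >= vcores
        ]
        if not candidates:
            raise LookupError(
                f"no cluster with label {label!r} has {vcores} free vcores"
            )
        # Assumed placement rule: pick the cluster with the most free CPU,
        # then reserve the requested cores on it.
        target = max(candidates, key=lambda c: c.free_vcores)
        target.free_vcores -= vcores
        return target.name
```

A caller submits against the router with a label and a CPU request, and never names a physical cluster directly; that is the property the routing layer is meant to provide.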

5. Advantages of StreamSQL

Declarative language – users describe business logic without dealing with low‑level details.

Stable interface – SQL syntax remains consistent across Flink versions.

Easy troubleshooting – clear syntax helps locate errors.

Batch‑stream integration – shared syntax with HiveSQL/SparkSQL.

Low entry barrier – simple to learn and adopt.

StreamSQL also provides built-in DDL for common sources and message formats (such as Kafka, binlog, and JSON) and an extensible UDX mechanism for custom functions. It supports joins (TTL-based stream joins and dimension-table joins) and parsing of multiple message formats.
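To make the TTL-based join concrete, here is a minimal sketch in plain Python rather than StreamSQL. The `TtlJoin` class, its method names, and the caller-supplied clock are assumptions for illustration, not part of StreamSQL or Flink. Each side buffers rows per key, rows older than the TTL are dropped, and a new row is matched against whatever the other side still holds.

```python
from collections import defaultdict


class TtlJoin:
    """Inner join of two keyed streams; buffered rows expire after `ttl` seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        # side -> key -> list of (arrival_time, row)
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def _expire(self, now: float) -> None:
        # Drop every buffered row whose age exceeds the TTL.
        for buffers in self.state.values():
            for key in list(buffers):
                buffers[key] = [
                    (t, r) for t, r in buffers[key] if now - t <= self.ttl
                ]
                if not buffers[key]:
                    del buffers[key]

    def on_event(self, side: str, key: str, row: dict, now: float):
        self._expire(now)
        other = "right" if side == "left" else "left"
        self.state[side][key].append((now, row))
        matches = [r for _, r in self.state[other].get(key, [])]
        # Always emit (left_row, right_row), whichever side arrived last.
        return [(row, m) if side == "left" else (m, row) for m in matches]
```

The TTL bounds how much state the join keeps: without it, every unmatched row would stay buffered indefinitely, which is one source of the large-state problem listed under Challenges.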

6. StreamSQL IDE – An integrated development environment offering SQL templates, UDF documentation, syntax checking, online debugging (uploading test data or sampling Kafka sources), and version management for task upgrades and rollbacks.
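The version management described here can be pictured as an append-only history with a movable "active" pointer. The sketch below is an assumption about how such bookkeeping might look, not the IDE's actual design; `TaskVersions` and its methods are hypothetical names.

```python
class TaskVersions:
    """Tracks every deployed SQL text of one task and which one is active."""

    def __init__(self):
        self._history: list[str] = []  # all deployed SQL texts, oldest first
        self._active = -1              # index of the active version, -1 if none

    def upgrade(self, sql: str) -> int:
        # A new deployment is appended and becomes the active version.
        self._history.append(sql)
        self._active = len(self._history) - 1
        return self._active

    def rollback(self) -> str:
        # Move the pointer one step back; history itself is never rewritten.
        if self._active <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1
        return self._history[self._active]

    @property
    def current(self) -> str:
        if self._active < 0:
            raise RuntimeError("task has never been deployed")
        return self._history[self._active]
```

Keeping old versions instead of overwriting them is what makes a rollback a cheap pointer move rather than a re-authoring step.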

7. Task Management and Operations

Web‑based lifecycle management (submit, stop, upgrade, rollback) with parameter tuning.

Enhanced log retrieval via Elasticsearch.

Comprehensive metric monitoring and alerting (restart, checkpoint failures, latency, etc.).

Lineage tracing across multiple hops (source → stream → sink) to diagnose issues.
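Multi-hop lineage tracing amounts to walking a dependency graph. The following sketch assumes lineage is stored as a mapping from each node to its direct downstream nodes; the function name and the node names in the usage example are hypothetical. Given a failing sink, it returns every upstream topic and job, nearest first.

```python
from collections import deque


def upstream_of(edges: dict[str, list[str]], node: str) -> list[str]:
    """Return all ancestors of `node`, nearest first.

    `edges` maps each node to its direct downstream nodes.
    """
    # Invert the graph so we can walk from a sink back toward its sources.
    parents: dict[str, list[str]] = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)

    seen = {node}          # guards against cycles and duplicate visits
    order: list[str] = []
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for parent in parents.get(cur, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = {
    "kafka.orders": ["job.clean"],
    "job.clean": ["kafka.orders_clean"],
    "kafka.orders_clean": ["job.agg"],
    "job.agg": ["es.dashboard"],
}
print(upstream_of(lineage, "es.dashboard"))
```

With the result in hand, an operator can check each hop in order (latency, restarts, checkpoint failures) instead of guessing which upstream task caused the problem at the sink.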

8. Challenges

Large state management – checkpoint overhead and lack of visibility into state health.

Business high availability – need for zero‑downtime upgrades and rapid fault diagnosis.

Multi‑language support – extending beyond Java/Scala to Go and Python for UDFs.

9. Future Plans

Provide a highly available real‑time computing service for all online business.

Explore real‑time machine learning with sub‑second model updates.

Build a real‑time data warehouse that aligns with batch reporting while delivering low‑latency insights.

Continue meta‑store unification to enable seamless batch‑stream integration across engines.

The session concludes with a thank‑you note and an invitation to join the DataFunTalk community for further big‑data and AI discussions.
