Inside Didi’s Real-Time Data Warehouse for Ride-Sharing: Architecture & Lessons
This article details Didi’s end‑to‑end construction of a real‑time data warehouse for the Ride‑Sharing (顺风车) business, covering motivations, layer‑by‑layer architecture, naming conventions, StreamSQL capabilities, operational tooling, achieved results, challenges, and future batch‑stream integration plans.
Background and Motivation
As Didi’s ride‑sharing services grew rapidly, the need for timely data to support fine‑grained operations and fast product iteration became critical. Real‑time data enables quicker decision making, improves user feedback loops, and supports intelligent business monitoring.
Purpose of the Real‑Time Data Warehouse
The goal is to complement traditional offline warehouses by delivering low‑latency data for scenarios where offline latency is unacceptable. Three factors drive the effort: urgent business demand for real‑time data, the lack of standardized real‑time pipelines, and falling development costs now that the platform tooling has matured.
Key Application Scenarios
Real‑time OLAP analysis using Flink Stream SQL, Kafka, Druid, and ClickHouse.
Live dashboards for order volume, coupon spend, city‑level metrics.
Business‑critical monitoring such as safety, finance, and complaint metrics.
Real‑time data‑service APIs for cross‑team data consumption.
Case Study: Didi Ride‑Sharing Real‑Time Warehouse
The data team worked closely with the Ride‑Sharing business line to build the warehouse iteratively. The result is a layered design with detail tables (ODS/DWD) and aggregated tables (DWM), a unified DWD layer, lower resource consumption, and high data reuse. The architecture diagram is shown below.
Layered Architecture Details
1. ODS (Source) Layer – Ingests raw binlog, public logs, and traffic event logs into Kafka or DDMQ. Naming follows patterns such as cn-binlog-{db-name}-{db-name} for auto‑generated topics and realtime_ods_binlog_{source}/{log} for custom topics.
2. DWD (Detail) Layer – Builds fine‑grained fact tables driven by business processes, adds selective redundancy for analytical convenience, and stores data in Kafka and Druid for downstream queries. Table names follow realtime_dwd_{biz}_{domain}_{process}_{tag} (e.g., realtime_dwd_trip_trd_order_base).
3. DIM (Dimension) Layer – Provides consistent dimension tables stored in MySQL, HBase, or Didi’s Fusion KV store, depending on size and query QPS requirements. Naming pattern: dim_{biz}_{dimension}[_{tag}] (e.g., dim_trip_dri_base).
4. DWM (Summary) Layer – Performs multi‑dimensional aggregation per business theme, using Stream SQL for minute‑level PV aggregation and Druid for UV de‑duplication. Table naming: realtime_dwm_{biz}_{domain}_{grain}_{tag}_{period} (e.g., realtime_dwm_trip_trd_pas_bus_accum_1min).
5. APP (Application) Layer – Writes aggregated results to downstream stores (Druid for dashboards, HBase for services, MySQL/Redis for product data). Because this layer is flexible, it has no strict naming constraints.
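The naming patterns above can be captured in a few small helpers. This is a minimal sketch, not Didi's tooling: the function names are hypothetical, and splitting the DWM example into grain "pas_bus" and tag "accum" is an assumption about how that name decomposes.

```python
from typing import Optional


def dwd_table(biz: str, domain: str, process: str, tag: str) -> str:
    """Build a DWD name: realtime_dwd_{biz}_{domain}_{process}_{tag}."""
    return "_".join(["realtime_dwd", biz, domain, process, tag])


def dim_table(biz: str, dimension: str, tag: Optional[str] = None) -> str:
    """Build a DIM name: dim_{biz}_{dimension}[_{tag}] (tag is optional)."""
    parts = ["dim", biz, dimension]
    if tag:
        parts.append(tag)
    return "_".join(parts)


def dwm_table(biz: str, domain: str, grain: str, tag: str, period: str) -> str:
    """Build a DWM name: realtime_dwm_{biz}_{domain}_{grain}_{tag}_{period}."""
    return "_".join(["realtime_dwm", biz, domain, grain, tag, period])


def layer_of(name: str) -> Optional[str]:
    """Infer the layer from a name prefix; None means no rule matched.

    APP-layer names carry no fixed prefix, so they also return None.
    """
    prefixes = [
        ("realtime_dwd_", "DWD"),
        ("realtime_dwm_", "DWM"),
        ("realtime_ods_", "ODS"),
        ("cn-binlog-", "ODS"),
        ("dim_", "DIM"),
    ]
    for prefix, layer in prefixes:
        if name.startswith(prefix):
            return layer
    return None


print(dwd_table("trip", "trd", "order", "base"))
print(dim_table("trip", "dri", "base"))
print(dwm_table("trip", "trd", "pas_bus", "accum", "1min"))
```

Building names from components rather than typing them by hand keeps the layer prefix and field order consistent, and a prefix check like `layer_of` could serve as a lightweight lint step before a table is registered.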
Construction Results
Five major modules (growth, transaction, experience, safety, finance) now power over 40 real‑time dashboards. Data discrepancy between real‑time and offline pipelines is kept below 0.5%, enabling on‑the‑fly coupon strategies, safety monitoring, and order trend analysis. The model also supports rapid metric definition changes and consistency checks, boosting development efficiency.
Strong Dependence on the Data Platform
The real‑time warehouse relies on Didi’s Data Dream Factory platform, which provides StreamSQL, an IDE, task operation tools, and meta‑store capabilities.
StreamSQL Features
Declarative language that abstracts underlying implementation.
Stable API across Flink versions.
Rich DDL covering sources (Kafka, DDMQ) and sinks (Druid, HBase, MySQL).
Built‑in parsers for binlog, business logs, and JSON.
Extensible UDX/UDFs and Hive‑compatible extensions.
Advanced join support: TTL‑based long‑window joins and dimension joins across HBase, KVStore, MySQL.
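To make the two join styles in the last item concrete, here is a minimal Python sketch of the underlying ideas. It is not StreamSQL syntax or Flink's implementation: the class `TtlJoin`, the function `dim_join`, and the sample records are all hypothetical. It assumes events arrive in timestamp order.

```python
from collections import defaultdict


class TtlJoin:
    """Stream-stream join that keeps each side's rows in state for ttl seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        # Per side: join key -> list of (timestamp, row) still held in state.
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def on_event(self, side, key, ts, row):
        """Store the row, then return the (left_row, right_row) pairs it completes."""
        other = "right" if side == "left" else "left"
        # Drop rows on the other side whose TTL has elapsed.
        live = [(t, r) for t, r in self.state[other][key] if ts - t <= self.ttl]
        self.state[other][key] = live
        self.state[side][key].append((ts, row))
        if side == "left":
            return [(row, r) for _, r in live]
        return [(r, row) for _, r in live]


def dim_join(row, dim_table, key_field):
    """Dimension join: enrich a fact row with attributes looked up by key."""
    return {**row, **dim_table.get(row[key_field], {})}


join = TtlJoin(ttl_seconds=3600)
join.on_event("left", "o1", 0, {"order_id": "o1"})
print(join.on_event("right", "o1", 1800, {"paid": 10}))   # within TTL: one match
print(join.on_event("right", "o1", 7200, {"paid": 20}))   # left row expired: no match
print(dim_join({"order_id": "o1", "city_id": 1}, {1: {"city": "Beijing"}}, "city_id"))
```

The TTL bounds how much state the join holds: a longer TTL matches events that arrive far apart but costs more memory, which is the trade-off a long-window join has to manage.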
Development and Operations Support
IDE with SQL templates and UDF libraries.
Online debugging with sample data upload.
Version management and task rollback.
Centralized log collection in Elasticsearch with web UI.
Metric monitoring dashboards for Flink metrics.
Alerting for task failures, latency, checkpoint issues.
Lineage tracing across multi‑stage pipelines.
Challenges and Proposed Solutions
Two key challenges are the lack of a real‑time development standard and the difficulty of keeping real‑time and offline results consistent. To address the first, Didi introduced a white paper covering requirement capture, metric definition, development, deployment, monitoring, and assurance. Consistency is currently ensured through joint validation against offline Hive tables and periodic cross‑checks; in the future, Didi plans to embed consistency checks into the meta‑store.
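A periodic cross‑check between real‑time and offline results might look like the sketch below. The function names, metric names, and sample values are hypothetical; the default threshold of 0.005 mirrors the 0.5% discrepancy bound reported under Construction Results, and the article does not describe how Didi actually implements the check.

```python
def relative_discrepancy(realtime_value, offline_value):
    """Relative difference of the real-time value against the offline baseline."""
    if offline_value == 0:
        return 0.0 if realtime_value == 0 else float("inf")
    return abs(realtime_value - offline_value) / abs(offline_value)


def check_consistency(realtime, offline, threshold=0.005):
    """Return {metric: discrepancy} for every metric that fails the check.

    A metric missing from the real-time side is reported with None.
    """
    failures = {}
    for metric, offline_value in offline.items():
        if metric not in realtime:
            failures[metric] = None
            continue
        discrepancy = relative_discrepancy(realtime[metric], offline_value)
        if discrepancy > threshold:
            failures[metric] = discrepancy
    return failures


# Hypothetical daily totals from the two pipelines.
realtime_totals = {"orders": 1003, "coupon_spend": 520.0}
offline_totals = {"orders": 1000, "coupon_spend": 500.0}
print(check_consistency(realtime_totals, offline_totals))
```

Here `orders` differs by 0.3% and passes, while `coupon_spend` differs by 4% and is flagged. The offline value is treated as the baseline because the offline Hive tables are the reference the real‑time results are validated against.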
Future Outlook – Batch‑Stream Unification
While Flink already supports batch‑stream integration, Didi aims to achieve product‑level unification by consolidating all metadata (Hive tables, Kafka topics, HBase, ES) into a single MetaStore. All engines (Hive, Spark, Presto, Flink) will query the same MetaStore, allowing the same SQL to run as batch or stream based on the source type.
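The idea of choosing batch or stream execution from the source type can be sketched as a lookup against a shared catalog. Everything here is illustrative: the `MetaStore` class, the `execution_mode` function, the mapping of Kafka to streaming and Hive to batch, and the offline table name are assumptions, not Didi's actual MetaStore API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    name: str
    source_type: str  # e.g. "hive" or "kafka" (assumed labels)


# Assumed mapping from source type to execution mode.
MODE_BY_SOURCE = {"kafka": "stream", "hive": "batch"}


class MetaStore:
    """Toy single catalog that every engine would consult."""

    def __init__(self):
        self._tables = {}

    def register(self, meta):
        self._tables[meta.name] = meta

    def lookup(self, name):
        return self._tables[name]


def execution_mode(metastore, table_name):
    """Decide how a query over table_name should run, based on its source type."""
    meta = metastore.lookup(table_name)
    try:
        return MODE_BY_SOURCE[meta.source_type]
    except KeyError:
        raise ValueError(f"no execution mode defined for source type {meta.source_type!r}")


store = MetaStore()
store.register(TableMeta("realtime_dwd_trip_trd_order_base", "kafka"))
store.register(TableMeta("offline_dwd_trip_trd_order_base", "hive"))  # hypothetical name
for table in ("realtime_dwd_trip_trd_order_base", "offline_dwd_trip_trd_order_base"):
    print(table, "->", execution_mode(store, table))
```

With one catalog, the query text stays the same and only the resolved table metadata determines which engine mode runs it.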
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
