
Evolution and Architecture of DiDi Data Channel Service

DiDi’s Data Channel Service evolved from a set of independently maintained components into a unified, SLA‑driven platform with a UI‑based Sync Center and a Flink‑powered StreamSQL engine. Along the way, task creation time dropped from hours to 5‑10 minutes, resource utilization rose from below 30% to above 60%, and automated issue diagnosis is being built for company‑wide real‑time and offline data synchronization.

Didi Tech

DiDi's Data Channel Engine handles data synchronization across the company, providing essential source data for downstream real‑time and offline scenarios. As task volume has grown, the overall architecture of the data channel has evolved with it. This article introduces the development history of the DiDi Data Channel, the problems encountered, and future plans.

Background

Data is a critical asset for any internet company. The big‑data department at DiDi focuses on making better use of data and extracting its value. The Data Channel Service sits at the entry point of the big‑data pipeline and quietly provides timely, complete data to the rest of the company.

Data Channel Overview

The Data Channel Service is a solution that synchronizes data from point A to point B. Heterogeneous data synchronization is a common need across many business lines, which makes the channel a fundamental service. It collects data from sources such as logs and Binlog and delivers it to downstream stores such as Hive, Elasticsearch, and HBase for reporting and operations.

The core workflow can be illustrated by a simplified directed graph where vertices represent storage (disk, message queues, storage services) and edges represent data flow driven by synchronization engines. The basic components include:

Container : The runtime environment where business services generate heterogeneous raw data (logs, Binlog).

Agent : Collects data (e.g., logs, Binlog) and pushes it to a message queue, ensuring at‑least‑once collection by tracking file offsets.

Kafka : Provides decoupling, peak‑shaving, and high‑throughput buffering for multiple downstream consumers.

DSink : Consumes messages from the queue and delivers them to downstream storage, guaranteeing at‑least‑once delivery via offset management.

ES/HDFS : Storage engines where structured data is finally written for business consumption.

ETL : Transforms data written to HDFS into Hive tables for analysis.

Data Warehouse : Stores structured data for downstream systems.

Business Systems : Directly query ES or the data warehouse to provide online or near‑online services.
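The at‑least‑once guarantee described for the Agent and DSink comes down to one ordering rule: write first, and advance the offset only after the write succeeds. The following is a minimal Python sketch of that rule, not DiDi's implementation. `InMemoryQueue`, `FlakySink`, and `deliver` are hypothetical names introduced for illustration; the queue is a plain list standing in for a single Kafka partition, and the sink fails on purpose to show what a retry looks like.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryQueue:
    """Hypothetical stand-in for one Kafka partition: an append-only log."""
    messages: list = field(default_factory=list)
    committed_offset: int = 0  # next offset to read after a restart


class FlakySink:
    """Hypothetical downstream store that fails on chosen write attempts."""

    def __init__(self, fail_on_attempts):
        self.fail_on_attempts = set(fail_on_attempts)
        self.attempts = 0
        self.rows = []

    def write(self, batch):
        self.attempts += 1
        if self.attempts in self.fail_on_attempts:
            # Simulate a partial write: the first record lands, then the sink errors.
            self.rows.extend(batch[:1])
            raise IOError("sink unavailable")
        self.rows.extend(batch)


def deliver(queue, sink, batch_size=2, max_retries=3):
    """Write each batch first; commit the offset only after the write succeeds."""
    while queue.committed_offset < len(queue.messages):
        start = queue.committed_offset
        batch = queue.messages[start:start + batch_size]
        for _ in range(max_retries):
            try:
                sink.write(batch)
                break
            except IOError:
                continue  # offset not advanced, so the same batch is retried
        else:
            raise RuntimeError("batch could not be delivered")
        queue.committed_offset = start + len(batch)  # commit after success


queue = InMemoryQueue(messages=["a", "b", "c", "d", "e"])
sink = FlakySink(fail_on_attempts={2})
deliver(queue, sink)
print(sink.rows)
```

In this run every message reaches the sink, but the record that was partially written before the failure appears twice. That is the trade‑off of at‑least‑once delivery: nothing is lost, but downstream writes need to be idempotent or deduplicated.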

Evolution of the Data Channel Service

The service has gone through several stages:

Component Platformization : Initially, each component was independently maintained. As task numbers grew, a unified platform was needed to manage, create, modify, and delete tasks centrally.

Service Orientation : The focus shifted to SLA commitments and at‑least‑once guarantees across the pipeline. Data integrity and timeliness became key metrics, requiring both component‑level guarantees and a data‑quality center.

Productization : To improve user experience, a unified configuration portal (Sync Center) was built, allowing users to create tasks via a UI. The portal automatically translates configurations into engine‑level tasks, reducing task‑creation time from hours to 5‑10 minutes.

Engine Upgrade – Flink (StreamSQL) : Delivery components were migrated to Flink, enabling fine‑grained resource control in 1C1G units (1 CPU core, 1 GB of memory) and unified SQL‑based task templates. StreamSQL’s UDF support lets users plug in custom parsing logic while still writing to ES/HDFS, which reduces development effort and widens the range of use cases the service can cover. Resource utilization improved from below 30% to above 60%, and the number of physical machines required dropped from about 400 to about 250.

Intelligence (in progress) : An automated diagnosis and remediation system (LogX) is being built to handle about 80% of daily engine issues automatically, covering resource allocation, problem diagnosis, full‑link lineage, and data governance.
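The Productization and Engine Upgrade stages both rely on turning a user's form input into an engine‑level task built from a SQL template. A rough Python sketch of that translation step follows. Everything in it is an assumption made for illustration: the config keys, the table names, and the `render_sync_task` helper are hypothetical, and the output is a generic `INSERT INTO ... SELECT` statement rather than actual StreamSQL syntax, which the article does not show.

```python
def render_sync_task(config):
    """Render a hypothetical Sync Center form into an INSERT ... SELECT statement."""
    required = ("source_table", "sink_table", "fields")
    missing = [key for key in required if not config.get(key)]
    if missing:
        # Reject incomplete forms before anything reaches the engine.
        raise ValueError(f"missing config keys: {missing}")

    columns = ", ".join(config["fields"])
    sql = (
        f"INSERT INTO {config['sink_table']} "
        f"SELECT {columns} FROM {config['source_table']}"
    )
    # The filter is optional; it is appended only when the user supplied one.
    if config.get("filter"):
        sql += f" WHERE {config['filter']}"
    return sql


task = {
    "source_table": "kafka_order_log",  # hypothetical source name
    "sink_table": "es_order_index",     # hypothetical sink name
    "fields": ["order_id", "city_id", "event_time"],
    "filter": "city_id IS NOT NULL",
}
print(render_sync_task(task))
```

A template like this keeps users away from engine details, which is consistent with the drop in task‑creation time the article reports. A real portal would also need to validate identifiers and filter expressions rather than interpolate them directly, as this sketch does.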

Summary

The Data Channel Service underpins the entire company's data synchronization. Most offline tasks receive their source data from this service, making it the main data artery of DiDi. Over the years, the service has become more stable, reliable, and efficient, and it continues to provide data collection and delivery for the whole organization.
