Design and Implementation of a Real-Time Data Transmission Platform Based on Apache Flink at AutoHome
This article presents the background, requirements, architectural design, component interaction, and implementation details of AutoHome's real‑time data transmission platform built on Apache Flink, highlighting its high availability, exactly‑once semantics, scalability, DDL handling, and integration with existing streaming services.
Data ingestion and transmission serve as a critical bridge between data systems and business systems, directly affecting service level agreements and system stability.
AutoHome faced challenges such as manual task management, resource waste, DDL handling, component dependencies, and technical debt, prompting the development of a new data transmission and distribution system.
Background and Requirements
The legacy ingestion pipeline suffered from several pain points:
Lack of effective task and information management, relying on manual operations.
Resource waste and lack of elasticity in ingestion programs.
Inadequate handling of DDL changes, requiring manual intervention.
Multiple dependent components (Zookeeper, Redis, etc.).
Accumulated technical debt increasing maintenance cost.
The new system aims to provide friendly operations management, high availability with fault recovery, multi‑region active‑active support, strong data accuracy (at‑least‑once semantics at a minimum), elastic scaling, comprehensive monitoring, and full protection against DDL changes.
Technical Selection and Design – Why Flink?
Three options were considered: fully custom development, customizing open‑source components (Maxwell/Canal/Debezium), and building on Flink. Flink was chosen for its low development cost, built‑in HA, fault‑recovery, and strong streaming capabilities.
Design Goals
Operations‑friendly architecture with high availability and fault‑recovery.
Strong data accuracy, guaranteeing at‑least‑once semantics at a minimum (exactly‑once during normal operation).
Elastic scaling (horizontal and vertical).
Comprehensive monitoring and alerting, supporting metadata management.
Real‑time computation friendliness.
Complete defense against DDL changes.
The system must meet latency and throughput requirements for all regular business states.
Solution Design and Comparison
| Item | Fully Custom | Open‑Source Custom | Flink‑Based |
| --- | --- | --- | --- |
| High Availability | Support / high cost | Possible / medium cost | Support / low cost |
| Fault Recovery | Support / medium cost | Possible / unclear | Support / low cost |
| Horizontal Scaling | Possible / high cost | Possible / unclear | Support / low cost |
| Vertical Scaling | Support / high cost | Possible / unclear | Support / low cost |
| Data Accuracy | At‑least‑once / medium cost | Exactly‑once / low cost | Exactly‑once in normal state, at‑least‑once in disaster recovery / medium cost |
| Multi‑Region Active‑Active | Support / high cost | Possible / high cost | Support / low cost |
| Monitoring & Alerting | Good / medium cost | Possible / high cost | Good / low cost |
| Real‑time Friendly | Medium / medium cost | Poor / high cost | Good / low cost |
After discussion, the team unanimously decided to develop the new transmission platform based on Flink.
Implementation Highlights
Flink DataStream API provides a natural model for data transmission.
Flink offers built‑in consistency, HA, stability, and flow control, reducing development complexity.
Flink’s native horizontal and vertical scaling allows on‑demand resource allocation.
The management module reuses AutoHome’s existing Flink platform components for monitoring, lifecycle management, multi‑region active‑active, and self‑service operations.
The MVP was completed in less than three weeks, meeting performance and functional expectations.
System Architecture
The platform consists of three logical layers: Data Transmission Program, Task Information Management Module, and Runtime Execution Module.
Data Transmission Program: a fixed Flink JAR plus SQL tasks generated by the Flink SQL Codegen Service.
Management Module: A microservice communicating with Flink components to handle task and information management.
Runtime Module: Directly depends on the Flink cluster.
Component Interaction
Key components include AutoDTS (task management), AutoStream Core (Flink core), Jar Service (SDK JAR storage), Metastore (metadata), and Flink Client (RESTful job submission to YARN/K8S).
AutoDTS interacts with users for task lifecycle operations, forwards task info to Flink, and creates corresponding Flink tables in Metastore.
For each transmission task, AutoDTS delegates parameters and SQL logic to Core, loads SDK JARs from Jar Service, and submits jobs via the client. SQL Codegen Service translates parameters into executable Flink SQL tasks.
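The codegen step can be pictured as a small template expansion. The sketch below is a hypothetical, simplified stand-in for the SQL Codegen Service: it turns a distribution task's parameters (source table, sink table, selected fields) into a Flink SQL INSERT statement. The class name, method name, and statement shape are invented for illustration and are not AutoHome's actual service.

```java
import java.util.List;

/**
 * Hypothetical sketch of the SQL Codegen step: turning user-supplied task
 * parameters into an executable Flink SQL statement. The real service also
 * handles connector options, type mapping, and task metadata.
 */
public class SqlCodegen {
    /** Builds an INSERT INTO ... SELECT ... statement for a distribution task. */
    public static String generateInsert(String sourceTable, String sinkTable, List<String> fields) {
        String projection = String.join(", ", fields);
        return "INSERT INTO " + sinkTable + " SELECT " + projection + " FROM " + sourceTable;
    }

    public static void main(String[] args) {
        // Example: distribute three fields of a changelog topic into an Iceberg table.
        System.out.println(generateInsert("kafka_changelog_user", "iceberg_user",
                List.of("id", "name", "update_time")));
    }
}
```

The generated statement is what the platform ultimately submits to the Flink cluster on the user's behalf.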
Transmission Task Types
Two task types exist: Ingestion tasks (capture changelog streams from sources and write to Kafka) and Distribution tasks (read from Kafka and write to target storage).
Binlog Ingestion SDK
The Binlog SDK follows the classic Source → Transformation → Sink stages.
Binlog Source
Implemented by creating a BinaryLogClient that continuously fetches binlog events and applies minimal conversion before emitting them downstream. Performance and correctness considerations include running the source with a parallelism of one, placing a bounded blocking queue between the client and the source function, and guaranteeing transaction completeness by buffering events until a TransactionEnd event arrives.
Checkpoint integration uses the source's checkpoint lock (obtained via Flink's SourceContext#getCheckpointLock) to guarantee that only complete MySQL transactions are emitted between checkpoints.
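The source pattern described above — single parallelism, a bounded queue for back-pressure, transaction buffering, and emission under the checkpoint lock — can be simulated with plain Java, without Flink on the classpath. Everything here (class names, the string event encoding, the queue size) is an illustrative assumption, not AutoHome's actual SDK code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Simplified simulation of the binlog source pattern: a client thread pushes
 * events into a bounded blocking queue (back-pressure when full), and the
 * single-parallelism source drains it, buffering rows until a TransactionEnd
 * marker so only complete transactions are emitted under the checkpoint lock.
 */
public class BinlogSourceSketch {
    static final String TX_END = "TransactionEnd";

    final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // bounded: blocks the client when full
    final Object checkpointLock = new Object(); // stands in for SourceContext#getCheckpointLock()
    final List<String> buffer = new ArrayList<>();  // rows of the in-flight transaction
    final List<String> emitted = new ArrayList<>(); // stands in for the downstream collector

    /** Called by the (simulated) BinaryLogClient thread. */
    void onEvent(String event) {
        try {
            queue.put(event); // blocks when the queue is full -> natural back-pressure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    /** One iteration of the source loop: drain an event, emit only on TransactionEnd. */
    void pollOnce() {
        try {
            String event = queue.take();
            if (TX_END.equals(event)) {
                // Emit the whole transaction atomically with respect to checkpoints.
                synchronized (checkpointLock) {
                    emitted.addAll(buffer);
                }
                buffer.clear();
            } else {
                buffer.add(event); // hold back until the transaction is complete
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        BinlogSourceSketch source = new BinlogSourceSketch();
        for (String e : new String[]{"insert:1", "update:1", TX_END}) source.onEvent(e);
        for (int i = 0; i < 3; i++) source.pollOnce();
        System.out.println(source.emitted); // the full transaction, emitted together
    }
}
```

Because a checkpoint barrier can only be processed while the same lock is held, a checkpoint never splits a MySQL transaction across two checkpoints.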
UnifiedFormatTransform
This stage converts data into a unified format and includes DDL handling by embedding an Antlr4 MySQL parser, updating schema state, and emitting DDL events, thereby defending against DDL‑induced failures.
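A minimal sketch of that idea follows, with a trivial regex standing in for the embedded Antlr4 MySQL parser (which of course covers far more of the grammar): a DDL event updates the schema state and is re-emitted as a DDL record instead of crashing the job, while DML rows pass through in the unified format. The class name, JSON envelope, and supported statement are all illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of the UnifiedFormatTransform idea: keep a schema map as state and,
 * when a DDL event arrives, update the schema and emit a DDL record rather
 * than failing. A regex stands in for the embedded Antlr4 MySQL parser.
 */
public class DdlAwareTransform {
    // table -> (column -> type); in Flink this would live in operator state
    final Map<String, Map<String, String>> schemas = new LinkedHashMap<>();

    static final Pattern ADD_COLUMN =
        Pattern.compile("ALTER TABLE (\\w+) ADD COLUMN (\\w+) (\\w+)", Pattern.CASE_INSENSITIVE);

    /** Returns a unified record: either a DDL event or a passed-through DML row. */
    public String apply(String event) {
        Matcher m = ADD_COLUMN.matcher(event);
        if (m.matches()) {
            schemas.computeIfAbsent(m.group(1), t -> new LinkedHashMap<>())
                   .put(m.group(2), m.group(3));
            return "{\"type\":\"DDL\",\"table\":\"" + m.group(1) + "\"}";
        }
        return "{\"type\":\"DML\",\"data\":\"" + event + "\"}";
    }

    public static void main(String[] args) {
        DdlAwareTransform t = new DdlAwareTransform();
        System.out.println(t.apply("ALTER TABLE user ADD COLUMN age INT"));
        System.out.println(t.apply("id=1,name=foo"));
    }
}
```

Downstream consumers can then react to the DDL record (for example, evolving the target table schema) instead of the whole pipeline failing on an unexpected column.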
Kafka Sink
Leverages Flink's native exactly‑once Kafka sink to write transformed data transactionally; downstream consumers can optionally enable transactional (read‑committed) consumption for strong end‑to‑end transaction semantics.
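The exactly-once path ultimately rests on Kafka transactions. As a hedged illustration, these are the standard Kafka client settings involved: the producer side needs idempotence and a stable transactional id, and consumers opt into read_committed isolation so aborted writes are never observed. In a real Flink job the Kafka connector manages the producer transactions itself; this sketch only shows the underlying configuration keys, and the example values are assumptions.

```java
import java.util.Properties;

/**
 * Illustrative Kafka client settings behind an exactly-once pipeline.
 * Property keys are standard Kafka client configs; values are examples.
 */
public class KafkaTxConfig {
    public static Properties producerProps(String transactionalId) {
        Properties p = new Properties();
        p.put("enable.idempotence", "true");        // required for transactional writes
        p.put("transactional.id", transactionalId); // stable id per sink instance
        p.put("acks", "all");                       // wait for full ISR acknowledgement
        return p;
    }

    public static Properties consumerProps() {
        Properties p = new Properties();
        p.put("isolation.level", "read_committed"); // skip records from aborted transactions
        return p;
    }

    public static void main(String[] args) {
        System.out.println(producerProps("dts-sink-0"));
        System.out.println(consumerProps());
    }
}
```

Consumers that keep the default read_uncommitted isolation still get at-least-once behavior, which matches the disaster-recovery guarantee described earlier.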
Additional Optimizations
GTID support with one‑click master‑slave switching.
Periodic backup of runtime information to external storage.
Comprehensive monitoring metrics for binlog synchronization tasks.
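The runtime-information backup can be as simple as periodically snapshotting the current GTID position to external storage, so a task can be rebuilt even if Flink state is lost. The sketch below uses a local file and a single-threaded scheduler purely for illustration; the storage target, the period, and all names are assumptions rather than the platform's actual design.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of "periodic backup of runtime information": a scheduled task
 * snapshots the current GTID position to external storage (a local file
 * here; a real system would use a remote store).
 */
public class PositionBackup {
    private volatile String gtidPosition = "";

    public void updatePosition(String gtid) { this.gtidPosition = gtid; }

    public String currentPosition() { return gtidPosition; }

    /** Writes the current position; invoked on a fixed schedule. */
    public void backupTo(Path file) throws IOException {
        Files.writeString(file, gtidPosition, StandardCharsets.UTF_8);
    }

    /** Starts a background schedule, e.g. every 30 seconds. */
    public ScheduledExecutorService start(Path file, long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            try { backupTo(file); } catch (IOException e) { /* log and alert in real code */ }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

Combined with GTID support, such a backup is what makes one-click master-slave switching safe: the task can resume from a globally valid position on the new master.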
Platform Usage
Users configure necessary parameters to create and run transmission tasks. Ingestion tasks become shared assets; users can query or request new table ingestion. Distribution tasks (e.g., Iceberg) involve selecting source fields, configuring resources, and launching the job, which yields a unique Flink task ID.
The platform provides rich monitoring, metadata queries, and tight integration with the existing real‑time computation platform.
Conclusion and Outlook
The Flink‑based transmission platform proved to be a cost‑effective solution that resolved previous issues, improved data access efficiency, and enhanced technical capabilities. Future work includes leveraging new Flink features (FLIP‑27 Source, OperatorCoordinator, Upsert‑Kafka) and expanding data lake integration.
References
[1] https://github.com/pingcap/ticdc/pull/804
[2] https://github.com/zendesk/maxwell
Author
Liu Shouwei, a graduate of Dalian University of Technology, Apache Flink contributor, Scala/Akka enthusiast, joined AutoHome in 2019 to develop and maintain the real‑time computation and data transmission platforms.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.