
Design and Implementation of a Real-Time Data Transmission Platform Based on Apache Flink at AutoHome

This article presents the background, requirements, architectural design, component interaction, and implementation details of AutoHome's real-time data transmission platform built on Apache Flink, highlighting its high availability, exactly-once semantics in normal operation, scalability, DDL handling, and integration with existing streaming services.

DataFunTalk

Data ingestion and transmission serve as a critical bridge between data systems and business systems, directly affecting service level agreements and system stability.

AutoHome faced challenges such as manual task management, resource waste, DDL handling, component dependencies, and technical debt, prompting the development of a new data transmission and distribution system.

Background and Requirements

Lack of effective task and information management, relying on manual operations.

Resource waste and lack of elasticity in ingestion programs.

Inadequate handling of DDL changes, requiring manual intervention.

Multiple dependent components (Zookeeper, Redis, etc.).

Accumulated technical debt increasing maintenance cost.

The new system aims to provide friendly operations management, high availability with fault recovery, multi-region active-active support, strong data accuracy (at-least-once semantics at minimum), elastic scaling, comprehensive monitoring, and full protection against DDL changes.

Technical Selection and Design – Why Flink?

Three options were considered: fully custom development, customizing open‑source components (Maxwell/Canal/Debezium), and building on Flink. Flink was chosen for its low development cost, built‑in HA, fault‑recovery, and strong streaming capabilities.

Design Goals

Operations‑friendly architecture with high availability and fault‑recovery.

Strong data accuracy, guaranteeing at‑least‑once semantics.

Elastic scaling (horizontal and vertical).

Comprehensive monitoring and alerting, supporting metadata management.

Real‑time computation friendliness.

Complete defense against DDL changes.

The system must meet latency and throughput requirements for all regular business states.

Solution Design and Comparison

| Item | Fully Custom | Open-Source Custom | Flink-Based |
| --- | --- | --- | --- |
| High Availability | Supported / high cost | Possible / medium cost | Supported / low cost |
| Fault Recovery | Supported / medium cost | Possible / unclear cost | Supported / low cost |
| Horizontal Scaling | Possible / high cost | Possible / unclear cost | Supported / low cost |
| Vertical Scaling | Supported / high cost | Possible / unclear cost | Supported / low cost |
| Data Accuracy | At-least-once / medium cost | Exactly-once / low cost | Exactly-once in normal state, at-least-once in disaster recovery / medium cost |
| Multi-Region Active-Active | Supported / high cost | Possible / high cost | Supported / low cost |
| Monitoring & Alerting | Good / medium cost | Possible / high cost | Good / low cost |
| Real-Time Friendliness | Medium / medium cost | Poor / high cost | Good / low cost |

After discussion, the team unanimously decided to develop the new transmission platform based on Flink.

Implementation Highlights

Flink DataStream API provides a natural model for data transmission.

Flink offers built‑in consistency, HA, stability, and flow control, reducing development complexity.

Flink’s native horizontal and vertical scaling allows on‑demand resource allocation.

The management module reuses AutoHome’s existing Flink platform components for monitoring, lifecycle management, multi‑region active‑active, and self‑service operations.

The MVP was completed in less than three weeks, meeting performance and functional expectations.

System Architecture

The platform consists of three logical layers: Data Transmission Program, Task Information Management Module, and Runtime Execution Module.

Data Transmission Program: a fixed Flink JAR plus SQL tasks generated by the Flink SQL Codegen Service.

Management Module: A microservice communicating with Flink components to handle task and information management.

Runtime Module: Directly depends on the Flink cluster.

Component Interaction

Key components include AutoDTS (task management), AutoStream Core (Flink core), Jar Service (SDK JAR storage), Metastore (metadata), and Flink Client (RESTful job submission to YARN/K8S).

AutoDTS interacts with users for task lifecycle operations, forwards task info to Flink, and creates corresponding Flink tables in Metastore.

For each transmission task, AutoDTS delegates the task parameters and SQL logic to AutoStream Core, loads the SDK JARs from the Jar Service, and submits the job via the Flink Client. The SQL Codegen Service translates the parameters into an executable Flink SQL task.
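The codegen step above can be sketched in miniature: given a distribution task's parameters, render the Flink SQL statement that moves data from source to sink. This is a minimal illustration only; the parameter names (`source_table`, `sink_table`, `fields`) are hypothetical, not AutoDTS's actual schema.

```python
# Minimal sketch of a SQL codegen step for a distribution task.
# Parameter names here are invented for illustration.

def generate_flink_sql(task: dict) -> str:
    """Render an INSERT INTO ... SELECT for a distribution task."""
    columns = ", ".join(f"`{c}`" for c in task["fields"])
    return (
        f"INSERT INTO `{task['sink_table']}` "
        f"SELECT {columns} FROM `{task['source_table']}`"
    )

task = {
    "source_table": "kafka_changelog_user",
    "sink_table": "iceberg_user",
    "fields": ["id", "name", "updated_at"],
}
print(generate_flink_sql(task))
```

Generating SQL from declarative parameters keeps the transmission JAR fixed while each task varies only in its generated statement.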

Transmission Task Types

Two task types exist: Ingestion tasks (capture changelog streams from sources and write to Kafka) and Distribution tasks (read from Kafka and write to target storage).
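The split between the two task types can be made concrete with a pair of task descriptors: the ingestion task's Kafka sink topic is exactly the distribution task's source topic, which is what lets one shared ingestion feed many distribution consumers. The keys and values below are invented for illustration.

```python
# Hedged illustration of the two task shapes; field names are hypothetical.

ingestion_task = {
    "type": "ingestion",
    "source": {"kind": "mysql-binlog", "database": "autohome_user"},
    "sink": {"kind": "kafka", "topic": "changelog.autohome_user"},
}

distribution_task = {
    "type": "distribution",
    "source": {"kind": "kafka", "topic": "changelog.autohome_user"},
    "sink": {"kind": "iceberg", "table": "ods.user"},
}

# The ingestion sink topic is the distribution source topic: ingestion
# tasks are shared assets, distribution tasks fan the stream out.
assert ingestion_task["sink"]["topic"] == distribution_task["source"]["topic"]
print("linked via topic:", ingestion_task["sink"]["topic"])
```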

Binlog Ingestion SDK

The Binlog SDK follows the classic Source → Transformation → Sink stages.

Binlog Source

The source is implemented with a BinaryLogClient that continuously fetches binlog events and converts them minimally before emitting them downstream. Performance and correctness considerations include running the source with a parallelism of one, placing a bounded blocking queue between the client and the source function, and using TransactionEnd events to ensure transaction completeness.

Checkpoint integration uses Flink's checkpoint lock to guarantee that only complete MySQL transactions are emitted between checkpoints.
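The source pattern described above can be sketched with plain threads and a bounded queue: a client thread (standing in for BinaryLogClient) pushes events and blocks when the queue is full, while the source loop buffers events until a TransactionEnd marker so only whole transactions are emitted. This is a simplified stand-in, not Flink code; the event shapes are invented.

```python
import queue
import threading

# Simplified stand-in for the binlog events a BinaryLogClient would produce.
EVENTS = [
    {"type": "insert", "xid": 1, "row": {"id": 1}},
    {"type": "update", "xid": 1, "row": {"id": 1}},
    {"type": "TransactionEnd", "xid": 1},
    {"type": "insert", "xid": 2, "row": {"id": 2}},
    {"type": "TransactionEnd", "xid": 2},
]

def binlog_client(q: queue.Queue) -> None:
    """Producer: fetch binlog events; blocks (back-pressure) when full."""
    for event in EVENTS:
        q.put(event)
    q.put(None)  # end-of-stream sentinel for this sketch

def source_loop(q: queue.Queue, emit) -> None:
    """Consumer: emit downstream only at transaction boundaries."""
    txn_buffer = []
    while (event := q.get()) is not None:
        if event["type"] == "TransactionEnd":
            # In the real source this emission happens while holding
            # Flink's checkpoint lock, so a checkpoint can never split
            # a MySQL transaction across two checkpoints.
            emit(list(txn_buffer))
            txn_buffer.clear()
        else:
            txn_buffer.append(event)

q = queue.Queue(maxsize=2)  # bounded queue between client and source
emitted = []
client = threading.Thread(target=binlog_client, args=(q,))
client.start()
source_loop(q, emitted.append)
client.join()
print([len(txn) for txn in emitted])  # → [2, 1]: one entry per transaction
```

The bounded queue gives natural back-pressure between the binlog client and the single-parallelism source function, while buffering to TransactionEnd preserves transaction atomicity.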

UnifiedFormatTransform

This stage converts data into a unified format and includes DDL handling by embedding an Antlr4 MySQL parser, updating schema state, and emitting DDL events, thereby defending against DDL‑induced failures.
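A conceptual sketch of that transform stage: DML events are rewritten into a unified format by pairing row values with the current schema, while DDL events update the schema state and are forwarded as events instead of crashing the pipeline. A trivial regex stands in here for the Antlr4 MySQL parser the real system embeds; all event shapes are invented.

```python
import re

# In-memory schema state; the real system keeps this in Flink state.
schema_state = {"user": ["id", "name"]}

# Toy stand-in for the Antlr4 MySQL DDL parser: one statement form only.
ADD_COLUMN = re.compile(
    r"ALTER TABLE `?(\w+)`? ADD COLUMN `?(\w+)`?", re.IGNORECASE
)

def transform(event: dict) -> dict:
    """Convert a raw event into the unified format, handling DDL."""
    if event["type"] == "ddl":
        match = ADD_COLUMN.match(event["sql"])
        if match:
            table, column = match.group(1), match.group(2)
            schema_state[table].append(column)  # keep schema in sync
        return {"kind": "ddl", "sql": event["sql"]}
    # DML: pair row values with the current schema's column names.
    columns = schema_state[event["table"]]
    return {"kind": "dml", "table": event["table"],
            "row": dict(zip(columns, event["values"]))}

print(transform({"type": "dml", "table": "user", "values": [1, "a"]}))
print(transform({"type": "ddl",
                 "sql": "ALTER TABLE `user` ADD COLUMN `age` INT"}))
print(transform({"type": "dml", "table": "user", "values": [2, "b", 30]}))
```

Because the schema state is updated in-line, rows arriving after the ALTER are decoded with the new column list rather than failing, which is the essence of the DDL defense.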

Kafka Sink

Leverages Flink's native exactly-once guarantees to write transformed data to Kafka; downstream consumers can optionally read with transactional (read-committed) isolation for stronger transaction semantics.
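The transactional-sink idea behind exactly-once can be illustrated with a toy model: writes accumulate in an open "transaction" that only becomes visible when a checkpoint completes, so a read-committed consumer never observes records from a failed epoch. This is a deliberate simplification of the two-phase-commit protocol Flink's real Kafka sink implements; the class below is hypothetical.

```python
# Toy model of a transactional sink; not Flink's actual sink API.

class ToyTransactionalSink:
    def __init__(self):
        self.pending = []    # records in the open transaction
        self.committed = []  # what a read-committed consumer can see

    def write(self, record):
        self.pending.append(record)

    def on_checkpoint_complete(self):
        # Commit: pending records become atomically visible.
        self.committed.extend(self.pending)
        self.pending = []

    def on_failure(self):
        # Abort: uncommitted records are never observed downstream.
        self.pending = []

sink = ToyTransactionalSink()
sink.write("a")
sink.write("b")
sink.on_checkpoint_complete()  # "a", "b" become visible
sink.write("c")
sink.on_failure()              # "c" is aborted
print(sink.committed)          # → ['a', 'b']
```

Tying commits to checkpoint completion is what aligns the sink's visibility with Flink's recovery point, yielding end-to-end exactly-once in the normal state.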

Additional Optimizations

GTID support with one‑click master‑slave switching.

Periodic backup of runtime information to external storage.

Comprehensive monitoring metrics for binlog synchronization tasks.

Platform Usage

Users configure the necessary parameters to create and run transmission tasks. Ingestion tasks become shared assets; users can query them or request ingestion of new tables. Distribution tasks (e.g., writing to Iceberg) involve selecting source fields, configuring resources, and launching the job, which yields a unique Flink task ID.

The platform provides rich monitoring, metadata queries, and tight integration with the existing real‑time computation platform.

Conclusion and Outlook

The Flink‑based transmission platform proved to be a cost‑effective solution that resolved previous issues, improved data access efficiency, and enhanced technical capabilities. Future work includes leveraging new Flink features (FLIP‑27 Source, OperatorCoordinator, Upsert‑Kafka) and expanding data lake integration.


Author

Liu Shouwei, a graduate of Dalian University of Technology, Apache Flink contributor, Scala/Akka enthusiast, joined AutoHome in 2019 to develop and maintain the real‑time computation and data transmission platforms.
