Using FlinkX for Data Synchronization in Sharded MySQL Environments
This article explains how to use FlinkX and the Flink Stream API to build a single, unified data-sync task that extracts data from sharded MySQL tables, splits the workload, and pushes it to an MQ cluster, while detailing the underlying InputFormat and Reader architecture.
1. Scenario Description
The example shows an order system that has been partitioned into multiple databases and tables (four databases, eight tables). The requirement is to create a single task that synchronizes data to an MQ cluster instead of creating separate tasks for each database instance, since the table structures and mapping rules are identical.
2. FlinkX Solution Details
2.1 Flink Stream API Development Process
The general steps for programming with Flink Stream API are illustrated in the diagram below.
Note: Detailed Stream API content will be covered in future articles; this article focuses on InputFormatSourceFunction and data source splitting.
2.2 FlinkX Reader (Data Source) Core Class Diagram
The core class hierarchy of FlinkX Readers is shown below, with BaseDataReader as the base class.
Key classes include:
InputFormat: Flink's core API for splitting and reading input data. Key methods: configure, getStatistics, createInputSplits, and getInputSplitAssigner.
void open(T split): opens a data channel for the given split; the article shows JDBC and Elasticsearch implementations as examples.
boolean reachedEnd(): indicates whether a bounded data source has been exhausted.
OT nextRecord(OT reuse): retrieves the next record from the channel.
void close(): closes the source.
InputSplit: the root interface for data partitions, providing int getSplitNumber().
Implementation example: GenericInputSplit, whose fields partitionNumber and totalNumberOfPartitions make it useful for modulo-based splitting of large tables.
Other related types: SourceFunction, RichFunction, ParallelSourceFunction, RichParallelSourceFunction, InputFormatSourceFunction, and FlinkX's BaseDataReader.
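To make the GenericInputSplit idea concrete, here is a minimal, self-contained sketch of modulo-based split ownership. The class and method names (SimpleSplit, owns) are stand-ins invented for illustration; the real class is Flink's org.apache.flink.core.io.GenericInputSplit, and this mimic only demonstrates the partitioning arithmetic, not Flink's implementation.

```java
// Self-contained sketch: a GenericInputSplit-style split that claims the
// rows whose id satisfies id % totalNumberOfPartitions == partitionNumber.
public class ModuloSplitSketch {

    /** Simplified stand-in for Flink's GenericInputSplit (hypothetical name). */
    static final class SimpleSplit {
        final int partitionNumber;         // index of this split
        final int totalNumberOfPartitions; // total number of splits

        SimpleSplit(int partitionNumber, int totalNumberOfPartitions) {
            this.partitionNumber = partitionNumber;
            this.totalNumberOfPartitions = totalNumberOfPartitions;
        }

        /** A row belongs to this split iff id % total == partitionNumber. */
        boolean owns(long id) {
            return id % totalNumberOfPartitions == partitionNumber;
        }
    }

    public static void main(String[] args) {
        SimpleSplit s0 = new SimpleSplit(0, 4);
        SimpleSplit s1 = new SimpleSplit(1, 4);
        // ids 0,4,8,... land in partition 0; ids 1,5,9,... in partition 1, etc.
        System.out.println(s0.owns(8)); // true  (8 % 4 == 0)
        System.out.println(s1.owns(8)); // false
        System.out.println(s1.owns(9)); // true  (9 % 4 == 1)
    }
}
```

Because every id maps to exactly one partition, the splits are disjoint and together cover the whole table.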
2.3 Building a DataStream with a FlinkX Reader
With the class diagram in mind, the article walks through the read flow of DistributedJdbcDataReader (a subclass of BaseDataReader): it creates an InputFormat, wraps it in a corresponding SourceFunction, and finally adds that source to the StreamExecutionEnvironment to obtain a DataStreamSource.
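The per-split loop that the source function drives can be sketched without any Flink dependency. The interface and class names below (MiniInputFormat, ListFormat, runSplit) are hypothetical stand-ins: the sketch only illustrates the open → nextRecord-until-reachedEnd → close lifecycle described above, with an in-memory list playing the role of a JDBC result set.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the loop a SourceFunction runs for each assigned split:
// open(split), then nextRecord() until reachedEnd(), then close().
public class ReadLoopSketch {

    /** Trimmed-down stand-in for Flink's InputFormat contract. */
    interface MiniInputFormat<OT, T> {
        void open(T split);
        boolean reachedEnd();
        OT nextRecord(OT reuse);
        void close();
    }

    /** Reads a fixed in-memory slice, standing in for a JDBC result set. */
    static final class ListFormat implements MiniInputFormat<String, List<String>> {
        private List<String> rows;
        private int cursor;

        @Override public void open(List<String> split) { rows = split; cursor = 0; }
        @Override public boolean reachedEnd() { return cursor >= rows.size(); }
        @Override public String nextRecord(String reuse) { return rows.get(cursor++); }
        @Override public void close() { rows = null; }
    }

    /** The driving loop executed per split. */
    static List<String> runSplit(MiniInputFormat<String, List<String>> format,
                                 List<String> split) {
        List<String> out = new ArrayList<>();
        format.open(split);
        while (!format.reachedEnd()) {
            out.add(format.nextRecord(null));
        }
        format.close();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runSplit(new ListFormat(), List.of("r1", "r2", "r3")));
        // Prints: [r1, r2, r3]
    }
}
```

In real Flink, this loop lives inside InputFormatSourceFunction, which requests splits from the input split assigner one at a time.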
2.4 FlinkX Solution for Sharded Database Task Splitting
Given the scenario of a four‑database, eight‑table order system, performance can be improved by:
Splitting by database and table, resulting in eight independent tasks.
Further splitting each table's data, e.g., using id % totalNumberOfPartitions = partitionNumber for modulo‑based distribution.
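The modulo strategy above amounts to generating one WHERE predicate per partition. The helper below is a hypothetical sketch (not FlinkX source code; buildPartitionFilters is an invented name) showing how such disjoint, covering predicates can be produced from a split key and a partition count.

```java
// Hypothetical helper: partition p reads the rows where
// splitKey % totalNumberOfPartitions == p, so the resulting
// queries are disjoint and together cover the whole table.
public class SplitSqlSketch {

    static String[] buildPartitionFilters(String splitKey, int totalNumberOfPartitions) {
        String[] filters = new String[totalNumberOfPartitions];
        for (int p = 0; p < totalNumberOfPartitions; p++) {
            // %% escapes the literal '%' (SQL modulo) in String.format
            filters[p] = String.format("%s %% %d = %d", splitKey, totalNumberOfPartitions, p);
        }
        return filters;
    }

    public static void main(String[] args) {
        for (String f : buildPartitionFilters("id", 4)) {
            System.out.println("SELECT * FROM t_order WHERE " + f);
        }
        // Prints four disjoint queries: ... WHERE id % 4 = 0, ... WHERE id % 4 = 1, etc.
    }
}
```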
FlinkX follows this strategy. The workflow is illustrated below.
Step 1: Split by database instance and table, forming a DataSource list.
Step 2: Implement the actual split logic inside DistributedJdbcInputFormat#createInputSplitsInternal .
Step 3: If a splitKey is specified, generate SQL WHERE clauses such as splitKey % totalNumberOfPartitions = partitionNumber to achieve parallel extraction.
Step 4: If no table‑level split key is provided, the algorithm splits the sourceList itself, distributing tables among partitions.
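The Step 4 fallback can be sketched as a round-robin assignment of whole tables to subtasks. This is an illustrative sketch under that assumption, not FlinkX's actual implementation; the names distribute and sourceList entries are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the no-splitKey fallback: the list of (database, table)
// sources is itself divided among partitions, round-robin, so each
// subtask reads a disjoint subset of whole tables.
public class SourceListSplitSketch {

    static List<List<String>> distribute(List<String> sourceList, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < sourceList.size(); i++) {
            partitions.get(i % numPartitions).add(sourceList.get(i)); // round-robin
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Illustrative naming for the article's scenario: four databases,
        // eight order tables in total, spread over three parallel subtasks.
        List<String> sources = List.of(
            "db0.t_order_0", "db0.t_order_1", "db1.t_order_0", "db1.t_order_1",
            "db2.t_order_0", "db2.t_order_1", "db3.t_order_0", "db3.t_order_1");
        System.out.println(distribute(sources, 3));
    }
}
```

With more partitions than tables, some partitions simply end up empty, which matches the intuition that table-level splitting cannot exceed the number of tables in parallelism.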
This concludes the discussion of task splitting in FlinkX.
3. Conclusion
This article introduced how to use FlinkX to split data-extraction tasks in MySQL sharding scenarios, covering basic Flink programming patterns, the InputFormat and SourceFunction class hierarchy, and practical splitting strategies.
Note: Detailed Flink API analysis will be covered in future articles; the current series does not follow a strict sequential order.
Thank you for reading – likes, comments, and shares are greatly appreciated.