Data Heterogeneity Explained: Sharding, Query Dimensions & MySQL Binlog with Canal
This article introduces the concept of data heterogeneity in large-scale systems, explains how sharding creates query challenges, describes query‑dimension and aggregation heterogeneity, and details implementation techniques such as subscribing to MySQL binlog and using Alibaba’s Canal for reliable data synchronization and storage.
1. What Is Data Heterogeneity?
In large systems, databases often use sharding (splitting data across multiple databases and tables) to address capacity and performance issues, but sharding makes cross‑dimensional queries and aggregation queries cumbersome.
For example, an order table may be split across multiple tables based on order ID.
As a result, a single user's orders can be scattered across many tables, so a query by user ID has to touch every one of them. Data heterogeneity solves this by building additional table structures organized by other query dimensions (for example, user ID), so that each dimension can be queried directly.
Alternatively, the data can be stored in Elasticsearch.
The heterogeneous store mainly holds the relationships between data (for example, which order IDs belong to a user); the actual rows are then fetched from the source databases at query time. Sometimes redundant copies of the data are also stored to improve query performance.
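The following is a minimal sketch of query‑dimension heterogeneity, not a production design. It assumes four in-memory shards routed by order ID modulo the shard count, and a hypothetical `orders_by_user` index that stores only the user-to-order relationship. For brevity the index is updated inline in `save_order`; in the approach this article describes, it would be populated from the binlog instead.

```python
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, for illustration only

# Source of truth: order tables sharded by order ID.
order_shards = [dict() for _ in range(NUM_SHARDS)]

# Heterogeneous index: user ID -> list of order IDs (relationship only).
orders_by_user = defaultdict(list)


def shard_for(order_id: int) -> dict:
    """Pick the shard that owns this order ID (simple modulo routing)."""
    return order_shards[order_id % NUM_SHARDS]


def save_order(order_id: int, user_id: int, amount: float) -> None:
    """Write the order to its shard and record the user -> order relationship."""
    shard_for(order_id)[order_id] = {
        "order_id": order_id,
        "user_id": user_id,
        "amount": amount,
    }
    orders_by_user[user_id].append(order_id)


def find_orders_by_user(user_id: int) -> list:
    """Look up order IDs in the index, then fetch each row from its owning shard."""
    return [shard_for(oid)[oid] for oid in orders_by_user.get(user_id, [])]


save_order(101, user_id=7, amount=20.0)
save_order(102, user_id=7, amount=35.5)
save_order(203, user_id=9, amount=12.0)
user7_orders = find_orders_by_user(7)
```

Here the two orders for user 7 live in different shards, yet the lookup touches only the shards named by the index rather than scanning all of them.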
This approach is called query‑dimension heterogeneity; there is also aggregation data heterogeneity.
For example, a product detail page includes basic info, attributes, and images. To display it, the front end must query three or more databases by product ID, illustrating an aggregation scenario.
If one of the databases is unstable or slow, the product page suffers; in such cases we can store the aggregated data in a KV‑store cluster.
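As a rough illustration of aggregation heterogeneity, the sketch below pre-builds one product document and stores it under a single key. The three source dictionaries, the `product:<id>` key format, and the plain `dict` standing in for a KV‑store cluster are all assumptions made for this example.

```python
import json

# Hypothetical source stores, each standing in for a separate database.
basic_info_db = {1001: {"name": "Mechanical keyboard", "price": 59.0}}
attributes_db = {1001: {"color": "black", "layout": "ANSI"}}
images_db = {1001: ["front.jpg", "side.jpg"]}

# Stand-in for a KV-store cluster: key -> serialized aggregate.
kv_store = {}


def rebuild_product_aggregate(product_id: int) -> None:
    """Read all three sources once and store the combined document under one key."""
    aggregate = {
        "basic": basic_info_db.get(product_id, {}),
        "attributes": attributes_db.get(product_id, {}),
        "images": images_db.get(product_id, []),
    }
    kv_store[f"product:{product_id}"] = json.dumps(aggregate)


def load_product_page(product_id: int):
    """Serve the page from a single KV read; returns None if nothing was aggregated."""
    raw = kv_store.get(f"product:{product_id}")
    return json.loads(raw) if raw is not None else None


rebuild_product_aggregate(1001)
page = load_product_page(1001)
```

With this layout, rendering the page depends on one KV lookup rather than on every source database being fast and available at request time.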
2. Implementation Methods of Data Heterogeneity
A common method is to subscribe to the database's change log, such as the MySQL binlog: a component simulates a replication slave, parses the log entries, and writes the resulting data to the heterogeneous store (for example, a per‑user order list).
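To make the "parse the log and write the data" step concrete, here is a small sketch of applying already-parsed row changes to a user-keyed order list. The event shape (`type`, `before`, `after`) is a hypothetical format invented for this example; it is not the binlog wire format or Canal's API.

```python
from collections import defaultdict

# Heterogeneous order list keyed by user ID: user_id -> {order_id: row}.
order_list_by_user = defaultdict(dict)


def apply_change_event(event: dict) -> None:
    """Apply one parsed row change to the user-dimension order list."""
    kind = event["type"]
    if kind in ("INSERT", "UPDATE"):
        row = event["after"]
        before = event.get("before")
        # If an update moved the order to another user, drop the stale entry.
        if before and before["user_id"] != row["user_id"]:
            order_list_by_user[before["user_id"]].pop(before["order_id"], None)
        order_list_by_user[row["user_id"]][row["order_id"]] = row
    elif kind == "DELETE":
        row = event["before"]
        order_list_by_user[row["user_id"]].pop(row["order_id"], None)
    else:
        raise ValueError(f"unsupported event type: {kind}")


events = [
    {"type": "INSERT", "before": None,
     "after": {"order_id": 1, "user_id": 7, "status": "NEW"}},
    {"type": "INSERT", "before": None,
     "after": {"order_id": 2, "user_id": 7, "status": "NEW"}},
    {"type": "UPDATE",
     "before": {"order_id": 1, "user_id": 7, "status": "NEW"},
     "after": {"order_id": 1, "user_id": 7, "status": "PAID"}},
    {"type": "DELETE",
     "before": {"order_id": 2, "user_id": 7, "status": "NEW"}, "after": None},
]

# Events must be applied in log order for the result to match the source.
for change in events:
    apply_change_event(change)
```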
2.1 MySQL Master‑Slave Replication Principle
Since this simulates master‑slave sync, let’s review MySQL replication.
Client writes data to the master.
Master records the changes in the binary log (binlog).
Slave subscribes to the master’s binlog, pulling logs via an I/O thread.
Slave’s I/O thread writes the fetched logs to a relay log for replay.
Slave’s SQL thread reads the relay log and replays the statements, achieving synchronization.
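The steps above can be sketched as a toy, single-process simulation. The `Master` and `Slave` classes and their method names are hypothetical, and the "threads" are ordinary method calls; real MySQL replication runs over a network protocol with actual threads and on-disk logs.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.binlog = []  # ordered list of change records

    def write(self, key, value):
        """Apply a client write and record it in the binlog."""
        self.data[key] = value
        self.binlog.append(("SET", key, value))


class Slave:
    def __init__(self, master):
        self.master = master
        self.relay_log = []
        self.fetched_position = 0  # next binlog index to pull
        self.applied_position = 0  # next relay log index to replay
        self.data = {}

    def io_thread_step(self):
        """Pull binlog entries not seen yet and append them to the relay log."""
        new_events = self.master.binlog[self.fetched_position:]
        self.relay_log.extend(new_events)
        self.fetched_position += len(new_events)

    def sql_thread_step(self):
        """Replay relay log entries, in order, that have not been applied yet."""
        while self.applied_position < len(self.relay_log):
            op, key, value = self.relay_log[self.applied_position]
            if op == "SET":
                self.data[key] = value
            self.applied_position += 1


master = Master()
slave = Slave(master)
master.write("a", 1)
master.write("b", 2)
slave.io_thread_step()
slave.sql_thread_step()
```

Tracking a position on the slave side is what lets it resume from where it stopped; a binlog subscriber such as Canal needs the same kind of bookkeeping.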
2.2 Canal Introduction
Canal is an open‑source Alibaba project based on MySQL binlog for incremental subscription and consumption. It acts like a slave database, subscribing to the master’s binlog, reading and parsing it to achieve data synchronization or heterogeneity.
Canal mimics the MySQL slave interaction protocol, disguising itself as a MySQL slave.
With Canal, you can subscribe to binlog events for purposes such as data mirroring, data heterogeneity, index building, and cache updates, while preserving the order and consistency of the changes.
[Figure: Canal architecture diagram]
First, deploy Canal server instances; only one is active while others are standby, with high availability managed by Zookeeper.
Canal server subscribes to database binlog via a master‑slave mechanism.
Canal client subscribes to the server, consumes changed table data, and writes it to mirror databases, heterogeneous databases, caches, etc., according to the application scenario.
Multiple Canal clients can be deployed, but only one is active; standby clients are managed by Zookeeper, which also tracks the current log position.
Canal server keeps binlog events only in memory, and only a single Canal client can consume them.
If multiple consumers are needed, the client can write data to a message queue, and separate consumers can process the queue, avoiding multiple Canal servers pulling the same binlog and overloading the database.
When the database is already under heavy load, Canal server can subscribe to an existing slave’s binlog, forming a master‑slave‑slave topology.
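The single-client, multiple-consumer arrangement described above can be sketched as follows. `InMemoryTopic` is a toy stand-in for a real message queue, and `relay_changes` plays the role of the one active client; both names, the subscriber names, and the event dictionaries are assumptions for illustration and do not reflect Canal's client API.

```python
from collections import deque


class InMemoryTopic:
    """Toy message queue topic: every subscriber receives every message."""

    def __init__(self):
        self._subscriber_queues = {}

    def subscribe(self, name):
        self._subscriber_queues[name] = deque()

    def publish(self, message):
        for queue in self._subscriber_queues.values():
            queue.append(message)

    def drain(self, name):
        """Return and remove all pending messages for one subscriber."""
        queue = self._subscriber_queues[name]
        messages = list(queue)
        queue.clear()
        return messages


def relay_changes(change_events, topic, start_position=0):
    """Single reader: publish events from start_position on, return the next position."""
    position = start_position
    for event in change_events[start_position:]:
        topic.publish(event)
        position += 1  # in a real deployment this position would be persisted
    return position


topic = InMemoryTopic()
topic.subscribe("cache_updater")
topic.subscribe("search_indexer")

change_events = [
    {"table": "orders", "type": "INSERT", "order_id": 1},
    {"table": "orders", "type": "UPDATE", "order_id": 1},
]
next_position = relay_changes(change_events, topic)
cache_messages = topic.drain("cache_updater")
index_messages = topic.drain("search_indexer")
```

The database's log is read once, and each downstream consumer works from its own copy of the stream, so adding a consumer does not add another reader on the database side.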
Canal project address: https://github.com/alibaba/canal
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
