Understanding Data Heterogeneity: Scenarios, Methods, and Implementation with Binlog, Canal, and MQ
This article explains the concept of data heterogeneity, outlines common use cases such as sharding and multi‑dimensional queries, and details practical implementation methods including full cloning, marked sync, binlog‑based replication with Canal, and MQ‑driven approaches, while providing deployment tips and references.
Common Application Scenarios
In sharding scenarios, databases are often split by order ID to improve query performance, but later business requirements may need queries by merchant ID, which becomes cumbersome. Data heterogeneity can solve this by rebuilding storage according to the new dimension.
Below is an illustration of how heterogeneity addresses the problem:
Data heterogeneity can be summarized into several typical scenarios:
Database mirroring
Real‑time database backup
Multi‑level indexing
Search build (e.g., multi‑dimensional queries after sharding)
Business cache refresh
Important business messages such as price or inventory changes
Data Heterogeneity Directions
In daily development, data can flow in several directions, the most common being DB‑to‑DB. After sharding by order ID, queries aggregated by user ID become difficult; rebuilding the data in a new table keyed by the new dimension (or moving it to Redis, Elasticsearch, etc.) solves the multi‑dimensional query need and improves performance under high traffic.
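The mismatch between the sharding key and the query dimension can be shown with a small sketch. Everything here (class and field names, shard count, sample IDs) is hypothetical, using in‑memory lists as stand‑ins for real shards:

```java
import java.util.*;

// Illustrative sketch (all names are hypothetical): orders are sharded by
// order ID, so a lookup by order ID hits exactly one shard, while a query by
// user ID must scatter-gather across every shard. A heterogeneous copy
// re-keyed by user ID answers the second query in a single lookup.
public class ShardLookupSketch {
    record Order(long orderId, long userId) {}

    // Route each order to a shard by order ID (the write-side dimension).
    static List<List<Order>> shard(List<Order> orders, int shardCount) {
        List<List<Order>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) shards.add(new ArrayList<>());
        for (Order o : orders) shards.get((int) (o.orderId() % shardCount)).add(o);
        return shards;
    }

    // Query by user ID against the sharded layout: must touch every shard.
    static long scatterGatherCount(List<List<Order>> shards, long userId) {
        return shards.stream().flatMap(List::stream)
                     .filter(o -> o.userId() == userId).count();
    }

    // Heterogeneous store: the same rows re-keyed by user ID for one-hop reads.
    static Map<Long, List<Order>> byUser(List<Order> orders) {
        Map<Long, List<Order>> index = new HashMap<>();
        for (Order o : orders) index.computeIfAbsent(o.userId(), k -> new ArrayList<>()).add(o);
        return index;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order(1001, 7), new Order(1002, 7), new Order(1003, 9));
        List<List<Order>> shards = shard(orders, 4);
        System.out.println(scatterGatherCount(shards, 7)); // prints 2, after touching all 4 shards
        System.out.println(byUser(orders).get(7L).size()); // prints 2, after one map lookup
    }
}
```

Both paths return the same rows; the heterogeneous index simply trades extra storage for a query that no longer fans out.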
Common Methods of Data Heterogeneity
1. Full Clone
This simply copies the entire source database A to a target database B, useful for offline statistical tasks but unsuitable for continuously growing data.
2. Marked Sync
For simple business scenarios where data rarely changes (e.g., log data), a marker such as a timestamp can be added; in case of failure, synchronization can resume from the last marked point.
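The marked‑sync idea can be sketched in a few lines. This is a minimal illustration with in‑memory stand‑ins for the source and target tables (the `Row` type and timestamps are assumptions, not part of the original article):

```java
import java.util.*;

// Minimal sketch of "marked sync": each pass copies only rows whose update
// timestamp is newer than the last saved mark, so a failed sync resumes from
// the mark instead of re-copying everything.
public class MarkedSyncSketch {
    record Row(long id, long updatedAt) {}

    // Copy rows newer than `mark` from source to target; return the new mark.
    static long syncOnce(List<Row> source, Map<Long, Row> target, long mark) {
        long newMark = mark;
        for (Row r : source) {
            if (r.updatedAt() > mark) {                 // skip rows already synced
                target.put(r.id(), r);
                newMark = Math.max(newMark, r.updatedAt());
            }
        }
        return newMark;                                 // persist this as the checkpoint
    }

    public static void main(String[] args) {
        List<Row> source = new ArrayList<>(List.of(new Row(1, 100), new Row(2, 200)));
        Map<Long, Row> target = new HashMap<>();
        long mark = syncOnce(source, target, 0);        // first pass copies both rows
        source.add(new Row(3, 300));
        mark = syncOnce(source, target, mark);          // second pass copies only row 3
        System.out.println(target.size() + " rows, mark=" + mark); // 3 rows, mark=300
    }
}
```

In a real system the mark would be persisted (e.g. in a checkpoint table) so a restarted worker picks up where the last run stopped; this is why the method suits append‑mostly data like logs.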
3. Binlog Method
By subscribing to the MySQL binlog in real time, the logs are consumed and the data structure is rebuilt in a new database or in other stores such as Elasticsearch or Solr. This approach helps maintain data consistency.
4. MQ Method
When writing to the primary DB, a copy of the data is also sent to a message queue, achieving dual‑write. This method is simple but cannot guarantee cross‑resource transaction consistency.
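The consistency gap of dual‑write is easy to demonstrate. The sketch below uses in‑memory stand‑ins rather than a real database or MQ client (all names are invented for illustration):

```java
import java.util.*;

// Sketch of the dual-write pattern: the write path saves to the primary store
// and then publishes a copy to a queue. If the publish fails after the DB
// write succeeded, the two resources diverge; this is exactly the
// cross-resource consistency the method cannot guarantee.
public class DualWriteSketch {
    // Write the order to the "database", then try to publish it to the "queue".
    static void saveOrder(Map<Long, String> db, Queue<String> mq,
                          long id, String payload, boolean mqAvailable) {
        db.put(id, payload);                 // step 1: primary write succeeds
        if (!mqAvailable) return;            // simulated publish failure: message lost
        mq.offer(id + ":" + payload);        // step 2: copy for downstream consumers
    }

    public static void main(String[] args) {
        Map<Long, String> db = new HashMap<>();
        Queue<String> mq = new ArrayDeque<>();
        saveOrder(db, mq, 1, "order-1", true);
        saveOrder(db, mq, 2, "order-2", false);   // publish fails; DB and MQ now disagree
        System.out.println("db=" + db.size() + " mq=" + mq.size()); // db=2 mq=1
    }
}
```

Mitigations such as local message tables or transactional messages exist, but the plain dual‑write shown here offers no such guarantee.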
Binlog Method
Binlog records every data change in MySQL. Open‑source components like Alibaba's Canal provide incremental subscription and consumption of binlog events.
Because a single Canal server keeps events only in memory, multiple consumer clients can be added via ActiveMQ or Kafka (shown by the green dashed box in the diagram).
To ensure full‑data consistency, a full‑sync worker program can be introduced (the dark‑green part in the diagram).
Canal Working Principle
First, consider MySQL master‑slave replication:
Replication consists of three steps:
Master writes changes to the binary log (binlog).
Slave copies the master’s binlog events to its relay log.
Slave replays the relay log events to reflect the changes locally.
Canal mimics the slave side of this process; its implementation is straightforward:
Canal pretends to be a MySQL slave and sends a dump request to the master.
Master pushes the binary log to Canal.
Canal parses the binary log (originally a byte stream) and forwards the data to the target store.
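What "forwarding to the target store" amounts to can be sketched as replaying parsed row events in order. This is a hypothetical simplification, not the real Canal client API; event and store types are invented for illustration:

```java
import java.util.*;

// Hypothetical sketch of the consumer side: the binlog parser delivers row
// changes (INSERT/UPDATE/DELETE) in binlog order, and replaying them against
// the heterogeneous store makes it converge to the source table.
public class BinlogReplaySketch {
    enum Type { INSERT, UPDATE, DELETE }
    record RowEvent(Type type, long pk, String value) {}

    // Replay a stream of row events into a fresh target store.
    static Map<Long, String> replay(List<RowEvent> events) {
        Map<Long, String> store = new HashMap<>();
        for (RowEvent e : events) {
            switch (e.type()) {
                case INSERT, UPDATE -> store.put(e.pk(), e.value()); // upsert the row image
                case DELETE -> store.remove(e.pk());                 // drop the deleted key
            }
        }
        return store;
    }

    public static void main(String[] args) {
        // Events in the order a ROW-format binlog stream would deliver them.
        Map<Long, String> target = replay(List.of(
                new RowEvent(Type.INSERT, 1, "a"),
                new RowEvent(Type.INSERT, 2, "b"),
                new RowEvent(Type.UPDATE, 1, "a2"),
                new RowEvent(Type.DELETE, 2, "")));
        System.out.println(target); // {1=a2}
    }
}
```

Ordering matters: because events are applied in binlog sequence, the target converges to the same final state as the source, which is what ROW‑format replication guarantees.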
For high availability, multiple Canal servers are deployed; only one processes the logs while others act as hot‑standby, coordinated by Zookeeper.
Notes
Ensure the MySQL binlog is enabled: `show variables like 'log_bin';` (`ON` means enabled).
Confirm the target database generates binlog: `show master status;` and check the `Binlog_Do_DB` and `Binlog_Ignore_DB` parameters.
Set the binlog format to ROW: check with `show variables like 'binlog_format';`. If it is not ROW, run `set global binlog_format=ROW; flush logs;` or change the MySQL configuration and restart.
Grant the privileges needed to read the binlog: `GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'admin'@'%' IDENTIFIED BY 'admin'; FLUSH PRIVILEGES;`
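To make these settings survive a restart, the binlog options can also be placed in the MySQL configuration file (`my.cnf` / `my.ini`). A minimal fragment might look like the following; the `server-id` value and log basename are examples:

```ini
[mysqld]
server-id     = 1          # must be unique among servers in the replication topology
log-bin       = mysql-bin  # enable the binary log with this basename
binlog-format = ROW        # ROW format is required for Canal to see full row images
```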
MQ Method
The MQ approach simply writes to the message queue alongside the DB write. While easy to implement, it cannot guarantee consistency across the two resources; for the same reason, remote RPC calls should never be placed inside a database transaction.
Summary
This article described the usage scenarios and methods of data heterogeneity, mentioning tools such as Canal and ActiveMQ without deep analysis. Readers can refer to the linked documentation for detailed usage.
By constructing storage in different locations according to the definition of data heterogeneity, many problems—like querying sharded data by alternative dimensions or offloading high‑traffic reads to caches like Redis—can be effectively solved.
Recommended Reading
Netty How to Achieve One Million Concurrent Connections on a Single Machine?
The Safest Encryption Algorithm Bcrypt – No More Data Leaks
Practical Guide: Spring Cloud Gateway Integrated with OAuth2.0 for Distributed Authentication and Authorization
Why Is Nacos So Powerful from an Implementation Perspective?
Alibaba's Rate‑Limiting Tool Sentinel – 17 Questions
OpenFeign – 9 Tough Questions
Spring Cloud Gateway – 10 Tough Questions
Code Ape Tech Column
Former Ant Group P8 engineer and pure technologist, sharing full‑stack Java, interview preparation, and career advice through this column. Site: java-family.cn