Understanding Data Heterogeneity: Scenarios, Methods, and Implementation with Binlog, Canal, and MQ
This article explains the concept of data heterogeneity, outlines common use cases such as sharding and multi‑dimensional queries, and details practical implementation methods including full cloning, marked sync, binlog‑based replication with Canal, and MQ‑driven approaches, while providing deployment tips and references.
Common Application Scenarios
In sharding scenarios, databases are often split by order ID to improve query performance, but later business requirements may need queries by merchant ID, which becomes cumbersome. Data heterogeneity can solve this by rebuilding storage according to the new dimension.
Below is an illustration of how heterogeneity addresses the problem:
Data heterogeneity can be summarized into several typical scenarios:
Database mirroring
Real‑time database backup
Multi‑level indexing
Search build (e.g., multi‑dimensional queries after sharding)
Business cache refresh
Important business messages such as price or inventory changes
Data Heterogeneity Directions
In daily development, data can flow in several directions, the most common being DB‑to‑DB. After sharding by order ID, queries aggregated by user ID become difficult; rebuilding the data in a new table keyed by the new dimension (or moving it to Redis, Elasticsearch, etc.) solves the multi‑dimensional query need and improves performance under high traffic.
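The mismatch between the sharding key and the query dimension can be shown with a small sketch. Everything here (class and field names, shard count, sample IDs) is hypothetical, using in‑memory lists as stand‑ins for real shards:

```java
import java.util.*;

// Illustrative sketch (all names are hypothetical): orders are sharded by
// order ID, so a lookup by order ID hits exactly one shard, while a query by
// user ID must scatter-gather across every shard. A heterogeneous copy
// re-keyed by user ID answers the second query in a single lookup.
public class ShardLookupSketch {
    record Order(long orderId, long userId) {}

    // Route each order to a shard by order ID (the write-side dimension).
    static List<List<Order>> shard(List<Order> orders, int shardCount) {
        List<List<Order>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) shards.add(new ArrayList<>());
        for (Order o : orders) shards.get((int) (o.orderId() % shardCount)).add(o);
        return shards;
    }

    // Query by user ID against the sharded layout: must touch every shard.
    static long scatterGatherCount(List<List<Order>> shards, long userId) {
        return shards.stream().flatMap(List::stream)
                     .filter(o -> o.userId() == userId).count();
    }

    // Heterogeneous store: the same rows re-keyed by user ID for one-hop reads.
    static Map<Long, List<Order>> byUser(List<Order> orders) {
        Map<Long, List<Order>> index = new HashMap<>();
        for (Order o : orders) index.computeIfAbsent(o.userId(), k -> new ArrayList<>()).add(o);
        return index;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order(1001, 7), new Order(1002, 7), new Order(1003, 9));
        List<List<Order>> shards = shard(orders, 4);
        System.out.println(scatterGatherCount(shards, 7)); // prints 2, after touching all 4 shards
        System.out.println(byUser(orders).get(7L).size()); // prints 2, after one map lookup
    }
}
```

Both paths return the same rows; the heterogeneous index simply trades extra storage for a query that no longer fans out.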
Common Methods of Data Heterogeneity
1. Full Clone
This simply copies the entire source database A to a target database B, useful for offline statistical tasks but unsuitable for continuously growing data.
2. Marked Sync
For simple business scenarios where data rarely changes (e.g., log data), a marker such as a timestamp can be added; in case of failure, synchronization can resume from the last marked point.
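The marked‑sync idea can be sketched in a few lines. This is a minimal illustration with in‑memory stand‑ins for the source and target tables (the `Row` type and timestamps are assumptions, not part of the original article):

```java
import java.util.*;

// Minimal sketch of "marked sync": each pass copies only rows whose update
// timestamp is newer than the last saved mark, so a failed sync resumes from
// the mark instead of re-copying everything.
public class MarkedSyncSketch {
    record Row(long id, long updatedAt) {}

    // Copy rows newer than `mark` from source to target; return the new mark.
    static long syncOnce(List<Row> source, Map<Long, Row> target, long mark) {
        long newMark = mark;
        for (Row r : source) {
            if (r.updatedAt() > mark) {                 // skip rows already synced
                target.put(r.id(), r);
                newMark = Math.max(newMark, r.updatedAt());
            }
        }
        return newMark;                                 // persist this as the checkpoint
    }

    public static void main(String[] args) {
        List<Row> source = new ArrayList<>(List.of(new Row(1, 100), new Row(2, 200)));
        Map<Long, Row> target = new HashMap<>();
        long mark = syncOnce(source, target, 0);        // first pass copies both rows
        source.add(new Row(3, 300));
        mark = syncOnce(source, target, mark);          // second pass copies only row 3
        System.out.println(target.size() + " rows, mark=" + mark); // 3 rows, mark=300
    }
}
```

In a real system the mark would be persisted (e.g. in a checkpoint table) so a restarted worker picks up where the last run stopped; this is why the method suits append‑mostly data like logs.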
3. Binlog Method
By subscribing to the MySQL binlog in real time, the logs are consumed and the data structure is rebuilt in a new database or in other stores such as Elasticsearch or Solr. This approach helps maintain data consistency.
4. MQ Method
When writing to the primary DB, a copy of the data is also sent to a message queue, achieving dual‑write. This method is simple but cannot guarantee cross‑resource transaction consistency.
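The consistency gap of dual‑write is easy to demonstrate. The sketch below uses in‑memory stand‑ins rather than a real database or MQ client (all names are invented for illustration):

```java
import java.util.*;

// Sketch of the dual-write pattern: the write path saves to the primary store
// and then publishes a copy to a queue. If the publish fails after the DB
// write succeeded, the two resources diverge; this is exactly the
// cross-resource consistency the method cannot guarantee.
public class DualWriteSketch {
    // Write the order to the "database", then try to publish it to the "queue".
    static void saveOrder(Map<Long, String> db, Queue<String> mq,
                          long id, String payload, boolean mqAvailable) {
        db.put(id, payload);                 // step 1: primary write succeeds
        if (!mqAvailable) return;            // simulated publish failure: message lost
        mq.offer(id + ":" + payload);        // step 2: copy for downstream consumers
    }

    public static void main(String[] args) {
        Map<Long, String> db = new HashMap<>();
        Queue<String> mq = new ArrayDeque<>();
        saveOrder(db, mq, 1, "order-1", true);
        saveOrder(db, mq, 2, "order-2", false);   // publish fails; DB and MQ now disagree
        System.out.println("db=" + db.size() + " mq=" + mq.size()); // db=2 mq=1
    }
}
```

Mitigations such as local message tables or transactional messages exist, but the plain dual‑write shown here offers no such guarantee.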
Binlog Method
Binlog records every data change in MySQL. Open‑source components like Alibaba's Canal provide incremental subscription and consumption of binlog events.
Because a single Canal server keeps events only in memory, multiple consumer clients can be added via ActiveMQ or Kafka (shown by the green dashed box in the diagram).
To ensure full‑data consistency, a full‑sync worker program can be introduced (the dark‑green part in the diagram).
Canal Working Principle
First, consider MySQL master‑slave replication:
Replication consists of three steps:
Master writes changes to the binary log (binlog).
Slave copies the master’s binlog events to its relay log.
Slave replays the relay log events to reflect the changes locally.
Canal mimics the slave side of this process; its implementation is straightforward:
Canal pretends to be a MySQL slave and sends a dump request to the master.
Master pushes the binary log to Canal.
Canal parses the binary log (originally a byte stream) and forwards the data to the target store.
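What "forwarding to the target store" amounts to can be sketched as replaying parsed row events in order. This is a hypothetical simplification, not the real Canal client API; event and store types are invented for illustration:

```java
import java.util.*;

// Hypothetical sketch of the consumer side: the binlog parser delivers row
// changes (INSERT/UPDATE/DELETE) in binlog order, and replaying them against
// the heterogeneous store makes it converge to the source table.
public class BinlogReplaySketch {
    enum Type { INSERT, UPDATE, DELETE }
    record RowEvent(Type type, long pk, String value) {}

    // Replay a stream of row events into a fresh target store.
    static Map<Long, String> replay(List<RowEvent> events) {
        Map<Long, String> store = new HashMap<>();
        for (RowEvent e : events) {
            switch (e.type()) {
                case INSERT, UPDATE -> store.put(e.pk(), e.value()); // upsert the row image
                case DELETE -> store.remove(e.pk());                 // drop the deleted key
            }
        }
        return store;
    }

    public static void main(String[] args) {
        // Events in the order a ROW-format binlog stream would deliver them.
        Map<Long, String> target = replay(List.of(
                new RowEvent(Type.INSERT, 1, "a"),
                new RowEvent(Type.INSERT, 2, "b"),
                new RowEvent(Type.UPDATE, 1, "a2"),
                new RowEvent(Type.DELETE, 2, "")));
        System.out.println(target); // {1=a2}
    }
}
```

Ordering matters: because events are applied in binlog sequence, the target converges to the same final state as the source, which is what ROW‑format replication guarantees.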
For high availability, multiple Canal servers are deployed; only one processes the logs while others act as hot‑standby, coordinated by Zookeeper.
Notes
Ensure the MySQL binlog is enabled: `show variables like 'log_bin';` (`ON` means enabled).
Confirm the target database generates binlog: `show master status;` and check the `Binlog_Do_DB` and `Binlog_Ignore_DB` parameters.
Set the binlog format to ROW: check with `show variables like 'binlog_format';`. If it is not ROW, run `set global binlog_format=ROW; flush logs;` or change the MySQL configuration and restart.
Grant the privileges needed to read the binlog: `GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'admin'@'%' IDENTIFIED BY 'admin'; FLUSH PRIVILEGES;`
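To make these settings survive a restart, the binlog options can also be placed in the MySQL configuration file (`my.cnf` / `my.ini`). A minimal fragment might look like the following; the `server-id` value and log basename are examples:

```ini
[mysqld]
server-id     = 1          # must be unique among servers in the replication topology
log-bin       = mysql-bin  # enable the binary log with this basename
binlog-format = ROW        # ROW format is required for Canal to see full row images
```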
MQ Method
The MQ approach simply writes to the message queue alongside the DB write. While easy to implement, it cannot guarantee consistency across the two resources; for the same reason, remote RPC calls should never be placed inside a database transaction.
Summary
This article described the usage scenarios and methods of data heterogeneity, mentioning tools such as Canal and ActiveMQ without deep analysis. Readers can refer to the linked documentation for detailed usage.
By constructing storage in different locations according to the definition of data heterogeneity, many problems—like querying sharded data by alternative dimensions or offloading high‑traffic reads to caches like Redis—can be effectively solved.
Recommended Reading
Netty How to Achieve One Million Concurrent Connections on a Single Machine?
The Safest Encryption Algorithm Bcrypt – No More Data Leaks
Practical Guide: Spring Cloud Gateway Integrated with OAuth2.0 for Distributed Authentication and Authorization
Why Is Nacos So Powerful from an Implementation Perspective?
Alibaba's Rate‑Limiting Tool Sentinel – 17 Questions
OpenFeign – 9 Tough Questions
Spring Cloud Gateway – 10 Tough Questions
Code Ape Tech Column
Former Ant Group P8 engineer and pure technologist, sharing full‑stack Java, interview preparation, and career advice through this column. Site: java-family.cn