Mastering MySQL to Elasticsearch Sync: 4 Strategies & Top Migration Tools
This article explores four practical methods for synchronizing MySQL data to Elasticsearch—including synchronous and asynchronous double writes, SQL extraction, and binlog real‑time replication—while reviewing popular migration tools such as Canal, Alibaba DTS, and Databus to help you choose the right solution.
1. Introduction
In many projects MySQL serves as the primary business database while Elasticsearch is used for query operations to achieve read‑write separation, reduce MySQL query load, and handle complex large‑scale searches. A key challenge is keeping MySQL and Elasticsearch data synchronized.
2. Data Synchronization Solutions
2.1 Synchronous Double Write
The simplest approach writes data to both MySQL and Elasticsearch simultaneously.
Advantages: simple business logic and high real‑time performance.
Business logic is simple.
High real‑time capability.
Disadvantages: hard‑coded, strong coupling, risk of data loss on double‑write failure, and performance degradation.
Requires code changes wherever MySQL is written.
Strong coupling between services.
Potential data loss if one write fails.
Overall performance suffers.
2.2 Asynchronous Double Write
For multiple data sources, a message queue (MQ) can be used to write asynchronously.
Advantages: high performance, reduced data loss thanks to MQ consumption guarantees, and easy extension to additional sources.
High performance.
MQ ensures retry on ES downtime or write failures.
Isolated writes allow easy addition of new sources.
Disadvantages: hard‑coded integration, increased system complexity, and potential latency due to asynchronous processing.
New sources require new consumer code.
Introduces message‑queue overhead.
Data may not be visible immediately.
2.3 SQL Extraction (Timer‑Based)
This method adds a timestamp column to relevant tables, uses a scheduled job to poll changed rows, and writes them to Elasticsearch, avoiding code intrusion.
Add a timestamp column to tables.
Leave existing CRUD operations unchanged.
Run a periodic timer to scan for changes.
Write each changed row to Elasticsearch.
Advantages: no code changes, no strong coupling, and simple worker implementation.
No intrusion into existing code.
No performance impact on the business system.
Worker logic is straightforward.
Disadvantages: lower timeliness due to polling intervals and added load on the database.
Potential delay even with second‑level intervals.
Polling can stress the DB; using a replica can mitigate.
Classic solution: use Logstash to periodically query new data via SQL and write incremental changes to Elasticsearch.
2.4 Binlog Real‑Time Synchronization
To achieve real‑time sync without code intrusion, MySQL binlog can be leveraged.
Steps:
Read MySQL binlog to obtain table change events.
Convert events to MQ messages.
Develop an MQ consumer.
Consume messages and write each change to Elasticsearch.
Advantages: no code changes, high performance, and business decoupling.
No intrusion or hard‑coding.
Existing systems remain unchanged.
High throughput.
Business logic is independent of sync process.
Disadvantages: complex binlog setup and possible MQ latency.
Building a binlog‑based system is intricate.
If MQ is used, it inherits MQ’s delay risk.
3. Data Migration Tool Selection
Among the four sync methods, binlog real‑time sync is most common and has spawned several robust migration tools that capture MySQL binlog changes (CDC) and forward INSERT, UPDATE, DELETE events downstream.
3.1 Canal
Canal acts as a MySQL slave to subscribe to binlog, parse it into JSON, and deliver changes via TCP or MQ.
Canal server requests dump protocol from MySQL master.
Master pushes binlog; Canal parses bytes to JSON.
Canal client listens via TCP or MQ and syncs data to Elasticsearch.
3.2 Alibaba Cloud DTS
Data Transmission Service supports multiple source types (RDBMS, NoSQL, OLAP) and offers data migration, real‑time subscription, and synchronization with high performance (up to 70 MB/s, 200 k TPS) and high availability.
Supports many data sources.
Multiple transmission modes.
High throughput and TPS.
Clustered architecture ensures failover.
Visual UI and wizard‑based link creation.
Commercial service (paid).
3.3 Databus
Databus, an open‑source LinkedIn project, captures low‑latency, reliable, transactional changes from databases (MySQL, Oracle) and delivers them to consumers.
Supports multiple source types.
Highly scalable and available.
Preserves transaction order.
Millisecond‑level latency with flexible subscription.
Unlimited replay for consumers.
4. Conclusion
This article presented various MySQL‑to‑Elasticsearch synchronization approaches and introduced several popular data migration tools, helping you select the most suitable solution for your project.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.