
MySQL to Elasticsearch Data Synchronization Strategies and Tools

This article examines various methods for synchronizing MySQL data with Elasticsearch, including synchronous and asynchronous dual‑write, Logstash pipelines, binlog real‑time replication, Canal, and Alibaba Cloud DTS, comparing their architectures, advantages, disadvantages, and suitable application scenarios.

Top Architect

Aug 30, 2024


Overview

MySQL often serves as the core business database, but as data volume and query complexity grow, serving every search and retrieval request from MySQL alone becomes a performance bottleneck. Adding Elasticsearch (ES) as a dedicated query engine offloads those queries and brings faster search, flexible data modeling, and horizontal scalability.

Ensuring reliable data synchronization between MySQL and ES is critical for real‑time accuracy, system stability, and a good user experience.

Synchronization Schemes

1. Synchronous Dual‑Write

Synchronous dual‑write means that the application writes each change to the primary database (MySQL) and to ES within the same request, so ES reflects the change as soon as the request returns. Serving queries from ES also reduces read pressure on MySQL.

Implementation Methods

Direct write in business code – simple but increases coupling and risk of errors.

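Middleware – use message queues (Kafka), change‑data‑capture tools (Debezium), or ETL tools (Logstash) to capture MySQL changes and forward them to ES, decoupling business logic from sync logic. These paths are asynchronous in practice, so they trade the immediacy of a direct write for looser coupling.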

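Triggers or stored procedures – set up MySQL triggers or procedures to propagate each data change toward ES, reducing code intrusion but adding load to MySQL. MySQL has no built‑in way to call ES from a trigger, so this usually means writing to a change table that a separate process reads, or installing a user‑defined function.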

Pros

Simple business logic.

High real‑time query capability.

Cons

Hard‑coded in business code; every MySQL write point needs ES write code.

Strong coupling between business code and sync logic.

Risk of data loss if dual‑write fails.

Additional write overhead can degrade overall performance.
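A minimal sketch of the direct‑write variant and its main failure mode follows. The InMemoryStore class and the save_product function are hypothetical stand‑ins invented for illustration; a real service would use a MySQL driver and an ES client instead.

```python
class InMemoryStore:
    """Hypothetical stand-in for a MySQL table or an ES index."""

    def __init__(self, fail=False):
        self.rows = {}
        self.fail = fail

    def upsert(self, key, doc):
        if self.fail:
            raise ConnectionError("store unavailable")
        self.rows[key] = dict(doc)


def save_product(mysql, es, product_id, doc):
    """Write to MySQL first, then to ES, inside the same request."""
    mysql.upsert(product_id, doc)       # source of truth
    try:
        es.upsert(product_id, doc)      # second write; the caller waits for it
    except ConnectionError:
        # MySQL already holds the row and ES does not. Without a retry
        # or a repair job, the two stores now disagree.
        return False
    return True


mysql, es = InMemoryStore(), InMemoryStore()
ok = save_product(mysql, es, 1, {"name": "keyboard"})

flaky_es = InMemoryStore(fail=True)
failed = save_product(mysql, flaky_es, 2, {"name": "mouse"})
```

The second call shows why a failed dual‑write is listed as a risk: the MySQL write has succeeded, the ES write has not, and nothing in this code path brings the two back in line.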

2. Asynchronous Dual‑Write

Asynchronous dual‑write allows MySQL writes to be captured and propagated to ES asynchronously, reducing write latency on the primary database and improving overall system performance.

Advantages

Higher system availability – failures in the secondary store (ES) do not affect writes to the primary database.

Reduced primary write latency – no need to wait for ES acknowledgment.

Multiple data sources can be added independently.

Disadvantages

Hard‑coded consumer code required for each new data source.

Increased system complexity due to message middleware.

Potential delay in data visibility because of asynchronous consumption.

Temporary data inconsistency between MySQL and ES; additional measures (such as retries and periodic reconciliation) are needed to reach eventual consistency.
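The asynchronous flow can be sketched as a producer that writes to MySQL and enqueues a change, plus a consumer that applies changes to ES. Everything here is an assumption made for illustration: the dictionaries stand in for MySQL and ES, queue.Queue stands in for a message broker such as Kafka, and the function names are hypothetical.

```python
import queue
import threading

mysql_rows = {}           # stand-in for the MySQL table
es_docs = {}              # stand-in for the ES index
changes = queue.Queue()   # stand-in for a message queue such as Kafka


def save_order(order_id, doc):
    """Write to MySQL and return without waiting for ES."""
    mysql_rows[order_id] = dict(doc)
    changes.put(("upsert", order_id, dict(doc)))


def delete_order(order_id):
    mysql_rows.pop(order_id, None)
    changes.put(("delete", order_id, None))


def es_consumer():
    """Apply queued changes to ES in the order they were produced."""
    while True:
        op, key, doc = changes.get()
        try:
            if op == "stop":
                return
            if op == "upsert":
                es_docs[key] = doc
            else:
                es_docs.pop(key, None)
        finally:
            changes.task_done()


worker = threading.Thread(target=es_consumer, daemon=True)
worker.start()

save_order(1, {"status": "paid"})
save_order(2, {"status": "new"})
delete_order(2)

changes.join()                      # wait until the consumer has caught up
changes.put(("stop", None, None))
worker.join()
```

Until changes.join() returns, ES may lag behind MySQL. That window is the visibility delay and temporary inconsistency described in the list above.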

3. Logstash Synchronization

Logstash is an open‑source data processing pipeline that can ingest data from multiple sources, transform it, and send it to a target repository. It can be used to synchronize MySQL with ES without modifying existing business code.

Pros

Non‑intrusive – no code changes required.

No strong coupling with business logic; original program performance remains unchanged.

Cons

Latency due to periodic polling; even with second‑level intervals, some delay persists.

Polling adds load to the database; can be mitigated by using a read‑replica.

Does not handle delete synchronization automatically; manual ES delete commands are needed.

The ES document _id must match the MySQL primary key, so that repeated polls overwrite existing documents instead of creating duplicates.
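The polling behaviour behind these trade‑offs can be illustrated in a few lines. This is not Logstash configuration. It is a sketch of the same idea (remember the last value of a tracking column, then fetch only newer rows), with sqlite3 standing in for MySQL and a dictionary standing in for ES. The table, column, and function names are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")   # stand-in for MySQL
db.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
db.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "keyboard", 100), (2, "mouse", 105)],
)

es_docs = {}     # stand-in for the ES index, keyed by primary key
last_seen = 0    # highest updated_at value processed so far


def poll_once():
    """Fetch rows changed since the previous poll and upsert them into ES."""
    global last_seen
    rows = db.execute(
        "SELECT id, name, updated_at FROM product "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for row_id, name, updated_at in rows:
        es_docs[row_id] = {"name": name}   # _id equals the primary key
        last_seen = updated_at
    return len(rows)


first = poll_once()    # picks up both initial rows
db.execute("UPDATE product SET name = 'trackball', updated_at = 110 WHERE id = 2")
db.execute("DELETE FROM product WHERE id = 1")
second = poll_once()   # sees the update, cannot see the delete
```

After the second poll, document 1 is still present in es_docs even though the row is gone from the table. A deleted row never matches the query, which is why deletes need separate handling.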

4. Binlog Real‑Time Synchronization

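Binlog (binary log) records every data change in MySQL as an event: the SQL statement in statement‑based format, or the affected row values in row‑based format. Real‑time sync tools (e.g., Canal, Maxwell) read these events, typically in row format, and replicate the changes to ES or other targets.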

Advantages

Real‑time capture and synchronization.

Changes are replayed in commit order, which keeps the target closely consistent with the source, subject to a short replication lag.

Flexibility to sync across various databases and storage systems.

Scalable and extensible to meet business needs.

No code intrusion; existing systems remain unchanged.

Disadvantages

Configuration and maintenance can be complex.

High‑concurrency environments may experience performance impact on MySQL due to binlog processing.

Tooling depends on binlog support; database version or configuration changes may require re‑configuration.
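On the ES side, a binlog consumer ultimately turns row changes into index and delete operations. The sketch below converts a list of row events into the newline‑delimited body that the ES bulk API accepts. The event shape (type, pk, after) is a hypothetical simplification chosen for this example, not the output format of any particular tool.

```python
import json


def to_bulk_body(events, index):
    """Translate row-change events into an ES bulk request body (NDJSON)."""
    lines = []
    for ev in events:
        meta = {"_index": index, "_id": str(ev["pk"])}
        if ev["type"] == "delete":
            lines.append(json.dumps({"delete": meta}))
        else:
            # Inserts and updates both become a full-document index action,
            # so replaying the same event twice leaves the same result.
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(ev["after"]))
    return "\n".join(lines) + "\n"   # the bulk body must end with a newline


events = [
    {"type": "insert", "pk": 7, "after": {"name": "keyboard"}},
    {"type": "update", "pk": 7, "after": {"name": "keyboard v2"}},
    {"type": "delete", "pk": 8, "after": None},
]
body = to_bulk_body(events, "products")
```

Using the primary key as _id and sending full documents keeps the operation idempotent, which matters because binlog consumers commonly re‑deliver events after a restart.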

5. Canal Synchronization

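Canal, an open‑source project from Alibaba, presents itself to MySQL as a slave (replica) and subscribes to binlog events. It parses the events into structured change records that can be serialized as JSON and written to ES through its RESTful APIs. Latency is typically at the millisecond level, and the load on the source database is similar to that of one additional replica.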

Sync Process

Canal server connects to MySQL master using the dump protocol.

MySQL master pushes binlog data to Canal, which parses it into JSON.

Canal client receives the JSON (via TCP or MQ) and writes it to ES.
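The last step of the process, a client applying a received JSON message to ES, might look like the sketch below. The field names (type, pkNames, data, isDdl) are modeled on the flat JSON messages Canal produces when it delivers to a message queue. Treat them as assumptions and verify them against the Canal version in use. The dictionary standing in for ES and the function name are hypothetical.

```python
import json


def apply_canal_message(raw, es_docs):
    """Apply one Canal-style JSON change message to an ES stand-in."""
    msg = json.loads(raw)
    if msg.get("isDdl"):
        return 0                      # schema changes are skipped in this sketch
    pk = msg["pkNames"][0]            # assumes a single-column primary key
    applied = 0
    for row in msg.get("data") or []:
        doc_id = row[pk]
        if msg["type"] == "DELETE":
            es_docs.pop(doc_id, None)
        elif msg["type"] in ("INSERT", "UPDATE"):
            es_docs[doc_id] = row     # full row image replaces the document
        else:
            continue
        applied += 1
    return applied


es_docs = {}
insert_msg = json.dumps({
    "type": "INSERT", "table": "product", "isDdl": False,
    "pkNames": ["id"], "data": [{"id": "1", "name": "keyboard"}],
})
update_msg = json.dumps({
    "type": "UPDATE", "table": "product", "isDdl": False,
    "pkNames": ["id"], "data": [{"id": "1", "name": "keyboard v2"}],
})
delete_msg = json.dumps({
    "type": "DELETE", "table": "product", "isDdl": False,
    "pkNames": ["id"], "data": [{"id": "1", "name": "keyboard v2"}],
})

n1 = apply_canal_message(insert_msg, es_docs)
n2 = apply_canal_message(update_msg, es_docs)
after_update = dict(es_docs)
n3 = apply_canal_message(delete_msg, es_docs)
```

Unlike the polling approach, the delete arrives as an explicit event, so the document is removed from ES without any extra bookkeeping.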

6. Alibaba Cloud Data Transmission Service (DTS)

DTS offers real‑time data flow between heterogeneous data sources, supporting RDBMS, NoSQL, and OLAP. It provides initialization (full data load) and incremental synchronization, ensuring high availability, dynamic endpoint adaptation, and seamless scaling.

Key Features

High availability with active‑standby modules.

Dynamic adaptation to data source address changes.

Supports both OLTP‑to‑OLAP migration and continuous data sync.


