How We Achieved Near‑Real‑Time MySQL‑to‑Elasticsearch Sync Using Binlog and Kafka
This article explains why traditional MySQL queries no longer meet the growing e‑commerce data needs, describes the limitations of a MySQL‑to‑Elasticsearch intermediate table, and details a binlog‑driven, Kafka‑based pipeline with custom modules, upsert handling, filtering, and monitoring to ensure fast, reliable data synchronization.
1. Why We Need It
As Mafengwo's business expands, the volume of product and order data has outgrown plain MySQL queries, especially for multi‑dimensional searches. Storing business data in Elasticsearch solves search requirements, but introduces the challenge of keeping MySQL and Elasticsearch data synchronized.
2. Existing Methods and Problems
Our current solution uses a MySQL intermediate table that mirrors the data needed for the Elasticsearch indexes. A cron script periodically extracts incremental data by selecting rows whose update timestamp is newer than the last run and writes them to Elasticsearch. This approach forces business logic to write to both the source tables and the intermediate table, and every new near‑real‑time Elasticsearch requirement adds yet another write path.
When the intermediate table grows large, adding new columns for Elasticsearch mappings becomes time‑consuming, and developers must maintain additional code to map new fields, increasing development cost.
3. Solution Design
Overall Idea – Instead of polling, we capture MySQL changes directly from the binary log (Binlog). Binlog, used by MySQL replication, provides the exact data changes needed for Elasticsearch and satisfies real‑time sync requirements.
We adopted the open‑source go-mysql-elasticsearch project and customized it for Mafengwo's stack.
3.1 Data Sync Correctness Guarantees
Because Binlog data is sensitive, it is not exposed directly; instead it is published to Kafka, and consumers must request access and read the stream as a Kafka consumer group. To ensure order and completeness:
Order – Each Binlog entry is hashed by its primary key to a specific Kafka partition, guaranteeing that all changes for a given row stay in the same partition and are consumed in order.
Completeness – We commit a Kafka offset only after the corresponding document is successfully written to Elasticsearch, preventing data loss even during consumer rebalance or failures.
4. Technical Implementation
The system consists of several modules:
Configuration Parsing Module – Parses TOML/JSON files or configuration‑center JSON strings for Kafka, Elasticsearch, MySQL‑to‑Elasticsearch field mappings, etc.
Rule Module – Determines which Elasticsearch index a Binlog entry belongs to, the document ID field, and field‑to‑mapping conversions. Supports custom WHERE filters to drop unnecessary rows.
Kafka Module – Connects to Kafka, authenticates via SASL, and consumes from specified offsets.
Binlog Parsing Module – In our case the Binlog is already converted to JSON by Canal, so the parser simply deserializes the JSON.
Elasticsearch Module – Buffers key‑value maps and bulk‑writes to Elasticsearch every 200 ms or when a size threshold is reached.
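The Elasticsearch module's buffering strategy, flush on a size threshold or on a 200 ms timer, whichever comes first, can be sketched roughly as follows. The type and field names are invented for illustration and do not come from the actual project:

```go
package main

import (
	"fmt"
	"time"
)

// bulkBuffer collects documents and flushes them either as soon as
// maxBatch documents accumulate, or when a periodic timer fires.
type bulkBuffer struct {
	docs     []map[string]interface{}
	maxBatch int
	flush    func(batch []map[string]interface{}) // e.g. an ES bulk write
}

// add appends a document and triggers a size-based flush at the threshold.
func (b *bulkBuffer) add(doc map[string]interface{}) {
	b.docs = append(b.docs, doc)
	if len(b.docs) >= b.maxBatch {
		b.doFlush()
	}
}

// doFlush hands the buffered batch to the flush callback and resets it.
func (b *bulkBuffer) doFlush() {
	if len(b.docs) == 0 {
		return
	}
	b.flush(b.docs)
	b.docs = nil
}

func main() {
	buf := &bulkBuffer{maxBatch: 2, flush: func(batch []map[string]interface{}) {
		fmt.Printf("bulk write of %d docs\n", len(batch))
	}}
	ticker := time.NewTicker(200 * time.Millisecond) // time-based flush interval
	defer ticker.Stop()

	buf.add(map[string]interface{}{"order_id": 1})
	buf.add(map[string]interface{}{"order_id": 2}) // hits maxBatch: size-based flush

	buf.add(map[string]interface{}{"order_id": 3})
	<-ticker.C    // wait for the next tick
	buf.doFlush() // time-based flush picks up the remainder
}
```

In a real deployment the flush callback would issue the Elasticsearch bulk request and the ticker loop would run in its own goroutine.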
Key customizations include:
Upsert Support – Adds {"doc_as_upsert": true} to bulk update actions so that an update arriving for a document that does not yet exist creates it instead of failing.
Filtering – Allows SQL‑like conditions (e.g., SELECT * FROM sometable WHERE type IN (1,2)) to sync only relevant rows.
Fast Incremental Sync – Before starting the consumer group, seeks the Kafka offsets to a chosen timestamp, so consumption begins from recent Binlog entries instead of replaying the full backlog.
Microservice & Configuration Center – Deploys as a microservice with centralized configuration for rapid onboarding and scaling.
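To make the upsert customization concrete, here is a Go sketch of the two NDJSON lines that an Elasticsearch bulk update action with doc_as_upsert consists of. The index name, document ID, and field values are made up; the helper function is not from the actual project:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildUpsertLines renders one document as the two NDJSON lines of an
// Elasticsearch bulk "update" action. With doc_as_upsert set, an
// update that reaches ES before the original insert still creates
// the document rather than returning a document-missing error.
func buildUpsertLines(index, id string, doc map[string]interface{}) (string, error) {
	action := map[string]interface{}{
		"update": map[string]interface{}{"_index": index, "_id": id},
	}
	body := map[string]interface{}{
		"doc":           doc,
		"doc_as_upsert": true, // create the document if it is missing
	}
	a, err := json.Marshal(action)
	if err != nil {
		return "", err
	}
	b, err := json.Marshal(body)
	if err != nil {
		return "", err
	}
	return string(a) + "\n" + string(b) + "\n", nil
}

func main() {
	lines, _ := buildUpsertLines("orders", "10042", map[string]interface{}{"status": 2})
	fmt.Print(lines)
	// {"update":{"_id":"10042","_index":"orders"}}
	// {"doc":{"status":2},"doc_as_upsert":true}
}
```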
5. Logging and Monitoring
All Binlog flows are logged (Kafka message, bulk payload, Elasticsearch response). Two main metrics are monitored:
Sync latency – time from Binlog generation to Elasticsearch write (average ~1 s for order data).
Heartbeat – a dedicated table updated every minute; if the latest heartbeat is not reflected in Elasticsearch, an alert is triggered.
Alerts are sent via ElastAlert when latency spikes or heartbeat checks fail.
6. Conclusion
The most critical use case is e‑commerce order indexing, where sync latency remains around 1 second. This localized implementation of an open‑source Binlog‑to‑Elasticsearch pipeline provides a reference for other teams needing reliable, low‑latency data synchronization.
Author: Zhang Kun, Senior R&D Engineer, Mafengwo E‑commerce Team.
Mafengwo Technology
External communication platform of the Mafengwo Technology team, regularly sharing articles on advanced tech practices, tech exchange events, and recruitment.