How We Built Real-Time MySQL-to-Elasticsearch Sync with Binlog and Kafka
To meet growing e‑commerce search demands, the team replaced a MySQL‑based intermediate table with a real‑time binlog‑driven pipeline that streams changes through Kafka into Elasticsearch, detailing design choices, ordering and completeness guarantees, custom modules, and monitoring for sub‑second sync latency.
Background and Motivation
As the e‑commerce platform expanded, the volume of business data grew beyond what a single MySQL database could efficiently query, especially for multi‑dimensional searches on products and orders. Elasticsearch was introduced to satisfy search requirements, but this created a new challenge: keeping MySQL and Elasticsearch data in sync.
Existing Approach and Its Limitations
The initial solution used a MySQL intermediate table that mirrored the data needed for Elasticsearch indexes. A cron‑based script read rows whose UTime field indicated changes since the last run and wrote them to Elasticsearch. This approach suffered from several problems:
Any change to business fields required updates to both the primary MySQL tables and the intermediate table.
High‑frequency updates demanded dual writes to MySQL and Elasticsearch to maintain low latency.
As data volume grew, adding new columns to the intermediate table became time‑consuming.
Each new Elasticsearch mapping field required additional code to populate the intermediate table, increasing development effort.
Design Overview
To address these issues, the team adopted a binlog‑driven synchronization model. MySQL’s binary log (binlog), which records all changes for replication, serves as the source of truth. By consuming binlog events, the system can push changes directly to Elasticsearch, eliminating the need for an intermediate table and reducing latency.
Correctness Guarantees
Ordering : Kafka is used as the transport layer. Although Kafka cannot guarantee global ordering, it preserves order within each partition. Binlog events are hashed by primary key and routed to specific partitions, ensuring that all events for a given row arrive in order. A single consumer per partition processes updates sequentially, preserving update order in Elasticsearch.
Completeness : The consumer commits a Kafka offset only after the corresponding bulk request to Elasticsearch succeeds. This guarantees that no binlog event is lost even if the consumer crashes or a rebalance occurs. Duplicate consumption is harmless because Elasticsearch bulk operations are idempotent under the chosen workflow.
Technical Implementation
The solution builds on the open‑source project go-mysql-elasticsearch, customized for the company’s stack.
Core Modules
Configuration Parsing : Reads TOML or JSON files (or JSON strings from the Skipper config center) to obtain Kafka, Elasticsearch, and MySQL‑to‑Elasticsearch mapping settings.
Rule Engine : Determines which Elasticsearch index a binlog event should target, maps MySQL fields to index mappings, and defines the document ID. It also supports per‑field where clauses to filter out unwanted events.
Kafka Consumer : Connects to the Kafka cluster, authenticates via SASL, and consumes binlog messages from a specified offset. The client is a generic Golang Kafka consumer that can handle dynamic partition rebalances.
Binlog Parsing : The original parser handled raw MySQL replication packets; in this deployment the binlog payload is already JSON‑encoded by Canal, so the parser simply deserializes the JSON.
Elasticsearch Writer : Buffers transformed key‑value maps and issues bulk _bulk requests every 200 ms or when the buffer reaches a configurable size, reducing write pressure on Elasticsearch.
Custom Extensions
Upsert Support : When data for the same document arrives out of order, the bulk payload includes { "doc_as_upsert": true } to ensure the document is created if missing.
{
"doc_as_upsert" : true
}Field Filtering : Allows SQL‑like filters, e.g., select * from sometable where type in (1,2), so only selected rows are synchronized. select * from sometable where type in (1,2) Fast Incremental Sync : After a full‑load, the consumer seeks to the latest offsets for each partition at a given timestamp, enabling immediate consumption of recent binlog events without replaying the entire backlog.
Microservice & Config Center Integration : The sync service runs as a microservice managed by the company’s internal deployment platform. Configuration is stored in a centralized config center, replacing static TOML files and simplifying per‑environment management.
Logging and Monitoring
All binlog events, Kafka messages, bulk request payloads, and Elasticsearch responses are logged via the existing ELK stack. Two primary metrics are monitored:
Sync Latency : Time from binlog generation to successful write in Elasticsearch, typically around 1 second for order data. Alerts are triggered via ElastAlert when latency exceeds thresholds.
Heartbeat : A dedicated table is updated every minute by a cron job; the sync service verifies that the latest heartbeat record appears in Elasticsearch, alerting if the pipeline stalls.
Conclusion
The binlog‑driven, Kafka‑backed pipeline provides sub‑second synchronization of MySQL order data to Elasticsearch, eliminating the need for a heavyweight intermediate table and reducing development overhead. The open‑source foundation combined with targeted customizations offers a reusable pattern for any business requiring real‑time search indexing.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
