Backend Development 13 min read

How We Built Real-Time MySQL-to-Elasticsearch Sync with Binlog and Kafka

To meet growing e‑commerce search demands, the team replaced a MySQL‑based intermediate table with a real‑time binlog‑driven pipeline that streams changes through Kafka into Elasticsearch, detailing design choices, ordering and completeness guarantees, custom modules, and monitoring for sub‑second sync latency.

dbaplus Community

Sep 3, 2019

How We Built Real-Time MySQL-to-Elasticsearch Sync with Binlog and Kafka

Background and Motivation

As the e‑commerce platform expanded, the volume of business data grew beyond what a single MySQL database could efficiently query, especially for multi‑dimensional searches on products and orders. Elasticsearch was introduced to satisfy search requirements, but this created a new challenge: keeping MySQL and Elasticsearch data in sync.

Existing Approach and Its Limitations

The initial solution used a MySQL intermediate table that mirrored the data needed for Elasticsearch indexes. A cron‑based script read rows whose UTime field indicated changes since the last run and wrote them to Elasticsearch. This approach suffered from several problems:

Any change to business fields required updates to both the primary MySQL tables and the intermediate table.

High‑frequency updates demanded dual writes to MySQL and Elasticsearch to maintain low latency.

As data volume grew, adding new columns to the intermediate table became time‑consuming.

Each new Elasticsearch mapping field required additional code to populate the intermediate table, increasing development effort.

Design Overview

To address these issues, the team adopted a binlog‑driven synchronization model. MySQL’s binary log (binlog), which records all changes for replication, serves as the source of truth. By consuming binlog events, the system can push changes directly to Elasticsearch, eliminating the need for an intermediate table and reducing latency.

Correctness Guarantees

Ordering : Kafka is used as the transport layer. Although Kafka cannot guarantee global ordering, it preserves order within each partition. Binlog events are hashed by primary key and routed to specific partitions, ensuring that all events for a given row arrive in order. A single consumer per partition processes updates sequentially, preserving update order in Elasticsearch.

Completeness : The consumer commits a Kafka offset only after the corresponding bulk request to Elasticsearch succeeds. This guarantees that no binlog event is lost even if the consumer crashes or a rebalance occurs. Duplicate consumption is harmless because Elasticsearch bulk operations are idempotent under the chosen workflow.

Technical Implementation

The solution builds on the open‑source project go-mysql-elasticsearch, customized for the company’s stack.

Core Modules

Configuration Parsing : Reads TOML or JSON files (or JSON strings from the Skipper config center) to obtain Kafka, Elasticsearch, and MySQL‑to‑Elasticsearch mapping settings.

Rule Engine : Determines which Elasticsearch index a binlog event should target, maps MySQL fields to index mappings, and defines the document ID. It also supports per‑field where clauses to filter out unwanted events.

Kafka Consumer : Connects to the Kafka cluster, authenticates via SASL, and consumes binlog messages from a specified offset. The client is a generic Golang Kafka consumer that can handle dynamic partition rebalances.

Binlog Parsing : The original parser handled raw MySQL replication packets; in this deployment the binlog payload is already JSON‑encoded by Canal, so the parser simply deserializes the JSON.

Elasticsearch Writer : Buffers transformed key‑value maps and issues bulk _bulk requests every 200 ms or when the buffer reaches a configurable size, reducing write pressure on Elasticsearch.

Custom Extensions

Upsert Support : When data for the same document arrives out of order, the bulk payload includes { "doc_as_upsert": true } to ensure the document is created if missing.

{
  "doc_as_upsert" : true
}

Field Filtering : Allows SQL‑like filters, e.g., select * from sometable where type in (1,2), so only selected rows are synchronized. select * from sometable where type in (1,2) Fast Incremental Sync : After a full‑load, the consumer seeks to the latest offsets for each partition at a given timestamp, enabling immediate consumption of recent binlog events without replaying the entire backlog.

Microservice & Config Center Integration : The sync service runs as a microservice managed by the company’s internal deployment platform. Configuration is stored in a centralized config center, replacing static TOML files and simplifying per‑environment management.

Logging and Monitoring

All binlog events, Kafka messages, bulk request payloads, and Elasticsearch responses are logged via the existing ELK stack. Two primary metrics are monitored:

Sync Latency : Time from binlog generation to successful write in Elasticsearch, typically around 1 second for order data. Alerts are triggered via ElastAlert when latency exceeds thresholds.

Heartbeat : A dedicated table is updated every minute by a cron job; the sync service verifies that the latest heartbeat record appears in Elasticsearch, alerting if the pipeline stalls.

Conclusion

The binlog‑driven, Kafka‑backed pipeline provides sub‑second synchronization of MySQL order data to Elasticsearch, eliminating the need for a heavyweight intermediate table and reducing development overhead. The open‑source foundation combined with targeted customizations offers a reusable pattern for any business requiring real‑time search indexing.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Elasticsearch kafka mysql Binlog Data synchronization

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.