Big Data 21 min read

How Vimur Leverages Kafka for Real‑Time Data Migration and Synchronization

This article details how Vimur, a Kafka‑based real‑time data pipeline, addresses the challenges of service splitting, database sharding, data migration, and synchronization by using CDC, a unified Avro format, and a change distribution platform to support search indexing, cache refresh, and reactive architectures.

CoolHome R&D Department

Sep 18, 2017

How Vimur Leverages Kafka for Real‑Time Data Migration and Synchronization

Facing explosive business data growth and rapid team expansion, we transitioned from a monolithic Java WebApp with a single MySQL instance to a service‑oriented architecture (SOA) with database sharding, encountering new challenges in data migration and synchronization.

Vimur, built on Apache Kafka, provides a real‑time data pipeline that solves these problems and supports use cases such as search index construction, cache refresh, and reactive architecture.

Origin

Before the transformation we used a typical monolith: a Java WebApp behind Nginx with session‑sticky load balancing, and a MySQL master with several slaves for read/write separation. The goal was to split services and shard databases with minimal impact on ongoing development.

We adopted a gradual, progressive split approach:

Extract sub‑services for each domain and expose database operations via RPC.

Replace direct table accesses in other services with RPC calls.

Split the domain tables and use data synchronization to keep old and new tables consistent.

Migrate database operations in the sub‑service to the new tables in batches.

After full migration, cut the sync and complete the split.

This method enables smooth migration but raises three key issues: data consistency between old and new tables, support for heterogeneous migration, and real‑time synchronization.

Dual write writes to both old and new tables, but introduces costly distributed transaction problems. Change Data Capture (CDC) reads transaction logs to capture changes, avoiding consistency issues but lacking a unified protocol across data sources.

We chose CDC for its higher consistency guarantees and the availability of open‑source components that mitigate protocol heterogeneity.

Architecture Design

A single CDC module is insufficient because downstream consumers may not be ready. Therefore we introduced a change distribution platform that provides buffering, supports multiple consumers at different speeds, and decouples CDC from consumers.

We also defined a unified data format for efficient and safe communication across components.

Vimur is a real‑time data pipeline that captures changes via CDC, publishes them in a unified format to the distribution platform, and allows clients to consume real‑time changes.

Open‑Source Comparison

We evaluated several open‑source solutions:

Databus (LinkedIn)

Yelp’s data pipeline

Otter (Alibaba)

Debezium (Red Hat)

All follow the same basic idea: capture database changes and publish them to an intermediate store for downstream consumption.

Databus

Databus’s MySQL CDC module is immature and relies on an external component; its custom relay makes integration harder.

Otter

Otter is stable but cannot easily aggregate multiple tables into a new sharded table; a workaround is to write CDC output to a message queue and implement aggregation downstream.

Yelp’s data pipeline

Uses MySQL‑Streamer to write binlog changes to Kafka and provides a schema registry and Python client, but its Python stack mismatches our Java ecosystem.

Debezium

Supports MySQL, MongoDB, PostgreSQL (with Oracle and Cassandra in development).

Snapshot mode imports existing table data as insert events.

Leverages Kafka log compaction for permanent storage.

Built on Kafka Connect for high availability.

Active community with Red Hat engineers.

We selected Debezium + Kafka as the core components and Apache Avro as the unified data format.

CDC Module

Two typical implementation approaches for change capture:

Incremental queries based on auto‑increment columns or last‑modified timestamps.

Real‑time subscription to transaction logs or slave replication.

Example of incremental query: SELECT * FROM foo WHERE lastmodified > ${last_query_time} The first method suffers from latency, extra load, and cannot capture deletes without a flag column.

The second method, used by the open‑source solutions, reads MySQL binlog. By replacing the slave with a CDC module that mimics the MySQL slave protocol, we receive binlog pushes.

CDC modules persist consumed binlog positions (GTID preferred) to avoid data loss on restart.

Debezium excels at capturing schema changes from DDL statements and storing schema snapshots in a backup Kafka topic, enabling seamless recovery.

Its snapshot feature also captures existing data as insert events, allowing a single client to handle both full data load and incremental changes.

Change Distribution Platform

We evaluated NoSQL stores (e.g., Cassandra) and message queues (Kafka, RabbitMQ). MQs provide better throughput and fit our stateless consumer model. Kafka’s log compaction lets us retain the latest state per key while discarding superseded updates.

Example of log compaction: four change messages are produced for a table; after compaction only the latest records for each primary key remain.

Ensuring ordering per row is sufficient; using the primary key as the Kafka message key guarantees that all changes for a row land in the same partition and are processed in order.

Unified Data Format

We chose Apache Avro for its schema‑driven contract, which mitigates breaking changes compared to raw JSON. Avro schemas are defined in JSON and support evolution rules, allowing producers and consumers to evolve independently as long as they follow compatibility guidelines.

In our scenario, database schema changes trigger corresponding Avro schema updates, and downstream consumers rely on the agreed evolution rules to remain stable.

Application Summary

The topology centers on Kafka as the change distribution platform, with MirrorMaker handling cross‑data‑center replication and Kafka Connect running Debezium tasks for high availability.

Vimur’s typical data synchronization flow is illustrated below.

A typical data migration involves service splitting and sharding, as shown.

Two approaches to enrich new tables with additional columns are:

Back‑query the source database (simple but adds load).

Stream Join using frameworks like Flink or Kafka Streams (more complex but offloads load).

Vimur also solves cross‑database joins by aggregating required tables into a search engine or NoSQL store as documents.

Beyond migration, Vimur powers real‑time search index building, business event notifications, cache refresh, and reactive architecture scenarios. If you face complex data‑layer synchronization, migration, or indexing challenges, consider a CDC‑based real‑time data pipeline.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Migration Microservices kafka CDC Debezium Real-time Data Pipeline Avro

Written by

CoolHome R&D Department

Official account of CoolHome R&D Department, sharing technology and innovation.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.