Big Data · 7 min read

Migrating Over 2 Billion MySQL Records to Google BigQuery Using Kafka

This article details a real‑world solution for migrating more than two billion MySQL records to Google BigQuery by streaming data through Kafka, employing partitioned tables, data filtering, and incremental migration to avoid downtime and reduce storage costs.

Top Architect

Introduction: a client had a MySQL table with over 2 billion rows, causing disk exhaustion, poor query performance, and schema maintenance problems. They needed a migration path that avoided long downtime.

Cloud solution: after evaluating options, they chose Google BigQuery because the data is structured, analytical, and does not require low latency. Direct queries to BigQuery are costly, so they used it only for analytics and backup.

Data streaming: they used Kafka, already in the project, to stream data from MySQL to the cloud, allowing buffering and low impact on MySQL. They built a simple Kafka producer to read MySQL and a consumer to write to BigQuery.
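The producer/consumer pair can be sketched as follows. This is a hypothetical, simplified illustration of the flow, not the client's actual code: a real producer would read MySQL with mysql-connector-python and publish with kafka-python, and the consumer would write via the google-cloud-bigquery client; here an in-memory list stands in for both the table and the Kafka topic.

```python
import json

BATCH_SIZE = 1000  # rows per fetch, to keep the load on MySQL low

def fetch_batch(rows, offset, limit=BATCH_SIZE):
    """Stand-in for a paginated SELECT against the source MySQL table."""
    return rows[offset:offset + limit]

def produce(rows, topic):
    """Producer side: serialize each row to JSON and publish it."""
    offset = 0
    while True:
        batch = fetch_batch(rows, offset)
        if not batch:
            break
        for row in batch:
            topic.append(json.dumps(row).encode("utf-8"))
        offset += len(batch)

def consume(topic):
    """Consumer side: decode messages into rows ready for a BigQuery insert."""
    return [json.loads(msg) for msg in topic]

# Simulated run: 2,500 rows flow through the "topic" unchanged.
rows = [{"id": i, "payload": f"record-{i}"} for i in range(2500)]
topic = []
produce(rows, topic)
result = consume(topic)
```

Because Kafka buffers the stream, the producer can throttle its reads while the consumer drains the topic at whatever rate BigQuery ingestion allows.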

Challenges with Debezium/Kafka Connect: old MySQL version and missing auto‑increment column prevented using Debezium or Kafka Connect, so a custom producer/consumer was implemented.

Partitioned schema: they created a new BigQuery table partitioned by month with a sequential ID primary key. Data was loaded into partitions, old partitions were backed up and dropped to reclaim space.
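A monthly-partitioned table of this shape could be declared with DDL along these lines. The project, dataset, table, and column names below are illustrative assumptions, not taken from the article; only the monthly partitioning and sequential ID reflect the described design.

```python
# Hypothetical BigQuery DDL for the destination table. BigQuery supports
# month-granularity partitioning via TIMESTAMP_TRUNC in PARTITION BY.
DDL = """
CREATE TABLE `my_project.archive.events_partitioned` (
  id INT64 NOT NULL,             -- sequential identifier used as the key
  created_at TIMESTAMP NOT NULL, -- drives the monthly partitioning
  payload STRING
)
PARTITION BY TIMESTAMP_TRUNC(created_at, MONTH)
"""
```

Partition pruning means analytical queries scan only the months they touch, and old partitions can be exported for backup and then dropped to reclaim space, as the article describes.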

Data cleaning: after loading, a second Kafka consumer filtered out unnecessary records (roughly 90% of the data) and inserted the remainder into a cleaned table, reducing storage and improving query performance.
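The second consumer stage can be sketched as a decode-filter-collect loop. The article does not say what made a record unnecessary, so the `status` field and the `is_relevant` rule below are hypothetical placeholders for whatever criteria the client actually applied.

```python
import json

def is_relevant(record):
    """Hypothetical keep/drop rule; the real criteria are not given."""
    return record.get("status") == "active"

def clean(messages):
    """Second consumer: decode Kafka messages, drop unneeded records,
    and collect the survivors for insertion into the cleaned table."""
    kept = []
    for msg in messages:
        record = json.loads(msg)
        if is_relevant(record):
            kept.append(record)
    return kept

# Simulated topic: two of three records pass the filter.
messages = [
    json.dumps({"id": 1, "status": "active"}),
    json.dumps({"id": 2, "status": "stale"}),
    json.dumps({"id": 3, "status": "active"}),
]
cleaned = clean(messages)
```

Running the filter as a separate consumer on the same topic keeps the raw table as a complete backup while the cleaned table serves day-to-day analytics.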

Result: the pipeline moved all data to BigQuery without downtime, provided fast analytical queries, reclaimed storage through partition pruning, and kept operational MySQL queries separate to control costs.

Conclusion: using Kafka as a buffer and BigQuery as an analytical store enabled a seamless migration of billions of rows while maintaining service availability.

Tags: data migration, big data, BigQuery
Written by Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large-scale distributed, and high-availability architectures, as well as architecture evolution with internet technologies. We welcome idea-driven, sharing-oriented architects to exchange and learn together.
