Migrating Over 2 Billion MySQL Records to the Cloud with Kafka and BigQuery
Facing a MySQL table with more than 2 billion continuously growing rows, the team streamed the data through Kafka into Google BigQuery, then applied partitioned schemas and data cleaning to improve query performance, reduce storage costs, and avoid downtime.
Our client's MySQL table held more than 2 billion rows and was growing continuously: disk space was running out, query performance was poor, and a badly designed schema hindered data analysis.
To address these issues without costly maintenance windows, we migrated the data to the cloud, selecting Google BigQuery because the client preferred Google’s ecosystem, the data was structured and analytical, and low latency was not required.
1. Cloud solution as a remedy – After testing, we confirmed BigQuery could handle the analytical workload, but direct queries from the application would be expensive, so we limited BigQuery use to analysis and backup.
2. Streaming data to the cloud – We chose Kafka as the streaming backbone because it was already in use, allowed us to buffer data before forwarding it, and provided a fallback if BigQuery became too costly.
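The buffering idea in step 2 can be sketched as a small in-memory batcher that accumulates records and hands them to a downstream sink in fixed-size batches, so the eventual BigQuery write sees fewer, larger calls. This is a minimal illustration, not the team's actual code; the `BatchBuffer` name and the `sink` callback are assumptions.

```python
from typing import Any, Callable, List

class BatchBuffer:
    """Accumulates records and hands them to a sink in fixed-size batches,
    so the downstream write (e.g. a BigQuery insert) sees fewer, larger calls."""

    def __init__(self, batch_size: int, sink: Callable[[List[Any]], None]):
        self.batch_size = batch_size
        self.sink = sink
        self._buf: List[Any] = []

    def add(self, record: Any) -> None:
        self._buf.append(record)
        if len(self._buf) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Hand off the current batch and start a fresh one.
        if self._buf:
            self.sink(self._buf)
            self._buf = []

batches: List[List[int]] = []
buf = BatchBuffer(batch_size=3, sink=batches.append)
for row in range(7):
    buf.add(row)
buf.flush()  # drain the remainder
# batches is now [[0, 1, 2], [3, 4, 5], [6]]
```

In the real pipeline the sink would be a Kafka producer send (or, on the consumer side, a BigQuery insert), and Kafka itself provides the durable buffer between the two.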
3. Moving MySQL data to Kafka – Due to an old MySQL version, Debezium and Kafka Connect were unsuitable, so we built a simple Kafka producer that safely reads rows without losing data and a consumer that writes to BigQuery.
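One way to read a growing table "safely, without losing data" is keyset pagination: always ask for rows with an ID greater than the last one seen, ordered by ID, rather than using `OFFSET`. The sketch below simulates that loop over an in-memory table; `read_in_chunks` and `fetch_page` are illustrative names, not the team's actual producer.

```python
def read_in_chunks(fetch_page, last_id=0, chunk_size=1000):
    """Keyset pagination: repeatedly request rows with id > last_id,
    ordered by id, so no row is skipped or double-read even while new
    rows are appended. `fetch_page(after_id, limit)` stands in for a
    `SELECT ... WHERE id > %s ORDER BY id LIMIT %s` query."""
    while True:
        rows = fetch_page(last_id, chunk_size)
        if not rows:
            break
        for row in rows:
            yield row  # in the real pipeline: producer.send(topic, row)
        last_id = rows[-1]["id"]  # resume point; can be persisted for restarts

# Simulated table with sequential IDs.
table = [{"id": i, "value": f"v{i}"} for i in range(1, 11)]

def fetch_page(after_id, limit):
    return [r for r in table if r["id"] > after_id][:limit]

ids = [r["id"] for r in read_in_chunks(fetch_page, chunk_size=4)]
# ids is [1, 2, ..., 10], read in pages of 4
```

Persisting `last_id` after each chunk also gives the producer a natural restart point if it crashes mid-migration.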
4. Reclaiming storage with partitioning – We created a new schema with a sequential ID primary key and month‑based partitions, enabling us to drop old partitions and free space while continuously migrating data.
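The payoff of month-based partitions is that old data can be removed with a cheap `ALTER TABLE ... DROP PARTITION` instead of a slow `DELETE`. A minimal sketch of the retention decision, assuming partitions named like `p202301` (the naming scheme and retention rule are illustrative):

```python
from datetime import date

def partitions_to_drop(partitions, keep_months, today):
    """Return the month partitions older than the retention window,
    i.e. the ones whose storage can be reclaimed with DROP PARTITION.
    Partition names are assumed to look like 'pYYYYMM'."""
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    drop = []
    for name in partitions:
        year, month = int(name[1:5]), int(name[5:7])
        if year * 12 + (month - 1) < cutoff:
            drop.append(name)
    return drop

old = partitions_to_drop(
    ["p202301", "p202306", "p202312", "p202403"],
    keep_months=12,
    today=date(2024, 6, 1),
)
# With a 12-month window ending June 2024, only p202301 falls outside it.
```

Because each partition is already safely streamed through Kafka into BigQuery before it is dropped, freeing space this way never risks data loss.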
5. Reclaiming storage by cleaning data – A second Kafka consumer filters out unnecessary records (about 90% of the data) and writes the retained rows into a "clean" table, further reducing storage and improving query speed.
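The cleaning consumer reduces to a predicate applied to each record before the write to the clean table. The retention rule below is purely illustrative (the article does not say what made 90% of the records unnecessary); only the shape of the filter matters.

```python
def is_retained(record):
    """Illustrative retention rule: the real criteria depend on the
    client's domain. Here we keep only records flagged as billable."""
    return record.get("event_type") == "billable"

def clean(records):
    """The consumer-side filter: drop noise, keep rows destined for
    the 'clean' table."""
    return [r for r in records if is_retained(r)]

raw = [
    {"id": 1, "event_type": "heartbeat"},
    {"id": 2, "event_type": "billable"},
    {"id": 3, "event_type": "debug"},
]
kept = clean(raw)  # only the billable row survives
```

Running this as a separate consumer on the same Kafka topic means the raw stream and the clean stream can be maintained independently, and the filter can be changed and replayed without touching the source database.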
6. Summary – By streaming all data through Kafka into BigQuery, using partitioned tables and data‑cleaning pipelines, we solved the client’s performance and storage problems without any downtime.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.