How We Migrated Over 2 Billion MySQL Records to Google BigQuery Using Kafka
This case study explains how a client with a 2‑billion‑row MySQL table avoided disk exhaustion and poor query performance by streaming data through Kafka to Google BigQuery, partitioning the data, and cleaning it to reclaim storage—all without downtime.
Our client faced a MySQL problem: a massive table of over 2 billion rows that kept growing, threatening disk exhaustion and degrading query performance, while its poorly designed schema made data analysis difficult.
They needed a solution that would address these issues without costly maintenance windows that could take the application offline.
After evaluating several options, we chose to migrate the data to the cloud, specifically Google BigQuery. The client preferred Google Cloud, the data was structured and analytical, and low latency was not required, making BigQuery a perfect fit. However, because large‑scale queries on BigQuery can be expensive, we decided to use it only for analysis and backup, not as the primary query engine.
To move data to the cloud we used Kafka, a tool the project team already knew. Kafka let us push all the data into a stream, retain it briefly, and forward it to the destination, keeping load on the MySQL cluster low and providing a fallback if BigQuery proved too costly.
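As an illustration, the ingestion topic can be created with a short retention window so Kafka acts as a buffer rather than long-term storage. This is a minimal sketch using kafka-python; the topic name, partition count, replication factor, and the three-day retention are assumptions, not the project's actual settings.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="kafka:9092")

# Hypothetical topic for the raw MySQL rows; retention is kept short because
# the data only needs to survive long enough to reach BigQuery.
topic = NewTopic(
    name="mysql.events.raw",
    num_partitions=12,
    replication_factor=3,
    topic_configs={
        "retention.ms": str(3 * 24 * 60 * 60 * 1000),  # ~3 days of buffer
        "cleanup.policy": "delete",
    },
)
admin.create_topics([topic])
```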
Streaming data from MySQL to Kafka presented challenges: Debezium and Kafka Connect could not be used because the MySQL version was too old and the table lacked an auto-increment column, so relying on them risked data loss. We therefore built a simple Kafka producer that queried the MySQL data safely, without loss, and a consumer that forwarded the data to BigQuery.
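The sketch below shows the general shape of such a pipeline in Python, assuming kafka-python, mysql-connector-python, and google-cloud-bigquery, and assuming an indexed created_at timestamp column that lets the table be walked in bounded date ranges. All table, topic, and column names are hypothetical; the real producer may have paged through the data differently.

```python
import json
from datetime import date, timedelta

import mysql.connector
from google.cloud import bigquery
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "mysql.events.raw"  # hypothetical topic name

# --- Producer: walk the table in date ranges, since there is no auto-increment id ---
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    acks="all",  # wait for in-sync replicas so nothing is silently dropped
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)
conn = mysql.connector.connect(host="db", user="migrator", password="***", database="app")
cur = conn.cursor(dictionary=True)

day, end = date(2015, 1, 1), date.today()  # assumed span of the data set
while day <= end:
    nxt = day + timedelta(days=1)
    # Bounded range scan on an indexed timestamp column keeps load on MySQL low.
    cur.execute(
        "SELECT * FROM events WHERE created_at >= %s AND created_at < %s",
        (day, nxt),
    )
    for row in cur:
        producer.send(TOPIC, row)
    producer.flush()  # confirm the whole chunk is in Kafka before advancing
    day = nxt

# --- Consumer: batch messages into BigQuery ---
bq = bigquery.Client()
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="kafka:9092",
    group_id="bigquery-sink",
    auto_offset_reset="earliest",
    enable_auto_commit=False,  # commit offsets only after BigQuery accepts the batch
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
batch = []
for msg in consumer:
    batch.append(msg.value)
    if len(batch) >= 500:
        errors = bq.insert_rows_json("my-project.archive.events_raw", batch)
        if not errors:
            consumer.commit()
            batch.clear()
```

Flushing the producer after each date range and committing consumer offsets only after a successful BigQuery insert are what make the pipeline resumable without losing rows if either side crashes mid-migration.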
To reclaim storage we designed a new schema with a serial ID as the primary key and partitioned the table by month. Backing up and then dropping old partitions freed space quickly. After migrating all records, we deployed a new version of the application that writes to the new partitioned table, and we removed the old table, keeping enough free space available throughout the migration.
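A monthly RANGE-partitioned layout could look roughly like the following sketch (hypothetical table and column names). Note that MySQL requires the partitioning column to appear in every unique key, so in practice the primary key becomes composite rather than the serial ID alone.

```python
import mysql.connector

conn = mysql.connector.connect(host="db", user="migrator", password="***", database="app")
cur = conn.cursor()

# New table: serial id plus monthly RANGE partitions on created_at.
# MySQL insists that every unique key (including the primary key) contains
# the partitioning column, hence the composite PRIMARY KEY (id, created_at).
cur.execute("""
    CREATE TABLE events_partitioned (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created_at DATETIME        NOT NULL,
        payload    TEXT,
        PRIMARY KEY (id, created_at)
    )
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
        PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
        PARTITION p_future VALUES LESS THAN MAXVALUE
    )
""")

# Once a month's data is confirmed in BigQuery, dropping its partition frees
# the disk space far faster than a row-by-row DELETE would.
cur.execute("ALTER TABLE events_partitioned DROP PARTITION p2023_01")
```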
Once data was in BigQuery, we analyzed it and discovered that about 90% of the records were unnecessary. We created a new Kafka consumer that filtered out unwanted records and inserted the needed ones into a "cleaned" table.
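A sketch of such a cleaning consumer, assuming the cleaned table lives in MySQL alongside the partitioned one and using a hypothetical is_needed predicate in place of the real business rules:

```python
import json

import mysql.connector
from kafka import KafkaConsumer

def is_needed(row: dict) -> bool:
    """Hypothetical filter; the real rules dropped roughly 90% of rows."""
    return row.get("event_type") not in ("heartbeat", "debug")

conn = mysql.connector.connect(host="db", user="migrator", password="***", database="app")
cur = conn.cursor()

consumer = KafkaConsumer(
    "mysql.events.raw",
    bootstrap_servers="kafka:9092",
    group_id="cleaned-table-sink",
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    row = msg.value
    if is_needed(row):
        cur.execute(
            "INSERT INTO events_cleaned (id, created_at, payload) VALUES (%s, %s, %s)",
            (row["id"], row["created_at"], row["payload"]),
        )
        conn.commit()
    consumer.commit()  # advance the offset whether the row was kept or skipped
```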
After cleaning, we updated the application to read from the cleaned table while continuing to stream data into the partitioned table. This solved the client's performance and storage problems: partitioning eliminated the storage issues, and cleaning and indexing improved query performance. BigQuery is used only for specific analytical queries, while all other queries stay on MySQL, keeping costs low. All of this was achieved without any downtime.
In summary, by streaming data through Kafka to BigQuery and employing partitioning and cleaning strategies, we provided a robust, low‑cost solution that addressed the client’s challenges without impacting availability.
