
Three Billion‑Scale MySQL‑to‑HBase Synchronization Solutions and Practical Implementation

This article presents a comprehensive guide for synchronizing massive MySQL datasets to HBase, covering environment preparation, fast MySQL data loading techniques, and three practical pipelines—Sqoop, Kafka‑Thrift, and Kafka‑Flink—along with performance comparisons and optimization tips for large‑scale data processing.

Top Architect

The guide starts by preparing a pseudo‑distributed Hadoop environment on Ubuntu 16.04. It installs Hadoop 3.0.2, HBase 1.4.9, Phoenix, ZooKeeper, Kafka, Maxwell, and Flink, and configures each component (Java, users, SSH, HDFS, YARN, ZooKeeper, etc.).

It then describes fast MySQL data insertion methods: LOAD DATA INFILE, batch inserts with pymysql, and multi‑process Python loading. LOAD DATA INFILE into a MyISAM table can import billions of rows in about 1 hour, while programmatic inserts take several hours.
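The batch-insert approach can be sketched roughly as follows. This is a minimal illustration, not the article's actual loader: the table `demo_table`, its columns `id` and `name`, and the batch size are hypothetical, and `conn` is assumed to be a connection such as the one returned by `pymysql.connect(...)`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batch_insert(conn, rows: Iterable[Row], batch_size: int = 10_000) -> int:
    """Insert rows in batches and return the number of rows sent.

    `conn` is assumed to behave like a pymysql connection: it provides
    cursor() (usable as a context manager) and commit().
    """
    # Hypothetical table and columns, for illustration only.
    sql = "INSERT INTO demo_table (id, name) VALUES (%s, %s)"
    total = 0
    with conn.cursor() as cur:
        for batch in chunked(rows, batch_size):
            cur.executemany(sql, batch)  # one call per batch, not per row
            conn.commit()                # commit per batch to bound transaction size
            total += len(batch)
    return total
```

Committing once per batch rather than once per row is the main point here. The multi‑process variant mentioned above would presumably give each worker its own connection and its own slice of the input.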

Three synchronization pipelines are presented:

Sqoop: a shell loop splits the MySQL table by ID ranges and runs sqoop import with --hbase-create-table, --column-family info, and -m 4 parallelism. The process takes roughly 50 hours for the full dataset.

Kafka‑Thrift (Maxwell): enables the MySQL binlog, uses Maxwell to stream binlog events as JSON to Kafka, and consumes them with a Python Thrift client that writes to HBase. Batch insertion with this method takes about 7 hours.

Kafka‑Flink: Maxwell streams the binlog to Kafka, Flink reads the topic with FlinkKafkaConsumer011, processes records in 3‑second windows, and sinks them to HBase via a batch put. This pipeline completes in 3‑7 hours and offers the best throughput.
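To make the Sqoop range splitting concrete, here is a small sketch that generates one sqoop import command per ID range. The original uses a shell loop; this Python version only builds the argument lists. The JDBC URL, user, password file, table name, row key column, and range size are all hypothetical placeholders, while the flags are standard Sqoop 1 import options.

```python
from typing import Iterator, List, Tuple


def id_ranges(min_id: int, max_id: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) ID ranges covering [min_id, max_id]."""
    low = min_id
    while low <= max_id:
        high = min(low + step - 1, max_id)
        yield low, high
        low = high + 1


def sqoop_command(low: int, high: int) -> List[str]:
    """Build a sqoop import command for one ID range (all names are placeholders)."""
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://localhost:3306/demo_db",  # hypothetical database
        "--username", "demo_user",                           # hypothetical user
        "--password-file", "/user/demo/.mysql.password",     # hypothetical path
        "--table", "demo_table",                             # hypothetical table
        "--where", f"id >= {low} AND id <= {high}",
        "--split-by", "id",
        "--hbase-table", "demo_table",
        "--hbase-create-table",
        "--hbase-row-key", "id",
        "--column-family", "info",
        "-m", "4",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; subprocess.run(cmd, check=True)
    # could execute each one on a machine where Sqoop is installed.
    for low, high in id_ranges(1, 30_000_000, 10_000_000):
        print(" ".join(sqoop_command(low, high)))
```

Splitting by range keeps each import job small enough to retry on its own if it fails partway through.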
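Both Kafka pipelines share one step: turning Maxwell's JSON events into HBase puts and writing them in batches. The sketch below shows that step in plain Python. It is not Flink code. It groups events into 3‑second tumbling windows by the event's `ts` field (seconds since the epoch), which is an assumption, since the article's Flink job may window by processing time instead. The row key column `id` and the helper names are hypothetical; the `info` column family comes from the Sqoop description above.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Put = Tuple[bytes, Dict[bytes, bytes]]  # (row key, {b"family:qualifier": value})


def maxwell_to_put(
    message: str, family: str = "info", key_field: str = "id"
) -> Optional[Tuple[int, Put]]:
    """Convert one Maxwell JSON event into (ts, put); skip non insert/update events."""
    event = json.loads(message)
    if event.get("type") not in ("insert", "update"):
        return None
    data = event["data"]
    row_key = str(data[key_field]).encode("utf-8")
    columns = {
        f"{family}:{name}".encode("utf-8"): str(value).encode("utf-8")
        for name, value in data.items()
        if name != key_field and value is not None
    }
    return event["ts"], (row_key, columns)


def group_by_window(
    messages: Iterable[str], window_seconds: int = 3
) -> Dict[int, List[Put]]:
    """Group puts into tumbling windows keyed by window start time (seconds)."""
    windows: Dict[int, List[Put]] = defaultdict(list)
    for message in messages:
        converted = maxwell_to_put(message)
        if converted is None:
            continue
        ts, put = converted
        window_start = ts - ts % window_seconds
        windows[window_start].append(put)
    return dict(windows)
```

Each window's list can then be written in a single batch. With a Thrift‑based client such as happybase (an assumption, since the article only says "Python Thrift client"), that would look roughly like `with table.batch() as b: b.put(row_key, columns)` for every put in the window.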

Performance tests compare raw HBase scans (≈ 1 hour), Phoenix queries (≈ 33 minutes), and coprocessor scans (≈ 31 minutes). The article concludes with practical recommendations: split large data, use batch inserts, prefer Flink for stability and speed, and tune Phoenix and HBase settings for large‑scale queries.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
