Synchronizing Billion-Row MySQL Data to HBase: Three Practical Schemes and Implementation Guide

This guide details three practical methods for syncing massive MySQL datasets to HBase (Sqoop, Kafka‑Thrift, and Kafka‑Flink pipelines), covering environment setup, configuration, code examples, performance comparisons, and optimization tips for large‑scale data ingestion and querying.

Architecture Digest

This article introduces three synchronization schemes for moving massive MySQL data to HBase, focusing on practical implementation and performance evaluation.

1. Environment Preparation – The guide starts by setting up a pseudo‑distributed Hadoop cluster on Ubuntu 16.04, then installs Java, Hadoop, HBase, Phoenix, ZooKeeper, Kafka, Maxwell, and Flink, configuring each component with detailed commands and configuration files.

2. MySQL Data Insertion – It compares data loading methods: LOAD DATA INFILE, Python batch insertion using pymysql with executemany, and multi‑process Python insertion after splitting the source file.

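with open('/home/light/mysql/gps1.txt', 'r') as fp:
    data_list = []
    for line in fp:
        data_list.append(line.rstrip('\n').split(','))  # comma delimiter is an assumption
        if len(data_list) >= 70000:  # flush a full batch of 70,000 rows
            self.cur.executemany(sql, data_list)
            self.conn.commit()
            data_list.clear()
    if data_list:  # flush the final partial batch
        self.cur.executemany(sql, data_list)
        self.conn.commit()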
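
The multi‑process variant depends on splitting the source file first. The summary does not show how the split is done, so the following is only a minimal sketch under assumptions: the function name `split_file`, the chunk file names, and the round‑robin strategy are all illustrative, not the article's own code.

```python
import os


def split_file(path, parts, out_dir):
    """Split a text file into `parts` chunk files, keeping lines intact.

    Lines are dealt out round-robin, so the input is streamed once and
    never held in memory. Returns the list of chunk file paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    # Chunk file names are hypothetical; any naming scheme would do.
    chunk_paths = [os.path.join(out_dir, f"chunk_{i}.txt") for i in range(parts)]
    outputs = [open(p, "w") for p in chunk_paths]
    try:
        with open(path, "r") as src:
            for index, line in enumerate(src):
                outputs[index % parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return chunk_paths
```

Each chunk path could then be handed to its own worker, for example with `multiprocessing.Pool(parts).map(load_chunk, chunk_paths)`, where `load_chunk` is a hypothetical function that opens its own MySQL connection and runs the batch loop above.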

3. Synchronization Schemes

Sqoop – Uses sqoop import, split by the ID column (--split-by), to import data into HBase; this requires virtual memory adjustments.
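
To make the shape of such an import concrete, here is a hedged sketch that assembles a `sqoop import` command line for an HBase target. The flags are standard Sqoop options, but the JDBC URL, user, table, column family, and row key below are placeholder values, not those used in the article.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, hbase_table,
                       column_family, row_key, split_by, mappers=4):
    """Return the argv list for a `sqoop import` that writes into HBase."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                           # prompt for the password on the console
        "--table", table,               # source MySQL table
        "--hbase-table", hbase_table,   # target HBase table
        "--column-family", column_family,
        "--hbase-row-key", row_key,     # source column used as the HBase row key
        "--hbase-create-table",         # create the HBase table if it is missing
        "--split-by", split_by,         # column used to partition work across mappers
        "-m", str(mappers),
    ]


# All values below are hypothetical placeholders.
command = build_sqoop_import(
    "jdbc:mysql://localhost:3306/demo", "demo_user",
    "gps", "gps_hbase", "info", "id", "id",
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.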

Kafka‑Thrift – Enables MySQL binlog, uses Maxwell to capture changes, publishes JSON to Kafka, and consumes with a Thrift client that writes to HBase.
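
The consumer side of this pipeline can be sketched as follows. Maxwell emits one JSON object per row change, with the row under `data` and the operation under `type`. The sketch converts such a message into a row key plus column map and groups the results into batches. The column family `info`, the key field `id`, and the helper names are assumptions for illustration; the article's actual Thrift client code is not shown in this summary.

```python
import json


def maxwell_to_put(message, column_family="info", key_field="id"):
    """Convert one Maxwell JSON message into (row_key, columns) for HBase.

    Returns None for events that are neither inserts nor updates.
    """
    event = json.loads(message)
    if event.get("type") not in ("insert", "update"):
        return None
    row = event["data"]
    row_key = str(row[key_field]).encode("utf-8")
    # HBase stores bytes, so column names and values are both encoded.
    columns = {
        f"{column_family}:{name}".encode("utf-8"): str(value).encode("utf-8")
        for name, value in row.items()
        if name != key_field and value is not None
    }
    return row_key, columns


def chunk_puts(puts, batch_size):
    """Yield lists of at most `batch_size` puts, including the final partial one."""
    batch = []
    for put in puts:
        batch.append(put)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

The `(row_key, columns)` pairs have the bytes‑keyed shape that a Thrift‑based Python client such as happybase accepts in `Table.put` (the choice of happybase is an assumption). Sending each batch in one round trip rather than row by row is the difference the guide measures between the single‑row and batch timings.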

Kafka‑Flink – Consumes Kafka topics with Flink, applies windowed processing, and sinks data into HBase using a custom Flink sink.
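
The summary does not say which window type the Flink job uses. Assuming fixed‑size (tumbling) time windows, the grouping idea can be illustrated in plain Python; this is not Flink API code, and the function names are hypothetical.

```python
from collections import defaultdict


def window_start(timestamp_ms, size_ms):
    """Return the start of the tumbling window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


def group_into_windows(events, size_ms):
    """Group (timestamp_ms, record) pairs by tumbling window.

    Returns a dict mapping window start to the records in that window,
    ordered by window start.
    """
    windows = defaultdict(list)
    for timestamp_ms, record in events:
        windows[window_start(timestamp_ms, size_ms)].append(record)
    return dict(sorted(windows.items()))
```

In the pipeline described above, each window's records would be written to HBase together by the sink, so the window size trades latency against the number of writes.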

Each scheme includes full command‑line examples, configuration snippets (e.g., my.cnf for binlog, hbase-site.xml for HBase), and scripts for starting/stopping services.

Performance Comparison – The guide presents timing results: Sqoop takes ~50 h, Kafka‑Thrift with single‑row writes ~50 h, Kafka‑Thrift with batch writes ~7 h, and Flink ~3‑7 h. It also compares HBase native scans, Phoenix queries, and coprocessor scans, showing that Phoenix queries and coprocessor scans are significantly faster than native scans.

Optimization Tips – The guide emphasizes data partitioning, batch inserts, disabling virtual memory checks for Sqoop, proper ZooKeeper configuration, and tuning Flink window sizes.

Overall, the article serves as a step‑by‑step tutorial for engineers needing to migrate and synchronize large‑scale relational data into HBase using various open‑source tools.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
