
Incremental Synchronization of Massive HBase Data to a Data Warehouse: Solution Overview and Performance Evaluation

This article proposes a generic, timeRange‑based incremental extraction method for synchronizing HBase tables holding tens of billions of rows to a data warehouse. The method avoids full‑table scans, detects schema changes automatically, and runs significantly faster than the Hive mapping and business‑timestamp approaches. It has been integrated into a unified big‑data platform.

vivo Internet Technology

Mar 9, 2022


1. Background

Part of the business data is stored in HBase, at a volume of tens of billions of rows. This data needs to be incrementally synchronized to a data warehouse for offline analysis. The existing approach, which maps Hive tables onto HBase, has several drawbacks:

Full‑table scans on HBase cause high load and slow synchronization.

Schema changes in HBase require rebuilding Hive mapping tables, complicating permission management.

Lack of effective monitoring for HBase schema changes leads to delayed awareness of new fields.

Business systems often do not update timestamp fields, causing data loss during incremental extraction.

Unnoticed field additions force back‑tracking and re‑processing of data.

To address these issues, a generic solution for incremental HBase data ingestion into the warehouse is proposed.

2. Solution Overview

This section outlines the proposed ingestion workflow (shown as a diagram in the original article) and compares three candidate implementation schemes.

2.1 Data Ingestion Process

2.2 Experimental Comparison of HBase Ingestion Schemes

The three schemes are analyzed as follows:

Scheme 1 – Hive Mapping Table

Simple to implement but puts direct read pressure on the business HBase database, violating the principle of minimal impact on production systems.

Increases coupling between the warehouse and business systems, breaking decoupling requirements.

Conclusion: Not suitable for the target scenario.

Scheme 2 – Incremental Extraction Based on Business Timestamp

Requires a full table scan and timestamp filtering; performance degrades sharply with tens of millions to billions of rows.

HBase does not maintain a business timestamp column automatically. If the business system omits updating that field on a write, the change falls outside the extraction filter and the data is lost.

Conclusion: Risky and inefficient.
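The data-loss risk can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not HBase client code: the `rows` dictionary, the `update_time` column, and both helper functions are hypothetical names chosen for the example. Each cell carries a write timestamp that stands in for the one HBase records on every write.

```python
# In-memory model: each cell is (value, write_ts). write_ts stands in for the
# timestamp HBase records on every write; update_time is a hypothetical
# application-maintained column.
rows = {
    "user#1": {"name": ("Ann", 100), "update_time": (100, 100)},
    "user#2": {"name": ("Bob", 100), "update_time": (100, 100)},
}

def put(row, column, value, write_ts):
    rows[row][column] = (value, write_ts)

# At t=250 the application changes user#2's name but forgets update_time.
put("user#2", "name", "Robert", 250)
# At t=260 user#1 is updated correctly, including update_time.
put("user#1", "name", "Anna", 260)
put("user#1", "update_time", 260, 260)

def changed_by_business_ts(start, end):
    """Filter on the value of the application-maintained update_time column."""
    return sorted(r for r, cols in rows.items()
                  if start <= cols["update_time"][0] < end)

def changed_by_cell_ts(start, end):
    """Filter on the write timestamp of any cell in the row."""
    return sorted(r for r, cols in rows.items()
                  if any(start <= ts < end for _, ts in cols.values()))

business = changed_by_business_ts(200, 300)
cell = changed_by_cell_ts(200, 300)
print(business)  # ['user#1']
print(cell)      # ['user#1', 'user#2']
```

In this model the business-timestamp filter misses `user#2`, whose name changed without `update_time` being touched, while the filter on cell write timestamps picks it up.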

Scheme 3 – Incremental Extraction Using HBase timeRange

Leverages HBase’s built‑in timeRange feature (server‑side timestamps) to first filter rowkeys within the desired time window, then performs a Get operation on those rowkeys to retrieve full column data. This approach:

Avoids full‑table scans, dramatically improving speed.

Provides automatic monitoring of schema changes, reducing the chance of missing new fields.

Minimizes the need for data back‑tracking.

Conclusion: Selected as the optimal solution for massive HBase data ingestion.
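The two-phase idea behind Scheme 3 can be sketched with an in-memory stand-in for a table. `FakeTable` and `extract_increment` are hypothetical names for illustration only. In the HBase Java client the corresponding calls would be `Scan.setTimeRange(minStamp, maxStamp)`, whose range is inclusive of `minStamp` and exclusive of `maxStamp`, followed by `Table.get`.

```python
from collections import defaultdict


class FakeTable:
    """Tiny in-memory stand-in for an HBase table (hypothetical, for illustration)."""

    def __init__(self):
        # rowkey -> column -> (value, write_ts); only the latest version is kept
        self._rows = defaultdict(dict)

    def put(self, rowkey, column, value, ts):
        self._rows[rowkey][column] = (value, ts)

    def scan_time_range(self, min_ts, max_ts):
        """Yield (rowkey, cells) with only the cells written in [min_ts, max_ts)."""
        for rowkey in sorted(self._rows):
            hit = {c: v for c, (v, ts) in self._rows[rowkey].items()
                   if min_ts <= ts < max_ts}
            if hit:
                yield rowkey, hit

    def get(self, rowkey):
        """Return every column of the row, regardless of write time."""
        return {c: v for c, (v, _) in self._rows[rowkey].items()}


def extract_increment(table, min_ts, max_ts):
    # Phase 1: the time-range scan is used only to collect changed rowkeys.
    rowkeys = [rk for rk, _ in table.scan_time_range(min_ts, max_ts)]
    # Phase 2: a get per rowkey returns the complete, current row.
    return {rk: table.get(rk) for rk in rowkeys}


table = FakeTable()
table.put("order#1", "status", "NEW", ts=100)
table.put("order#1", "amount", 30, ts=100)
table.put("order#2", "status", "NEW", ts=120)
table.put("order#1", "status", "PAID", ts=250)  # only one column changes

partial = dict(table.scan_time_range(200, 300))
full = extract_increment(table, 200, 300)
print(partial)  # {'order#1': {'status': 'PAID'}}
print(full)     # {'order#1': {'status': 'PAID', 'amount': 30}}
```

The time-range scan alone returns only the changed `status` cell. The follow-up get is what brings back the untouched `amount` column, so the warehouse receives a complete row.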

2.3 Solution Selection and Implementation Principle

Specifying a timeRange on the Scan returns only the rowkeys that changed within the window, so the job no longer has to read and filter the whole table. Because such a Scan returns only the cells written within the time range, a subsequent Get on each rowkey retrieves the full set of columns. The retrieved columns are then compared with the Hive table metadata, and any mismatch raises an alert so that schema changes are noticed before they cause incomplete synchronization.
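The schema comparison step might look like the following minimal sketch. It assumes the fetched rows are available as plain dictionaries and that the warehouse column list has already been read from the Hive metastore. `detect_new_columns`, the sample column names, and the alert message are all hypothetical.

```python
def detect_new_columns(rows, hive_columns):
    """Return HBase column names seen in `rows` that the Hive table lacks."""
    seen = set()
    for cells in rows.values():
        seen.update(cells)  # adds the column names (dict keys)
    return sorted(seen - set(hive_columns))


# Rows as returned by the Get phase: rowkey -> {column: value}.
fetched = {
    "order#1": {"info:status": "PAID", "info:amount": 30},
    "order#2": {"info:status": "NEW", "info:coupon_id": "C9"},
}
# Column list as it might be read from the Hive metastore (hypothetical).
hive_columns = ["info:status", "info:amount"]

missing = detect_new_columns(fetched, hive_columns)
if missing:
    print("ALERT: columns not in warehouse table:", missing)
```

Running the check on every incremental batch means a newly added field is reported in the first batch that contains it, rather than being discovered later and requiring back‑tracking.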

3. Performance Comparison

The original article reports runtime results (in seconds) for the three schemes in a chart. The timeRange‑based scheme is significantly faster than the other two approaches.

4. Summary and Outlook

Efficient and accurate synchronization of business data to the data warehouse is fundamental to warehouse construction. The proposed solution resolves major pain points, ensures high‑quality data ingestion, and lays a solid foundation for future warehouse development. The approach has been integrated into a unified big‑data development platform, enabling configuration‑driven, timeRange‑based incremental sync.

Future work includes exploring Phoenix secondary indexes to further optimize synchronization performance.
