Incremental Synchronization of Massive HBase Data to a Data Warehouse: Solution Overview and Performance Evaluation
This article proposes a generic, timeRange-based incremental extraction method for synchronizing tens of billions of HBase rows to a data warehouse. The method avoids full-table scans, detects schema changes automatically, and delivers significantly lower latency than Hive mapping or business-timestamp-based extraction. It has been integrated into a unified big-data platform.
1. Background
Currently, a portion of business data is stored in HBase with a volume of tens of billions of rows. This data needs to be incrementally synchronized to a data warehouse for offline analysis. The existing approach uses Hive mapping tables on HBase, which suffers from several drawbacks:
Full‑table scans on HBase cause high load and slow synchronization.
Schema changes in HBase require rebuilding Hive mapping tables, complicating permission management.
Lack of effective monitoring for HBase schema changes leads to delayed awareness of new fields.
Business systems often do not update timestamp fields, causing data loss during incremental extraction.
Unnoticed field additions force back‑tracking and re‑processing of data.
To address these issues, a generic solution for incremental HBase data ingestion into the warehouse is proposed.
2. Solution Overview
This section outlines the proposed ingestion workflow (see diagram) and compares three candidate implementation schemes.
2.1 Data Ingestion Process
2.2 Experimental Comparison of HBase Ingestion Schemes
The three schemes are analyzed as follows:
Scheme 1 – Hive Mapping Table
Simple to implement but puts direct read pressure on the business HBase database, violating the principle of minimal impact on production systems.
Increases coupling between the warehouse and business systems, breaking decoupling requirements.
Conclusion: Not suitable for the target scenario.
Scheme 2 – Incremental Extraction Based on Business Timestamp
Requires a full-table scan followed by filtering on a business timestamp field; performance degrades sharply as tables grow from tens of millions to billions of rows.
HBase does not automatically maintain a business-level timestamp field, so any row that a business system writes without refreshing that field is missed by the filter, causing data loss.
Conclusion: Risky and inefficient.
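The data-loss risk in Scheme 2 can be illustrated with a small in-memory sketch. It does not use the HBase client. The `Row` class, the `update_time` field name, and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rowkey: str
    update_time: int  # business-maintained timestamp field (epoch ms), hypothetical name
    write_ts: int     # time the row was actually last written in the store


def extract_by_business_timestamp(rows, start_ms, end_ms):
    """Scheme 2: read every row and keep those whose business field is in the window."""
    scanned = 0
    hits = []
    for row in rows:  # every row is visited, regardless of the window
        scanned += 1
        if start_ms <= row.update_time < end_ms:
            hits.append(row.rowkey)
    return hits, scanned


rows = [
    Row("user#001", update_time=1_000, write_ts=1_000),  # old and untouched
    Row("user#002", update_time=5_500, write_ts=5_500),  # updated, field refreshed
    Row("user#003", update_time=1_200, write_ts=5_800),  # rewritten, field NOT refreshed
]

hits, scanned = extract_by_business_timestamp(rows, 5_000, 6_000)

# Rows that really changed in the window but were not picked up by the filter.
missed = [r.rowkey for r in rows
          if 5_000 <= r.write_ts < 6_000 and r.rowkey not in hits]

print("extracted:", hits)
print("rows read:", scanned)
print("silently missed:", missed)
```

In this toy data set all three rows are read, one is extracted, and `user#003` is lost because its business timestamp was never refreshed.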
Scheme 3 – Incremental Extraction Using HBase timeRange
Leverages HBase’s built‑in timeRange feature (server‑side timestamps) to first filter rowkeys within the desired time window, then performs a Get operation on those rowkeys to retrieve full column data. This approach:
Avoids full‑table scans, dramatically improving speed.
Provides automatic monitoring of schema changes, reducing the chance of missing new fields.
Minimizes the need for data back‑tracking.
Conclusion: Selected as the optimal solution for massive HBase data ingestion.
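The two-phase idea behind Scheme 3 can be sketched with a toy in-memory table. This is not the HBase client API: `InMemoryTable`, its method names, and the sample rows are hypothetical stand-ins. Because the toy loops over a dictionary, it shows only the logic of the two phases and says nothing about performance.

```python
from collections import defaultdict


class InMemoryTable:
    """Toy stand-in for an HBase table: rowkey -> column -> (timestamp_ms, value)."""

    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, rowkey, column, value, ts):
        # Only the latest version of each cell is kept in this sketch.
        self._cells[rowkey][column] = (ts, value)

    def scan_time_range(self, min_ts, max_ts):
        """Return only cells written in [min_ts, max_ts), like a time-range scan."""
        result = {}
        for rowkey, columns in self._cells.items():
            changed = {c: v for c, (ts, v) in columns.items() if min_ts <= ts < max_ts}
            if changed:
                result[rowkey] = changed
        return result

    def get(self, rowkey):
        """Return the latest value of every column of one row, like a plain Get."""
        return {c: v for c, (ts, v) in self._cells.get(rowkey, {}).items()}


def incremental_extract(table, min_ts, max_ts):
    # Phase 1: use the time-range scan only to discover which rowkeys changed.
    changed_rowkeys = sorted(table.scan_time_range(min_ts, max_ts))
    # Phase 2: fetch each changed row in full so untouched columns are included.
    return {rk: table.get(rk) for rk in changed_rowkeys}


table = InMemoryTable()
table.put("order#1", "info:status", "CREATED", ts=1_000)
table.put("order#1", "info:amount", "25.00", ts=1_000)
table.put("order#2", "info:status", "CREATED", ts=1_500)
table.put("order#1", "info:status", "PAID", ts=5_200)  # the only change in the window

partial = table.scan_time_range(5_000, 6_000)
full = incremental_extract(table, 5_000, 6_000)

print("scan alone:", partial)
print("scan + get:", full)
```

The scan alone yields only the changed `info:status` cell of `order#1`, while the follow-up lookup also returns `info:amount`, which is what the warehouse needs to write a complete row.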
2.3 Solution Selection and Implementation Principle
By specifying a timeRange on the Scan, only rows with cells written in the desired window are returned, eliminating the need for a full-table scan. A subsequent Get then retrieves all columns for each rowkey, because a timeRange Scan returns only the cells whose timestamps fall within the range rather than the full row. The retrieved columns are compared with the Hive table metadata, and any column that the Hive table does not yet define triggers an alert, so schema changes are noticed promptly and incomplete synchronizations are prevented.
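The schema comparison step can be sketched as a simple set difference. The function name, the column names, and the way Hive fields are mapped to HBase qualifiers here are assumptions for illustration. The article does not describe how the platform stores or reads this metadata.

```python
def detect_new_columns(extracted_rows, hive_columns):
    """Return columns seen in the extract that the Hive table does not define."""
    seen = set()
    for columns in extracted_rows.values():
        seen.update(columns)  # collect every column name present in any row
    return sorted(seen - set(hive_columns))


# Rows as returned by the extraction step: rowkey -> {column: value}.
extracted_rows = {
    "order#1": {"info:status": "PAID", "info:amount": "25.00"},
    "order#2": {"info:status": "CREATED", "info:coupon_id": "C-77"},
}

# Hypothetical list of HBase columns already mapped in the Hive table.
hive_columns = ["info:status", "info:amount"]

new_columns = detect_new_columns(extracted_rows, hive_columns)
if new_columns:
    print("schema change alert, unmapped columns:", new_columns)
```

Running the check on every sync batch means a newly added field is reported on the first run in which it appears, rather than being discovered later and forcing a back-fill.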
3. Performance Comparison
Runtime results (seconds) are shown in the following chart (see image). The timeRange‑based scheme achieves significantly lower latency compared with the other two approaches.
4. Summary and Outlook
Efficient and accurate synchronization of business data to the data warehouse is fundamental to warehouse construction. The proposed solution resolves major pain points, ensures high‑quality data ingestion, and lays a solid foundation for future warehouse development. The approach has been integrated into a unified big‑data development platform, enabling configuration‑driven, timeRange‑based incremental sync.
Future work includes exploring Phoenix secondary indexes to further optimize synchronization performance.
vivo Internet Technology