
Exploring Real-Time Data Warehouse Practices Based on HBase

The article details the evolution from an offline to a real‑time HBase data warehouse, covering business scenarios, the use of Maxwell for MySQL‑to‑Kafka ingestion, Phoenix for SQL access, CDH cluster tuning, monitoring, and several production case studies.

DataFunTalk

The piece, originally presented by Shanghai Jiuyue big‑data engineer Wu Jikeng at the fourth China HBase MeetUp, walks through six major topics: the first‑generation offline warehouse, the second‑generation real‑time warehouse, business scenarios and the development workflow, CDH cluster configuration and tuning, monitoring, and two production case studies.

The offline warehouse, launched in March of the previous year, relied on OMS/WMS data, Sqoop (later replaced by DataX) for extraction, and Hive for T+1 batch loads; analytical queries were served via Kylin, Superset, and Saiku, but same‑day data could not be served in real time.

For the real‑time warehouse, Maxwell was chosen to capture MySQL binlog changes, publish them as JSON to Kafka, and ultimately land the data in HBase. Maxwell was selected for its bootstrap capability (initial full‑table loads), binlog‑position checkpointing for recovery, ordered partitioning within Kafka, and its ability to alert on schema changes.
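To make the pipeline concrete, here is a minimal Python sketch of consuming one Maxwell change event. Maxwell emits one JSON object per binlog change, carrying the source database/table, the change type, and the row image under "data"; the row‑key format below (table name plus primary key) is an illustrative choice, not the article's exact scheme.

```python
import json

def parse_maxwell_event(raw: str) -> tuple:
    """Parse a Maxwell change event and derive an HBase-style row key."""
    event = json.loads(raw)
    op = event["type"]                          # insert / update / delete
    table = f'{event["database"]}.{event["table"]}'
    row = event["data"]                         # row image after the change
    row_key = f'{event["table"]}:{row["id"]}'   # illustrative key scheme
    return op, table, row_key, row

# A Maxwell-style event as it would arrive from a Kafka topic:
sample = ('{"database":"oms","table":"orders","type":"insert",'
          '"ts":1552212840,"data":{"id":42,"status":"PAID"}}')
op, table, row_key, row = parse_maxwell_event(sample)
```

A downstream consumer would route on `op` and `table`, then upsert `row` into the corresponding HBase table.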

HBase was preferred over alternatives like Kudu due to its distributed scalability, random read/write performance, and support for millions of columns. Phoenix was added to provide SQL access, secondary indexes, and Spark integration, while a salted table design mitigated hotspot issues.
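The salting idea can be sketched as follows: prepend a one‑byte bucket id derived from a hash of the row key, so that monotonically increasing keys (order ids, timestamps) spread across region servers instead of hammering one region. Phoenix computes the salt byte internally from its own hash of the key; the simple byte‑sum hash below is purely for illustration.

```python
def salt_bucket(row_key: bytes, buckets: int) -> int:
    """Derive a salt bucket from a hash of the row key (illustrative hash)."""
    h = 0
    for b in row_key:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h % buckets

def salted_key(row_key: bytes, buckets: int) -> bytes:
    # Prepend the one-byte bucket id, as a salted Phoenix table does.
    return bytes([salt_bucket(row_key, buckets)]) + row_key

# Sequential order ids no longer land in the same region:
keys = [salted_key(f"order-{i:08d}".encode(), 8) for i in range(4)]
```

In Phoenix itself this is declared at table creation time with the `SALT_BUCKETS` table option rather than computed by hand.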

The CDH/HBase stack evolved through several versions (phoenix‑for‑cloudera‑4.9, HBase‑1.2, CDH‑5.9/5.12), with custom builds to enable Phoenix on CDH and to fix bugs such as SYSTEM.MUTEX errors and timezone handling, by setting DEFAULT_IS_NAMESPACE_MAPPING_ENABLED=true and changing the default timezone in DateUtil.java to Asia/Shanghai.

The real‑time architecture places MySQL/OMS/WMS data behind Maxwell, which streams JSON to Kafka; Spark Streaming consumes the data, writes to HBase via Phoenix JDBC, and stores results in Redis, Elasticsearch, and a BI tool (Superset/Saiku). Scheduling moved from Azkaban to Airflow and finally Oozie.
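The write path relies on Phoenix's UPSERT statement (insert‑or‑update), which keeps micro‑batch writes idempotent when Spark Streaming replays a batch after failure. A small sketch of building such a statement; the table and column names are illustrative, not the article's actual schema:

```python
def build_upsert(table: str, columns: list) -> str:
    """Build a parameterized Phoenix UPSERT statement for one table.

    UPSERT overwrites an existing row with the same key, so replaying
    a micro-batch produces the same end state as writing it once.
    """
    cols = ", ".join(columns)
    params = ", ".join("?" for _ in columns)
    return f"UPSERT INTO {table} ({cols}) VALUES ({params})"

# Hypothetical fact table for real-time orders:
sql = build_upsert("RT.FACT_ORDERS", ["ORDER_ID", "STATUS", "UPDATED_AT"])
```

Each Spark micro‑batch would bind its rows to this prepared statement over the Phoenix JDBC connection and commit per partition.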

Data modeling follows a classic star schema: base tables in Phoenix mirror MySQL structures, fact tables capture real‑time orders, and dimension tables store relatively static reference data. Data validation compares MySQL and HBase snapshots, triggering full‑table scans when mismatches occur.
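The validation step can be sketched as comparing per‑table snapshots taken at the same checkpoint and flagging any mismatch for a full‑table rescan. Comparing row counts, as below, is an assumption about the article's method; a production check might also compare checksums per key range.

```python
def diff_snapshots(mysql_counts: dict, hbase_counts: dict) -> list:
    """Return tables whose MySQL and HBase snapshots disagree.

    A table missing from either side also counts as a mismatch,
    since .get() yields None on one side only.
    """
    tables = set(mysql_counts) | set(hbase_counts)
    return sorted(
        t for t in tables
        if mysql_counts.get(t) != hbase_counts.get(t)
    )

# Hypothetical snapshot counts from both stores:
mismatched = diff_snapshots(
    {"orders": 10_000, "order_items": 35_000, "shipments": 4_200},
    {"orders": 10_000, "order_items": 34_998, "shipments": 4_200},
)
```

Only the mismatched tables are rescanned, keeping the full‑table comparison an exceptional rather than routine cost.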

Cluster tuning is divided into six areas: Linux parameters (file handles, threads, swap), HDFS settings (timeouts, handlers), HBase configurations (region server memory, compression SNAPPY), GC tuning (switching from CMS to G1), monitoring (read/write QPS, file descriptor usage, Zookeeper connections), and bug fixes (e.g., PHOENIX‑4056, SPARK‑22968). Monitoring dashboards track HBase read/write spikes, region server file handles, and Zookeeper connection counts to prevent service disruption.
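The switch from CMS to G1 typically lands in the region server's JVM options. The fragment below is an illustrative hbase-env.sh configuration with commonly used G1 flags, not the article's exact values:

```shell
# hbase-env.sh — illustrative G1 settings for the region server JVM
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=100 \
  -XX:InitiatingHeapOccupancyPercent=65 \
  -XX:+ParallelRefProcEnabled"
```

G1's pause‑time target makes long stop‑the‑world collections, a common cause of region server session timeouts under CMS, far less likely on large heaps.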

Two production incidents are described: a three‑node RIT caused by power loss, resolved by restarting region servers and using HBCK; and a high‑memory scenario leading to region server crashes, mitigated by restarting the service monitor. Both cases relied on Maxwell for data recovery.

Finally, the article includes author information, recruitment details for big‑data engineers, and promotional links, but the technical content provides a comprehensive guide to building and operating a real‑time HBase data warehouse.

Big Data · Kafka · HBase · real-time data warehouse · Spark Streaming · Phoenix
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
