
Applying HBase in a Risk‑Control System and High‑Availability Practices

This article summarizes Guo Dongdong’s presentation on leveraging HBase for a risk‑control platform, detailing its architecture, data import/export mechanisms, indexing, region server recovery challenges, monitoring, SQL interception, dual‑cluster high‑availability, and future enhancements for large‑scale, low‑latency big‑data services.

DataFunTalk

This article was edited and compiled from "Applying HBase in a Risk‑Control System and High‑Availability Practices," a talk given by Guo Dongdong, senior technical expert at WuCai, at the 3rd China HBase Technology Community MeetUp in Hangzhou.

I mainly work on distributed services, distributed computing, and big‑data research, and I am currently responsible for the big‑data platform at WuCai, which is built on HBase. From a business perspective, each business line has its own MySQL instance, while risk control needs an independent data mart with fast data acquisition for rapid deployment. From a data perspective, we have hundreds of billions of rows to import, and HBase supports massive ingestion with millisecond‑level read latency. Combined with Hive and Spark, HBase enables large‑scale analytics for risk analysis.

HBase's architecture supports binlog synchronization and schema management: upstream schema definitions are converted into table schemas that can be queried with SQL, which simplifies development by allowing row‑key queries instead of joining on IDs. It also provides secondary‑index support, which is crucial for the diverse query patterns in risk control.
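To illustrate why row‑key access simplifies development: a composite row key can encode the identifiers a risk query needs, so a lookup becomes a single get or short prefix scan rather than an ID join. The sketch below is a minimal illustration of such a key layout; the field names, salt width, and delimiter are assumptions for this example, not the production schema:

```python
import hashlib

def make_row_key(business_line: str, user_id: str, event_ts: int) -> bytes:
    """Build a composite row key: a short hash salt spreads writes across
    regions, while business_line/user_id keep one user's rows adjacent."""
    salt = hashlib.md5(user_id.encode()).hexdigest()[:2]  # 2-char salt prefix
    return f"{salt}|{business_line}|{user_id}|{event_ts:013d}".encode()

def scan_prefix(business_line: str, user_id: str) -> bytes:
    """Prefix covering all events of one user in one business line,
    usable as the start of a prefix scan."""
    salt = hashlib.md5(user_id.encode()).hexdigest()[:2]
    return f"{salt}|{business_line}|{user_id}|".encode()
```

Because the salt is derived deterministically from the user ID, a reader can reconstruct the prefix without knowing where the row landed.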

To move data into the warehouse, we built an export tool that leverages HBase's export capability to extract files, which are then parsed with Spark and loaded into the warehouse. Incremental export lets us export only the current day's data, minimizing impact on the database. Bulk load is highly efficient: we import tables with billions of rows into HBase daily, generating index tables at the same time. Unlike databases that require row‑by‑row SQL inserts, HBase can ingest pre‑built files directly while still serving large‑scale online traffic.
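The core of incremental export is selecting only cells whose write timestamp falls inside the export day, so a daily job touches a fraction of the table instead of a full dump. A minimal sketch of that filter, assuming each exported record carries its write timestamp in milliseconds (the tuple layout here is illustrative, not the tool's actual format):

```python
from datetime import datetime, timezone

def incremental_filter(rows, day: str):
    """Yield only (key, value) pairs written on the given UTC day.
    Each input row is a (key, value, write_ts_millis) tuple."""
    start = int(datetime.strptime(day, "%Y-%m-%d")
                .replace(tzinfo=timezone.utc).timestamp() * 1000)
    end = start + 24 * 3600 * 1000  # exclusive upper bound: next midnight
    for key, value, ts in rows:
        if start <= ts < end:
            yield key, value
```

In HBase itself the same effect is achieved by setting a time range on the export scan, so the filtering happens server‑side rather than after extraction.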

The HBase‑based risk control system is now online. Other business lines also use HBase via a gateway that exposes simple interfaces such as get and scan, with permission checks. This “database‑as‑a‑service” approach allows features like daily full‑data updates and real‑time queries for downstream services.
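The gateway pattern described above can be sketched as a thin layer that exposes only get and scan and checks per‑table permissions before forwarding to the underlying client. This is a simplified stand‑in, not the production gateway; the ACL shape and method names are assumptions:

```python
class HBaseGateway:
    """Minimal gateway sketch: only get/scan are exposed, and every call
    is permission-checked before reaching the backing client."""

    def __init__(self, client, acl):
        self.client = client  # anything with get(table, key) / scan(table, prefix)
        self.acl = acl        # {user: {table, ...}} — tables each user may read

    def _check(self, user, table):
        if table not in self.acl.get(user, set()):
            raise PermissionError(f"{user} may not read {table}")

    def get(self, user, table, row_key):
        self._check(user, table)
        return self.client.get(table, row_key)

    def scan(self, user, table, prefix, limit=100):
        self._check(user, table)
        # Cap result size at the gateway so no caller can pull unbounded data.
        return self.client.scan(table, prefix)[:limit]
```

Enforcing the limit in the gateway rather than trusting callers is what makes the "database‑as‑a‑service" interface safe to hand to many business lines.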

After six months of operation, the user base grew from a single risk‑control client to dozens, data volume increased twenty‑fold, and traffic rose hundreds of times, reaching thousands of times during peak periods. This caused CPU usage spikes, ambiguous query sources, and region‑server hangs that could take hours to recover, despite HBase’s automatic failover.

When a region server crashes, Phoenix’s index component exacerbates the problem. During writes, a Coprocessor Host updates both the data table and the index table, generating an indexer and a WAL updater. Even if the region server fails, the WAL ensures consistency, replaying updates to the index region. However, if the index region is pending, replay can fail, and Phoenix’s Index Write Error Handler may disable the index table and abort the region server, leading to full‑table scans that consume massive memory.

Without indexes, a region recovery (from Opening to Open) takes only a few seconds, but with memstore flushes and WAL replay it can take minutes, especially when a server hosts hundreds of regions.

To mitigate these issues we added upper‑layer controls rather than modifying HBase itself. First, we implemented monitoring that triggers alerts if recovery exceeds ten seconds. Second, we introduced SQL interception rules (e.g., allowing only RowKey scans, limiting full scans, restricting Count on small tables).
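A rule‑based interceptor like the one described can be sketched as an ordered list of predicates over the normalized SQL text, rejecting a query as soon as one matches. The specific rules below are illustrative examples in the spirit of the ones listed above, not the production rule set:

```python
import re

RULES = [
    # (predicate over normalized SQL, rejection reason) — illustrative only
    (lambda sql: " select " in sql and " where " not in sql,
     "full-table scan: no WHERE clause"),
    (lambda sql: re.search(r"\bcount\s*\(", sql) is not None
                 and " limit " not in sql,
     "unbounded COUNT"),
]

def intercept(sql: str):
    """Return None if the query passes all rules, else the rejection reason."""
    normalized = " " + " ".join(sql.lower().split()) + " "
    for predicate, reason in RULES:
        if predicate(normalized):
            return reason
    return None
```

Keeping the rules as data rather than code means new interception patterns can be rolled out without redeploying the gateway.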

Beyond rule‑based interception, we built cost‑based interception using Phoenix mechanisms. We define guide posts (e.g., allowing scans of up to 400 k rows between specific RowKeys) and perform fine‑grained table splitting. When a rule cannot cover a scenario, an ad‑hoc interception can be added in real time. We also support user‑level throttling, temporarily disabling users or limiting writes on hot tables.

These improvements reduced cluster recovery time from two hours to fifteen minutes, mainly by speeding up the replay of roughly 10 GB of memstore data.

Other issues such as DNS resolution failures, disk crashes, and human errors prompted a five‑minute recovery SLA: from detection to resolution and service restoration must be under five minutes. We adopted a dual‑cluster strategy inspired by MySQL, replicating writes to a standby cluster. When one cluster fails, traffic switches to the other, with data replication latency of 10‑20 seconds. The standby cluster can serve cold queries and pre‑warm caches to reduce load on the primary.
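The client side of the dual‑cluster pattern can be sketched as a wrapper that reads from the primary and falls back to the standby on failure. Because replication lags by 10–20 seconds, callers must tolerate slightly stale reads from the standby; the class and method names here are assumptions for illustration:

```python
class DualClusterClient:
    """Sketch of dual-cluster failover: serve from the primary, fall back
    to the standby when the primary is unreachable. The standby may be
    10-20 s behind due to replication lag."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def get(self, row_key):
        try:
            return self.primary.get(row_key)
        except ConnectionError:
            # Primary down: return possibly stale data from the standby
            # rather than failing the caller outright.
            return self.standby.get(row_key)
```

The same wrapper is also a natural place to route cold queries to the standby deliberately, which is how the standby pre‑warms its caches and offloads the primary.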

Today HBase is a core big‑data storage service in our platform, handling tens of terabytes, peak QPS of 200‑300 k, and billions of daily SQL queries. We are exploring further high‑availability features such as Region Replication for sub‑second failover, multi‑tenant isolation, and HBase 2.0 compatibility. We also aim to replace Phoenix with a custom SQL engine built on SparkSQL, translating SQL to Java for low‑latency in‑memory computation.

Author Introduction: Guo Dongdong, senior technical expert at WuCai’s data platform. Holds a master’s degree from the Chinese Academy of Sciences, previously at Tonghuashun, now responsible for WuCai’s big‑data platform construction. Passionate about distributed crawling, distributed databases, and big‑data computing.

—END—

distributed systems · Big Data · High Availability · HBase · risk control · SQL Interception · Phoenix
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
