How Huolala Solved HBase Bulkload Challenges: A Practical Guide

This article details Huolala’s experience building a unified Hive‑to‑HBase pipeline, addressing low development efficiency, lack of monitoring, and HBase instability by evaluating two architectures, implementing a generic Transform tool, optimizing compaction and DistCp, and establishing stability and data‑validation mechanisms.

Huolala Tech

May 25, 2023

Introduction

HBase is a high‑availability, high‑performance NoSQL database built on Hadoop. At Huolala it serves as online storage for risk control, maps, real‑time tags, and other critical business scenarios. In production, large volumes of T+1 (next‑day) data are generated from Hive every day and imported into HBase.

Problems and Challenges

Initially, each business team wrote its own Hive‑to‑HBase Bulkload code. This led to three problems:

Low development efficiency – every team duplicated the same work.

No pipeline guarantees – there was no end‑to‑end monitoring, so failures and delays went unnoticed.

Impact on HBase stability – Bulkload at peak times caused online incidents, including cross‑cluster HFile loads that left HBase unavailable for 20 minutes.

A unified Hive‑to‑HBase tool with stability guarantees was therefore needed. It had to be simple and generic, observable with alerting, and controllable through an approval workflow.

Research

Two typical architectures for Hive‑to‑HBase were evaluated.

Solution 1: Spark/MR reads Hive, writes HFiles directly to the online HBase cluster, then LoadIncrementalHFiles loads them.

Advantages and disadvantages are shown in the diagram.

Solution 2: Write HFiles to the offline cluster, then use Hadoop DistCp to copy them to the online HBase cluster.

Solution 1 offers no control over copy speed, which poses a stability risk to the online cluster, so Solution 2 was chosen.

Implementation

Transform

The unified Transform script supports multiple RowKey strategies and column‑name mapping, and is packaged as a template task in the data‑development platform.

Custom RowKey generation (hash, salt, field slicing).

Column name mapping between Hive and HBase.
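The RowKey strategies and column mapping listed above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `RowKeySketch`, its method names, the `_` separator, the 4‑character MD5 prefix, and the two‑digit bucket format are all hypothetical and are not taken from the Transform tool itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; names and formats are illustrative, not the Transform tool's API. */
final class RowKeySketch {

    /** Hash strategy: prefix the key with the first 4 hex characters of its MD5 digest. */
    static String hashPrefixed(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            String prefix = String.format("%02x%02x", digest[0], digest[1]);
            return prefix + "_" + key;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Salt strategy: prefix with a deterministic bucket number (assumes buckets <= 100). */
    static String salted(String key, int buckets) {
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return String.format("%02d_%s", bucket, key);
    }

    /** Field slicing: keep characters [from, to) of a field, clamped to its length. */
    static String slice(String field, int from, int to) {
        int end = Math.min(to, field.length());
        int start = Math.min(from, end);
        return field.substring(start, end);
    }

    /** Renames Hive columns to HBase qualifiers; unmapped columns keep their name. */
    static Map<String, String> mapColumns(Map<String, String> hiveRow,
                                          Map<String, String> mapping) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : hiveRow.entrySet()) {
            out.put(mapping.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }
}
```

Both the hash and the salt prefix spread sequential keys across Regions. The salt variant keeps the bucket count explicit, which makes it easier to align with a table's pre‑split Regions.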

During the gray release (canary rollout), two issues were observed:

Compaction resource contention – the many small HFiles produced by Bulkload triggered a CPU/IO spike at the next CompactionChecker run.

Low data locality raising P99 latency – DistCp placed HFile blocks on arbitrary DataNodes, so RegionServers often had to read them remotely.

Solutions applied:

Merge tasks to reduce HFile count per Region.

Adjust table‑level compaction settings to avoid compaction after Bulkload.

Run a dedicated Major Compaction tool during off‑peak hours for tables without TTL.

Enhance DistCp to support favored nodes for better locality.

Compaction Tool

After optimizations, HFile count and bulk‑load‑induced compaction spikes were mitigated. Three scheduling strategies were implemented:

OffpeakCompact – runs only in low‑traffic periods with region election.

TimeCompact – triggers based on elapsed time since last major compaction.

FileNumberCompact – triggers when HFile count exceeds a threshold.

In production, HFile counts stayed under control and compactions no longer ran at peak time.
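The trigger conditions behind the three strategies can be expressed as simple predicates. The sketch below is an assumption about how such checks might look. The class `CompactionTriggerSketch`, its method names, and the window and threshold parameters are hypothetical, and the region election step of OffpeakCompact is not modelled.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical trigger checks; names and parameters are illustrative only. */
final class CompactionTriggerSketch {

    /** OffpeakCompact: true when now is in [start, end); the window may wrap past midnight. */
    static boolean inOffPeakWindow(LocalTime now, LocalTime start, LocalTime end) {
        if (start.equals(end)) {
            return false; // an empty window never matches
        }
        if (start.isBefore(end)) {
            return !now.isBefore(start) && now.isBefore(end);
        }
        // Wrapping window such as 23:00 to 06:00.
        return !now.isBefore(start) || now.isBefore(end);
    }

    /** TimeCompact: true once the time since the last major compaction reaches the interval. */
    static boolean dueByTime(Duration sinceLastMajor, Duration maxInterval) {
        return sinceLastMajor.compareTo(maxInterval) >= 0;
    }

    /** FileNumberCompact: true when the HFile count exceeds the threshold. */
    static boolean dueByFileCount(int hfileCount, int threshold) {
        return hfileCount > threshold;
    }
}
```

A scheduler could combine these checks, for example by compacting a Region only when it is due by time or file count and the cluster is inside its off‑peak window.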

DistCp Enhancements

To meet strict latency (P99, P999) requirements, DistCp was extended with:

FavoredNodes specification per HFile.

Multi‑cluster copy with simple configuration.

Related JIRA tickets: HADOOP-18629, HBASE-27670, HBASE-27733.
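One way to pick a favored node for an HFile is to look up which Region will own it and prefer the host serving that Region. The sketch below illustrates only that lookup and is an assumption, not the logic of the patches listed above. `FavoredNodeSketch` and `favoredHost` are hypothetical names, and row keys are compared as Java strings rather than as raw bytes to keep the example short.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical lookup; simplified to String keys instead of byte[] row keys. */
final class FavoredNodeSketch {

    /**
     * Returns the host serving the Region that contains the HFile's first row.
     * regionStartToHost maps each Region's start key to its serving host;
     * the first Region of a table has the empty start key "".
     */
    static String favoredHost(TreeMap<String, String> regionStartToHost,
                              String hfileFirstRow) {
        // floorEntry finds the Region with the greatest start key <= the row.
        Map.Entry<String, String> region = regionStartToHost.floorEntry(hfileFirstRow);
        return region == null ? null : region.getValue();
    }
}
```

The returned host would then be passed to the copy step as a placement hint, so that one replica of the HFile lands on the machine that will read it.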

Data Validation

Bulkload quality is verified at each stage: row counts are recorded in Transform, HFile directory sizes are compared before and after DistCp, and Load metrics are monitored. Sampled RowKey queries then confirm that the data is readable.
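The size comparison around DistCp can be sketched as a diff of two file listings. This is a finer‑grained, per‑file variant of the directory‑size check and rests on assumptions: `CopyValidationSketch` and `mismatches` are hypothetical names, and the listings are assumed to be maps from relative path to size in bytes, collected separately on the source and target clusters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical check comparing file listings taken before and after the copy. */
final class CopyValidationSketch {

    /** Returns one message per file that is missing, unexpected, or differs in size. */
    static List<String> mismatches(Map<String, Long> source, Map<String, Long> target) {
        List<String> problems = new ArrayList<>();
        // Visit the union of both listings in sorted order for stable output.
        TreeSet<String> paths = new TreeSet<>(source.keySet());
        paths.addAll(target.keySet());
        for (String path : paths) {
            Long s = source.get(path);
            Long t = target.get(path);
            if (s == null) {
                problems.add("unexpected in target: " + path);
            } else if (t == null) {
                problems.add("missing in target: " + path);
            } else if (!s.equals(t)) {
                problems.add("size differs: " + path + " (" + s + " vs " + t + ")");
            }
        }
        return problems;
    }
}
```

An empty result means the copy matches by size, and the pipeline could then proceed to the Load step. Any message would block the load and raise an alert.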

Stability Assurance

A comprehensive data‑link stability solution was built and is being rolled out.

Conclusion

This article shares Huolala’s pain points, design decisions, and practical implementation of a reliable offline data pipeline into HBase, as a reference for teams facing similar problems.

// 1. Building the HFile writer context, in HFileOutputFormat2.getNewWriter()
HFileContextBuilder contextBuilder = new HFileContextBuilder()
    .withCompression(compression)
    .withChecksumType(HStore.getChecksumType(conf))
    .withBytesPerCheckSum(HStore.getBytesPerChecksum(conf))
    .withBlockSize(blockSize);

// 2. Splitting an HFile during load, in LoadIncrementalHFiles.copyHFileHalf()
HFileContext hFileContext = new HFileContextBuilder()
    .withCompression(compression)
    .withChecksumType(HStore.getChecksumType(conf))
    .withBytesPerCheckSum(HStore.getBytesPerChecksum(conf))
    .withBlockSize(blocksize)
    .withDataBlockEncoding(familyDescriptor.getDataBlockEncoding())
    .withIncludesTags(true)
    .build();
halfWriter = new StoreFileWriter.Builder(conf, cacheConf, fs)
    .withFilePath(outFile)
    .withBloomType(bloomFilterType)
    .withFileContext(hFileContext)
    .build();

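// 3. Off-peak compaction selection in HStore (excerpt).
// The static flag lets only one compaction at a time claim the off-peak settings.
private static final AtomicBoolean offPeakCompactionTracker = new AtomicBoolean();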

// Normal case - coprocessor is not overriding file selection.
if (!compaction.hasSelection()) {
    boolean isUserCompaction = priority == Store.PRIORITY_USER;
    boolean mayUseOffPeak =
        offPeakHours.isOffPeakHour() && offPeakCompactionTracker.compareAndSet(false, true);
    try {
        compaction.select(this.filesCompacting, isUserCompaction, mayUseOffPeak,
            forceMajor && filesCompacting.isEmpty());
    } catch (IOException e) {
        if (mayUseOffPeak) {
            offPeakCompactionTracker.set(false);
        }
        throw e;
    }
    assert compaction.hasSelection();
    if (mayUseOffPeak && !compaction.getRequest().isOffPeak()) {
        // Compaction policy doesn't want to take advantage of off-peak.
        offPeakCompactionTracker.set(false);
    }
}
