Quantifying HBase Write Path: Disk and Network Costs for High‑Throughput Scenarios

This article analytically breaks down HBase's write pipeline, quantifies disk and network overheads for massive random writes, derives formulas for resource consumption under realistic assumptions, and offers concrete tuning recommendations to optimize throughput and reduce cost.

Youzan Coder

Dec 28, 2018

Overview

HBase, based on Google BigTable, is a highly reliable, high‑performance, scalable distributed storage system. This summary focuses on the write path and provides a quantitative analysis of resource consumption for workloads with a small amount of random reads and massive random writes.

HBase Write Path Overview

Writes are first buffered in the in‑memory MemStore. When the MemStore reaches a configured size, it is flushed asynchronously to an HFile on HDFS. At the same time each write is appended to the Write‑Ahead Log (WAL) to guarantee durability.
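The flow above can be pictured with a toy model. This is only an illustrative sketch, not HBase code: `ToyRegion`, its fields, and the byte accounting are hypothetical, and the flush runs synchronously here, whereas HBase flushes asynchronously.

```python
class ToyRegion:
    """Toy model of one region's write path: WAL append, MemStore buffer, flush."""

    def __init__(self, flush_size_bytes):
        self.flush_size_bytes = flush_size_bytes
        self.wal = []            # append-only log entries (stand-in for the WAL)
        self.memstore = {}       # in-memory buffer: rowkey -> value
        self.memstore_bytes = 0  # approximate size of the buffered data
        self.hfiles = []         # each flush produces one sorted, immutable file

    def put(self, rowkey, value):
        # Log the write for durability, then buffer it in memory.
        self.wal.append((rowkey, value))
        old = self.memstore.get(rowkey)
        if old is not None:
            self.memstore_bytes -= len(rowkey) + len(old)
        self.memstore[rowkey] = value
        self.memstore_bytes += len(rowkey) + len(value)
        if self.memstore_bytes >= self.flush_size_bytes:
            self.flush()

    def flush(self):
        # Write the buffered rows out as one sorted file and reset the buffer.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.memstore_bytes = 0


region = ToyRegion(flush_size_bytes=64)
for i in range(10):
    region.put(f"row{i:03d}".encode(), b"x" * 10)
print(len(region.wal), len(region.hfiles), len(region.memstore))
```

Each put in this example adds 16 bytes, so the 64-byte threshold is reached on every fourth write.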

Flush & Compaction

After each flush the number of HFiles grows, which can degrade read performance and increase system resource usage (e.g., HDFS block count, file descriptors). Compaction merges multiple HFiles into fewer ones, which bounds the file count per region, improves data locality, and cleans up expired versions and deletion markers. Flush and compaction run in independent threads and do not block each other.
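Conceptually, a compaction is a k‑way merge of already sorted files. The sketch below illustrates that idea only; the `(rowkey, seq, value)` tuple layout and the `compact` helper are assumptions made for the example, not HBase's on-disk format or API. It keeps the newest value per rowkey, which is the general case; under the no‑multi‑version assumption used later, nothing would be dropped.

```python
import heapq


def compact(hfiles):
    """Merge sorted files of (rowkey, seq, value) into one file.

    Each input file must be sorted by rowkey with one entry per rowkey.
    When a rowkey appears in several files, the entry with the highest
    sequence number (the newest write) wins.
    """
    # Negate seq so that, within one rowkey, the newest entry sorts first.
    keyed = [[(k, -seq, v) for k, seq, v in f] for f in hfiles]
    merged = []
    for rowkey, neg_seq, value in heapq.merge(*keyed):
        if not merged or merged[-1][0] != rowkey:
            merged.append((rowkey, -neg_seq, value))
    return merged


f1 = [(b"a", 1, b"v1"), (b"c", 2, b"v2")]
f2 = [(b"a", 3, b"v3"), (b"b", 4, b"v4")]
print(compact([f1, f2]))
```

Because every input is already sorted, the merge reads and writes each file sequentially, which is why compaction cost is dominated by bytes rewritten rather than by random I/O.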

System Overhead Quantitative Analysis

The analysis assumes a write‑heavy, event‑type workload with the following simplifying assumptions:

Rowkeys are uniformly distributed (no hot spots).

Write volume is known and data is pre‑partitioned, keeping region distribution stable.

Random reads are negligible and read latency is not a concern.

No multi‑version data or deletions; compaction does not reduce data size.

The write path does not involve random disk I/O, so random IOPS are not a bottleneck.

On a typical multi‑disk node, the aggregate sequential write throughput of the SATA disks exceeds the bandwidth of a 10 Gbps network link, so disk throughput is not modeled.

RPC bandwidth overhead is ignored.

System Variables

Data size per row s (bytes)

Peak write TPS T

HFile replica count R1 (default 3)

WAL replica count R2 (default 3)

WAL compression ratio Cwal (usually 1)

HFile compression ratio C (≈0.2 for DIFF+LZO)

Flush size F (≈128 MB)

Compaction minimum files CT (default 3)

Data TTL TTL (days)

Per‑node data volume D (TB)

Major compaction period M (days, default 20)

The analysis concentrates on two resource metrics: disk usage and network traffic.

Disk Capacity Quantification

Disk usage is modeled as:

V = TTL × 86400 × T × s × C × R1

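Example: s=1000, TTL=365, T=200000, C=0.2, R1=3 yields V ≈ 3.78 × 10^15 bytes (≈3,442 TiB); with a 30‑day TTL and otherwise identical parameters, V ≈ 3.11 × 10^14 bytes (≈283 TiB). Minor costs (WAL logs, temporary compaction files, snapshots, etc.) are not quantified.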
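The disk formula is easy to evaluate directly. The following is a minimal sketch; the function and parameter names are illustrative, and reporting in TiB (2^40 bytes) is an assumption about units made for this example.

```python
TIB = 2**40  # bytes per TiB


def disk_usage_bytes(ttl_days, tps, row_bytes, hfile_compression, hfile_replicas):
    """V = TTL x 86400 x T x s x C x R1, in bytes."""
    return ttl_days * 86400 * tps * row_bytes * hfile_compression * hfile_replicas


# V scales linearly with the retention period.
for ttl_days in (30, 365):
    v = disk_usage_bytes(ttl_days, tps=200_000, row_bytes=1000,
                         hfile_compression=0.2, hfile_replicas=3)
    print(f"TTL={ttl_days:>3} days: {v:.3e} bytes = {v / TIB:,.0f} TiB")
```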

Network Capacity Quantification

Network traffic originates from three independent stages: write path (WAL), flush (HFile write), and compaction (major & minor). Each stage is analyzed separately.

Write Path

Network inbound and outbound for WAL writes:

NInWrite = T × s × Cwal × (R2‑1) + (T × s)

NOutWrite = T × s × Cwal × (R2‑1)

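With T=200000, s=1000, Cwal=1, R2=3 the traffic is 6 × 10^8 bytes/s inbound and 4 × 10^8 bytes/s outbound, i.e. ≈572 MB/s and ≈381 MB/s when 1 MB is taken as 2^20 bytes, the convention used for the figures that follow.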
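A small helper makes the two terms explicit: the client payload arrives once (inbound only), and the WAL pipeline forwards it to the remaining R2‑1 replicas (counted both inbound and outbound). This is a sketch of the formula only; the names are illustrative, and both decimal MB and 2^20‑byte units are printed because the two conventions give different-looking numbers for the same traffic.

```python
MIB = 2**20  # bytes per MiB


def wal_network(tps, row_bytes, c_wal=1.0, r2=3):
    """Return (inbound, outbound) bytes/s caused by WAL writes."""
    client_in = tps * row_bytes                       # payload received from clients
    replication = tps * row_bytes * c_wal * (r2 - 1)  # WAL copies sent to other replicas
    return client_in + replication, replication


n_in, n_out = wal_network(tps=200_000, row_bytes=1000)
print(f"in : {n_in / 1e6:.0f} MB/s (decimal) = {n_in / MIB:.0f} MiB/s")
print(f"out: {n_out / 1e6:.0f} MB/s (decimal) = {n_out / MIB:.0f} MiB/s")
```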

Flush

Network traffic for moving flushed HFiles to HDFS:

NInFlush = s × T × (R1‑1) × C

NOutFlush = s × T × (R1‑1) × C

Using the same parameters and R1=3, C=0.2 yields ≈76 MB/s inbound and outbound.
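Flush traffic is proportional to the HFile compression ratio C, so it is worth seeing how the figure moves with C. The sketch below assumes illustrative names, and the alternative C values are hypothetical inputs rather than measured ratios for any codec.

```python
MIB = 2**20  # bytes per MiB


def flush_bytes_per_sec(tps, row_bytes, r1=3, c=0.2):
    """Inbound (= outbound) bytes/s for replicating flushed HFiles."""
    return row_bytes * tps * (r1 - 1) * c


for c in (0.1, 0.2, 0.5, 1.0):
    n = flush_bytes_per_sec(tps=200_000, row_bytes=1000, c=c)
    print(f"C={c:.1f}: {n / MIB:.1f} MiB/s")
```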

Major Compaction

Assuming data is locally read (short‑circuit) and only the first replica is written locally, the network cost per second is:

NInMajor = D × (R1‑1) / (M × 86400)

NOutMajor = D × (R1‑1) / (M × 86400)

For D=10 TB, R1=3, M=20 the traffic is about 12 MB/s inbound and outbound.
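The per‑second figure comes from spreading one full rewrite of a node's data over the major compaction period. A minimal sketch follows, assuming D is given in units of 2^40 bytes and M in days; the function name is illustrative.

```python
MIB = 2**20  # bytes per MiB
TIB = 2**40  # bytes per TiB


def major_compaction_bytes_per_sec(node_data_tib, r1=3, period_days=20):
    """Inbound (= outbound) bytes/s, averaged over one major compaction period.

    Only the R1-1 remote replicas cross the network; the first replica
    is assumed to be written locally.
    """
    return node_data_tib * TIB * (r1 - 1) / (period_days * 86400)


n = major_compaction_bytes_per_sec(node_data_tib=10)
print(f"{n / MIB:.1f} MiB/s")
```

This is an average: an actual major compaction concentrates the same bytes into a shorter window, so instantaneous traffic during the run is higher.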

Minor Compaction

The maximum number of minor compactions a row experiences, based on default thresholds, is 6. Network cost per second:

NInMinor = s × T × (R1‑1) × C × 6

NOutMinor = s × T × (R1‑1) × C × 6

Result: ≈458 MB/s inbound and outbound.
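Since the cost is the flush traffic multiplied by the number of rewrite rounds, the round count is the main lever. The sketch below treats the round count as a free parameter; the values 4 and 2 are hypothetical and only show the scaling, not what any particular configuration achieves.

```python
MIB = 2**20  # bytes per MiB


def minor_compaction_bytes_per_sec(tps, row_bytes, r1=3, c=0.2, rounds=6):
    """Inbound (= outbound) bytes/s when each row is rewritten `rounds` times."""
    return row_bytes * tps * (r1 - 1) * c * rounds


for rounds in (6, 4, 2):
    n = minor_compaction_bytes_per_sec(200_000, 1000, rounds=rounds)
    print(f"rounds={rounds}: {n / MIB:.1f} MiB/s")
```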

Overall Network Summary

Summing all components (write, flush, major, minor):

NInTotal = 572 MB/s + 76.3 MB/s + 12 MB/s + 457.8 MB/s = 1118.1 MB/s

NOutTotal = 381 MB/s + 76.3 MB/s + 12 MB/s + 457.8 MB/s = 927.1 MB/s

These figures represent the theoretical minimum under ideal conditions; real‑world traffic can be higher due to uneven partitioning, region splits, low locality, excessive small files, etc.
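For sizing other workloads, the four components can be combined into one small calculator. This is a sketch under the same simplifying assumptions as the analysis; the `Workload` class, its field names, and the choice of 2^20 and 2^40 byte units are assumptions made for the example.

```python
from dataclasses import dataclass

MIB = 2**20  # bytes per MiB


@dataclass
class Workload:
    tps: int = 200_000                   # T
    row_bytes: int = 1000                # s
    r1: int = 3                          # HFile replicas
    r2: int = 3                          # WAL replicas
    c_wal: float = 1.0                   # WAL compression ratio
    c: float = 0.2                       # HFile compression ratio
    node_data_bytes: float = 10 * 2**40  # D
    major_period_days: float = 20        # M
    minor_rounds: int = 6                # minor compaction rewrites per row


def network_totals(w):
    """Return (inbound, outbound) bytes/s summed over all four stages."""
    raw = w.tps * w.row_bytes
    wal_out = raw * w.c_wal * (w.r2 - 1)
    wal_in = wal_out + raw               # client payload is inbound only
    flush = raw * (w.r1 - 1) * w.c
    major = w.node_data_bytes * (w.r1 - 1) / (w.major_period_days * 86400)
    minor = flush * w.minor_rounds
    shared = flush + major + minor       # identical in both directions
    return wal_in + shared, wal_out + shared


n_in, n_out = network_totals(Workload())
print(f"in : {n_in / MIB:.1f} MiB/s")
print(f"out: {n_out / MIB:.1f} MiB/s")
```

With the default parameters the results land within about 1 MB/s of the totals above; the small difference comes from rounding the major compaction term to 12 MB/s.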

Practical Optimization Recommendations

Design rowkeys to avoid write hotspots during early adoption.

Increase hbase.hstore.compaction.min to reduce the number of compaction rounds a row undergoes.

Pre‑partition tables based on steady‑state load to minimize region splits.

For latency‑insensitive workloads, set hbase.hstore.compaction.max.size to ~4 GB to avoid large‑file compactions.

If data has TTL and no multi‑versioning, disable periodic major compaction and rely on file expiration.

Compress data before ingestion to lower WAL‑related network traffic, since the WAL is typically written uncompressed (Cwal ≈ 1).

Adjust MemStore memory ratio so each region can accumulate a full FlushSize before flushing, producing larger HFiles and reducing subsequent compaction cost.
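The last recommendation can be checked with rough arithmetic: the global MemStore budget divided by the flush size bounds how many actively written regions can each fill a whole flush. The sketch below is a simplification that assumes one column family per region, uniform write load across regions, and ignores the lower flush watermark; the helper name is hypothetical, while the two defaults correspond to hbase.regionserver.global.memstore.size (0.4) and hbase.hregion.memstore.flush.size (128 MB).

```python
GIB = 2**30  # bytes per GiB
MIB = 2**20  # bytes per MiB


def max_regions_with_full_flush(heap_bytes, memstore_fraction=0.4,
                                flush_size_bytes=128 * MIB):
    """Upper bound on active regions that can each buffer a full flush size."""
    memstore_budget = heap_bytes * memstore_fraction
    return int(memstore_budget // flush_size_bytes)


print(max_regions_with_full_flush(32 * GIB))
```

If a RegionServer hosts more actively written regions than this bound, flushes are forced by global memory pressure before regions reach the flush size, producing smaller HFiles and more compaction work.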

Conclusion

The analysis provides a formula‑driven evaluation of HBase write‑path resource consumption for high‑throughput scenarios, based on HBase 1.2.6. The quantitative framework helps practitioners size clusters, predict disk and network budgets, and identify effective tuning knobs.

References

Google BigTable – https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

HBase Official Site – http://hbase.apache.org/
