
Understanding HBase Scan Process and Its Performance Compared to Parquet and Kudu

The article explains why HBase read operations are complex due to its LSM‑Tree storage and multi‑version design, details the step‑by‑step Scan workflow, discusses the reasons for its multi‑request architecture, compares scan performance with Parquet and Kudu, and offers recommendations for large‑scale data scanning.

Big Data Technology Architecture

HBase read operations are more complex than simple Get requests for two reasons. First, the storage engine is based on an LSM‑Tree, so a range query may touch multiple Regions, blocks, and files. Second, updates and deletions are not applied in place: new versions are written with timestamps, and deletions are recorded as delete markers (tombstones), so every read must reconcile multiple versions of the same cell.
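To make the multi-version reconciliation concrete, here is a minimal sketch (not HBase's actual code) of how cells carrying timestamps and delete markers can be resolved at read time; the `Cell` structure and field names are illustrative assumptions.

```python
# Sketch: resolving multi-version cells with timestamps and delete
# markers (tombstones). Structure and names are illustrative, not
# HBase's real internal types.
from collections import namedtuple

Cell = namedtuple("Cell", ["row", "column", "timestamp", "value", "is_delete"])

def resolve_latest(cells):
    """Return the visible value per (row, column), honoring tombstones."""
    latest = {}
    # Newest timestamp wins; a delete marker hides the cell entirely.
    for cell in sorted(cells, key=lambda c: c.timestamp, reverse=True):
        key = (cell.row, cell.column)
        if key not in latest:
            latest[key] = None if cell.is_delete else cell.value
    # Drop keys whose newest entry was a delete marker.
    return {k: v for k, v in latest.items() if v is not None}

cells = [
    Cell("r1", "cf:a", 100, "old", False),
    Cell("r1", "cf:a", 200, "new", False),   # newer version shadows "old"
    Cell("r2", "cf:a", 150, "gone", False),
    Cell("r2", "cf:a", 300, None, True),     # tombstone hides r2 entirely
]
print(resolve_latest(cells))  # {('r1', 'cf:a'): 'new'}
```

The point of the sketch: none of this reconciliation is needed in an immutable, append-only format, which is one reason reads are cheaper there.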

The Scan process works as follows: the client first checks its local result cache; on a miss, it issues a request to the RegionServer, which merges data from the BlockCache, HFiles, and the MemStore row by row until roughly 100 rows (the default caching size) are collected. That batch is cached on the client and returned to the upper‑level business logic, and the client keeps fetching 100‑row batches until all data is retrieved.
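The client-side loop described above can be sketched as follows; `fetch_batch` is a hypothetical stand-in for the per-batch RPC, and the 250-row toy table is invented for illustration.

```python
# Sketch of the client-side scan batching loop: repeatedly fetch
# `caching`-row batches until the server signals the scan is exhausted.
# fetch_batch(start_row, limit) is a hypothetical RPC returning
# (rows, resume_key_or_None).
def scan_all(fetch_batch, caching=100):
    """Drive a scan by fetching batches of `caching` rows at a time."""
    results, start = [], ""
    while True:
        rows, next_start = fetch_batch(start, caching)
        results.extend(rows)          # client-side cache of the batch
        if next_start is None:        # end of scan
            return results
        start = next_start

# Toy server: 250 rows served in sorted order, keyed row000..row249.
TABLE = [f"row{i:03d}" for i in range(250)]

def fetch_batch(start, limit):
    idx = next((i for i, r in enumerate(TABLE) if r >= start), len(TABLE))
    batch = TABLE[idx:idx + limit]
    nxt = TABLE[idx + limit] if idx + limit < len(TABLE) else None
    return batch, nxt

print(len(scan_all(fetch_batch)))  # 250 rows, fetched in three round trips
```

Note how the batch size directly trades round trips against per-request memory, which is exactly the tension the next paragraph explains.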

This multi‑request design is intentional: limiting the amount of data per request prevents bandwidth saturation, avoids client‑side OOM, and reduces the risk of server‑side timeouts during large scans.

Unlike batch get operations that are grouped by region and executed in parallel, Scan operations are sequential and not parallelized.
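The contrast can be sketched as follows: Gets are grouped by the Region that owns each row key and the groups run in parallel, while a Scan walks Regions one after another. The region map and thread pool here are illustrative assumptions, not HBase client internals.

```python
# Sketch: batch Get groups row keys by owning region and fetches the
# groups in parallel; a Scan has no such parallelism. Region boundaries
# and fetch_region are invented for illustration.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical region map: region name -> [start_key, end_key).
REGIONS = {"region-a": ("", "m"), "region-b": ("m", "\uffff")}

def region_for(row):
    for name, (lo, hi) in REGIONS.items():
        if lo <= row < hi:
            return name
    raise KeyError(row)

def batch_get(rows, fetch_region):
    """Group row keys by region, then fetch each group in parallel."""
    groups = defaultdict(list)
    for row in rows:
        groups[region_for(row)].append(row)
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch_region, name, keys)
                   for name, keys in groups.items()]
        return [row for f in futures for row in f.result()]

# Toy region server: just echoes the requested keys.
def fetch_region(name, keys):
    return [(name, k) for k in keys]

out = batch_get(["apple", "zebra", "kiwi", "nut"], fetch_region)
print(sorted(out))
```

With this grouping, total latency for a batch Get is roughly that of the slowest region rather than the sum over regions, which a sequential Scan cannot achieve.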

Scan performance heavily depends on the volume of data; for small OLTP workloads Scan may be acceptable, but for large data sets the performance cannot be guaranteed.

Although HBase is often called a column‑oriented store, it actually uses a column‑family model that behaves more like a row‑store, and its scans are essentially random reads rather than the sequential reads possible with true columnar formats like Parquet.
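The layout difference can be sketched in a few lines: in HBase's KeyValue format each cell carries its row key and column qualifier and cells are interleaved row by row, so projecting one column still walks every cell, whereas a columnar file stores the same column contiguously. The data below is invented for illustration.

```python
# Sketch: interleaved row-oriented KeyValue layout vs. contiguous
# columnar layout. Toy data; not real HBase or Parquet encodings.
rows = [{"id": i, "name": f"n{i}", "age": 20 + i} for i in range(3)]

# HBase-style: interleaved (rowkey, column, value) cells.
kv_store = [(r["id"], col, r[col]) for r in rows for col in ("name", "age")]

# Parquet-style: one contiguous array per column.
columnar = {"name": [r["name"] for r in rows],
            "age": [r["age"] for r in rows]}

# Projecting "age" from the KV layout must inspect every cell...
ages_kv = [v for _, col, v in kv_store if col == "age"]
# ...while the columnar layout reads one contiguous array directly.
ages_col = columnar["age"]
print(ages_kv, ages_col)  # [20, 21, 22] [20, 21, 22]
```

On disk, the columnar version also compresses and vectorizes far better, which is the rest of Parquet's scan advantage.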

The inability to perform efficient sequential scans stems from HBase's support for updates and multi‑version data: each column family accumulates many HFiles (plus the MemStore), and a read must merge across all of them to locate the correct version of each cell.
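The cross-file lookup amounts to a k-way merge: each HFile is sorted by key, but the same row may appear in several files, so the reader merges all runs and keeps only the newest timestamp per key. A minimal sketch, with invented file contents:

```python
# Sketch: k-way merge across sorted files, keeping the newest version
# of each row. File contents are illustrative assumptions.
import heapq

def merged_scan(files):
    """Merge sorted (row, timestamp, value) runs; newest version wins."""
    # Each input run is sorted by (row, -timestamp), so for equal rows
    # the newest cell is emitted first by heapq.merge.
    merged = heapq.merge(*files, key=lambda c: (c[0], -c[1]))
    seen, out = set(), []
    for row, ts, value in merged:
        if row not in seen:          # first hit per row is the newest
            seen.add(row)
            out.append((row, value))
    return out

hfile1 = [("r1", 100, "v1"), ("r3", 100, "v3")]
hfile2 = [("r1", 200, "v1-new"), ("r2", 150, "v2")]  # newer r1 version
print(merged_scan([hfile1, hfile2]))
# [('r1', 'v1-new'), ('r2', 'v2'), ('r3', 'v3')]
```

Every extra file adds another run to this merge, which is why scan cost grows with the number of un-compacted HFiles.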

For large‑scale scans, exporting HBase data to Parquet is recommended. Kudu offers intermediate performance—better than HBase because it is a pure column store without random seeks, but slower than Parquet due to its LSM structure that still requires scanning multiple files and merging keys.

In summary, HBase’s architecture introduces inherent disadvantages for scan operations compared to Parquet, primarily because of its LSM‑Tree design, multi‑version handling, and lack of true columnar sequential access.

References:
- http://hbasefly.com/2016/12/21/hbase-getorscan/
- http://hbasefly.com/2017/10/29/hbase-scan-3/
