Big Data 12 min read

An Overview of Apache Kudu: Architecture, Table Design, and Storage Details

This article provides a comprehensive introduction to Apache Kudu, covering its origins, cluster architecture with Raft consensus, schema‑based table and partition design, and the intricate storage engine that combines in‑memory and on‑disk structures to deliver fast OLTP and OLAP capabilities on fast data.

Big Data Technology & Architecture

Jul 14, 2019

An Overview of Apache Kudu: Architecture, Table Design, and Storage Details

Prologue

Kudu is a relatively young component in the big‑data ecosystem, originally an internal storage project at Cloudera written in C++. Its 1.0 version was released in September 2016 and the latest is 1.9. As a column‑oriented storage engine, it targets "fast analytics on fast data" and has been used in calendar data analysis workloads.

Kudu's Motivation

Before Kudu, large‑scale data in distributed systems was handled in two ways: static data persisted in HDFS as Parquet/ORC for batch OLAP, and dynamic data stored in NoSQL systems such as HBase or Cassandra for low‑latency OLTP. The former lacks update/delete support and random‑access efficiency, while the latter performs poorly for bulk analytics. Traditional workarounds either layered OLAP on top of NoSQL (incurring performance loss) or duplicated online data to HDFS (doubling storage cost and sacrificing freshness). Kudu was designed to bridge this OLTP‑OLAP gap.

Cluster Architecture and Consensus Guarantees

Kudu adopts a master‑slave architecture similar to HBase. A single Master (catalog manager, coordinator, and tablet directory) oversees metadata and tablet placement, while multiple Tablet Servers (TServers) host tablets and serve read/write requests. Kudu does not rely on ZooKeeper; instead it implements its own Raft‑based consensus. Multiple Masters may exist, but only one acts as the leader at any time, and each tablet is replicated an odd number of times (typically three) with a leader handling writes and all replicas serving reads.

When a client updates data, it first contacts the Master to locate the tablet leader, caches this metadata, and only re‑queries the Master if the leader changes.

Table and Partition Design

Kudu tables are schema‑based and column‑oriented, unlike schema‑less NoSQL stores. Each column’s type is explicitly declared, and Kudu applies type‑specific compression encodings. A primary‑key group (one or more columns) provides a unique clustered index similar to traditional RDBMS. Kudu supports two partitioning methods: hash partitioning (like Cassandra) and range partitioning (like HBase), which can be combined for flexible data distribution.

Example DDL:

CREATE TABLE tmp.metrics (
    host STRING NOT NULL,
    metric STRING NOT NULL,
    time INT NOT NULL,
    value1 DOUBLE NOT NULL,
    value2 STRING,
    PRIMARY KEY (host, metric, time)
)
PARTITION BY HASH (host, metric) PARTITIONS 4,
RANGE (time) (
    PARTITION VALUES < 20140101,
    PARTITION 20140101 <= VALUES < 20150101,
    PARTITION 20150101 <= VALUES < 20160101,
    PARTITION 20160101 <= VALUES < 20170101,
    PARTITION 20170101 <= VALUES
)
STORED AS KUDU;

The table uses a three‑column primary key, hash‑partitions on two string columns, and range‑partitions on a time column, resulting in an orthogonal partition layout.

Underlying Storage Design Details

Kudu implements its own storage layer rather than relying on HDFS. Data is organized into tablets, each split into multiple RowSets. An in‑memory RowSet (MemRowSet) is a B+‑tree keyed by the primary key; updates are appended as MVCC chains to leaf nodes. When MemRowSet reaches its size limit (default 32 MB), it is flushed to disk, forming a DiskRowSet.

On‑disk data is stored in column‑oriented CFiles. Changes after a flush are kept in a DeltaMemStore, which is later persisted as RedoFiles (similar to redo logs). UndoFiles capture the state before the last flush, enabling time‑travel queries. Compaction merges RedoFiles (minor) or rewrites all changes back into BaseData (major), and also merges multiple DiskRowSets.

To locate a key efficiently among many DiskRowSets, Kudu maintains an interval‑tree index (a variant of a red‑black binary search tree) that stores the min and max keys of each RowSet, allowing O(log n) lookup.

— THE END —

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Raft consensus Table Design Kudu

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.