An Overview of Apache Kudu: Architecture, Table Design, and Storage Details

This article provides a comprehensive introduction to Apache Kudu, covering its origins, cluster architecture with Raft consensus, table schema and partitioning design, and detailed storage mechanisms including MemRowSet, DiskRowSet, CFile, and compaction processes.

Big Data Technology & Architecture

Jul 28, 2019

Prologue

Kudu is a relatively young columnar storage engine in the big-data ecosystem. It was originally developed internally at Cloudera in C++, and version 1.0 was released in September 2016 (1.9 was the latest release when this article was written). Its stated goal is "fast analytics on fast data", and it has been used in production for calendar-data analytics.

Kudu’s Purpose

Before Kudu, large‑scale data was handled either by batch‑oriented columnar formats (Parquet, ORC) stored on HDFS for OLAP, or by NoSQL stores (HBase, Cassandra) for OLTP. Neither approach satisfied workloads that require both real‑time reads/writes and multi‑dimensional analytics. Kudu was designed to fill this gap.

Cluster Architecture and Consensus Guarantees

Kudu follows a master-slave model similar to HBase. The Master coordinates metadata, tablet placement, and recovery, while multiple Tablet Servers (TServers) host the tablets that store rows. For high availability, several Masters can be deployed and kept consistent with the Raft protocol; only one of them is the leader at a time. Tablets are replicated the same way: each tablet has one leader replica and several follower replicas, and the total number of replicas is odd (typically three) so that a majority always exists.
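The preference for an odd replica count follows from Raft's majority rule. The short sketch below is only an illustration of that arithmetic; the function names are made up for this example and are not part of any Kudu API.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


for n in (3, 4, 5):
    print(f"replicas={n} majority={majority(n)} tolerated={tolerated_failures(n)}")
```

Going from three replicas to four raises the majority size without tolerating any additional failure, which is why even counts are usually avoided.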

Clients cache tablet location metadata and go back to the Master only when the cached information is stale, for example after a tablet's leader has changed.
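The caching behaviour can be pictured with a toy model. Everything here is hypothetical (`FakeMaster`, `LocationCache`, the tablet and server names); it is a minimal sketch of "ask once, reuse, refresh when stale", not the real Kudu client.

```python
class FakeMaster:
    """Hypothetical stand-in for the Master's tablet-location service."""

    def __init__(self, leaders):
        self.leaders = dict(leaders)
        self.lookups = 0  # how many times clients had to ask the Master

    def locate(self, tablet_id):
        self.lookups += 1
        return self.leaders[tablet_id]


class LocationCache:
    """Client-side cache: tablet id -> address of the leader replica."""

    def __init__(self, master):
        self.master = master
        self._cache = {}

    def leader_for(self, tablet_id):
        if tablet_id not in self._cache:
            self._cache[tablet_id] = self.master.locate(tablet_id)
        return self._cache[tablet_id]

    def invalidate(self, tablet_id):
        # Called when a request fails because the cached leader is stale.
        self._cache.pop(tablet_id, None)


master = FakeMaster({"tablet-1": "tserver-a"})
cache = LocationCache(master)
cache.leader_for("tablet-1")               # first call asks the Master
cache.leader_for("tablet-1")               # second call is served from the cache
master.leaders["tablet-1"] = "tserver-b"   # leadership moves
cache.invalidate("tablet-1")
new_leader = cache.leader_for("tablet-1")  # refreshed from the Master
```

In this model the Master is contacted twice in total, regardless of how many reads and writes the client issues between leader changes.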

Table and Partition Design

Kudu tables have a strong schema and columnar storage, unlike schema-less HBase tables. When creating a table, each column's type must be declared, which lets Kudu apply encoding and compression suited to that type. A primary key of one or more columns is required; rows are stored in primary-key order, so the key acts as a clustered index.

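Tables can be partitioned by hash (similar to Cassandra), by range (similar to HBase), or by a combination of the two. Each partition maps to a tablet; when both levels are used, every combination of a hash bucket and a range partition becomes its own tablet, which gives flexible data distribution and helps avoid hotspots. In the Impala statement below, 4 hash buckets and 5 range partitions yield 20 tablets.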

CREATE TABLE tmp.metrics (
    host STRING NOT NULL,
    metric STRING NOT NULL,
    time INT NOT NULL,
    value1 DOUBLE NOT NULL,
    value2 STRING,
    PRIMARY KEY (host, metric, time)
)
PARTITION BY HASH (host, metric) PARTITIONS 4,
RANGE (time) (
    PARTITION VALUES < 20140101,
    PARTITION 20140101 <= VALUES < 20150101,
    PARTITION 20150101 <= VALUES < 20160101,
    PARTITION 20160101 <= VALUES < 20170101,
    PARTITION 20170101 <= VALUES
)
STORED AS KUDU;
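To make the two-level mapping concrete, the sketch below assigns a row of tmp.metrics to a (hash bucket, range partition) pair. It is only an approximation: `zlib.crc32` stands in for Kudu's internal hash function, which is an assumption of this example, and the helper names are hypothetical.

```python
import bisect
import zlib

HASH_BUCKETS = 4
# Lower bounds of the 2nd..5th range partitions from the DDL above.
RANGE_BOUNDS = [20140101, 20150101, 20160101, 20170101]


def hash_bucket(host: str, metric: str) -> int:
    """Bucket for the hashed columns (crc32 is a stand-in, not Kudu's hash)."""
    key = f"{host}\x00{metric}".encode("utf-8")
    return zlib.crc32(key) % HASH_BUCKETS


def range_partition(time: int) -> int:
    """Index of the range partition whose bounds contain `time`."""
    return bisect.bisect_right(RANGE_BOUNDS, time)


def tablet_for(host: str, metric: str, time: int) -> tuple:
    return hash_bucket(host, metric), range_partition(time)


TOTAL_TABLETS = HASH_BUCKETS * (len(RANGE_BOUNDS) + 1)
print(TOTAL_TABLETS, tablet_for("host1", "cpu", 20160315))
```

Rows for the same host and metric always land in the same hash bucket but are spread over the range partitions by time, so a scan restricted to one year touches only the tablets for that range.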

Underlying Storage Design Details

Each tablet is split into multiple RowSets: an in-memory RowSet (the MemRowSet) that absorbs newly inserted rows, and on-disk DiskRowSets. The MemRowSet is a B+-tree keyed by the primary key. An update does not overwrite the existing node; instead it is appended to that row's chain of MVCC mutations. When the MemRowSet reaches a size limit (default 32 MB), it is flushed to disk and becomes a DiskRowSet.
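The idea of "append a mutation instead of overwriting" can be sketched as follows. This is a simplified model, not Kudu's implementation: a sorted key list maintained with `bisect` replaces the B+-tree, timestamps are plain integers, and the class and method names are hypothetical.

```python
import bisect


class MemRowSetSketch:
    """Toy MemRowSet: rows kept in key order, updates kept as MVCC chains."""

    def __init__(self):
        self._keys = []  # sorted primary keys (stand-in for the B+-tree)
        self._rows = {}  # key -> (insert_ts, base_row, [(ts, changes), ...])

    def insert(self, key, row, ts):
        if key in self._rows:
            raise KeyError(f"duplicate primary key: {key!r}")
        bisect.insort(self._keys, key)
        self._rows[key] = (ts, dict(row), [])

    def update(self, key, changes, ts):
        # Append to the mutation chain; the base row is never modified.
        # Mutations are assumed to arrive in increasing timestamp order.
        self._rows[key][2].append((ts, dict(changes)))

    def read(self, key, snapshot_ts):
        entry = self._rows.get(key)
        if entry is None or entry[0] > snapshot_ts:
            return None  # row not visible in this snapshot
        _, base, mutations = entry
        row = dict(base)
        for ts, changes in mutations:
            if ts <= snapshot_ts:
                row.update(changes)
        return row


mrs = MemRowSetSketch()
key = ("host1", "cpu", 20160101)
mrs.insert(key, {"value1": 0.5, "value2": None}, ts=1)
mrs.update(key, {"value1": 0.9}, ts=5)
print(mrs.read(key, snapshot_ts=3), mrs.read(key, snapshot_ts=5))
```

Because the base row is never touched, a reader with an older snapshot timestamp still sees the value that was current at that time.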

A DiskRowSet stores its base data (BaseData) in columnar files called CFiles. Subsequent modifications to those rows are first written to an in-memory DeltaMemStore, which is later flushed to RedoFiles (comparable to redo logs). An UndoFile records how to roll rows back to their state before the last flush, which is what enables time-travel queries.
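One way to picture how base data, redo deltas, and undo deltas cooperate on a read is the sketch below. It models a single row with integer timestamps and ignores inserts and deletes; the function name and the record layout are assumptions made for this example.

```python
def read_row(base, base_ts, undo, redo, snapshot_ts):
    """Reconstruct one row as of `snapshot_ts`.

    base     -- column values as written at flush time `base_ts`
    redo     -- [(ts, changes)] applied after the flush, oldest first
    undo     -- [(ts, old_values)] reversing changes made before the flush,
                newest first; each record restores the values from before `ts`
    """
    row = dict(base)
    if snapshot_ts >= base_ts:
        # Roll forward: apply every redo delta visible to the snapshot.
        for ts, changes in redo:
            if ts <= snapshot_ts:
                row.update(changes)
    else:
        # Roll back: undo every change newer than the snapshot.
        for ts, old_values in undo:
            if ts > snapshot_ts:
                row.update(old_values)
    return row


base = {"value1": 3.0}                             # state at flush time, ts=10
undo = [(9, {"value1": 2.0}), (6, {"value1": 1.0})]
redo = [(12, {"value1": 4.0}), (15, {"value1": 5.0})]

print(read_row(base, 10, undo, redo, snapshot_ts=7))
print(read_row(base, 10, undo, redo, snapshot_ts=13))
```

A current read only rolls forward through redo deltas; undo deltas are consulted only when a query asks for a snapshot older than the base data.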

Compaction takes several forms. Minor compaction merges RedoFiles with one another, so fewer delta files have to be consulted on a read. Major compaction rewrites BaseData with all accumulated changes applied. RowSet compaction merges multiple DiskRowSets into a single one, reducing key range overlap and improving storage efficiency.
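The difference between the two delta compactions can be sketched in a few lines. This is a deliberately reduced model: delta files are lists of hypothetical `(ts, key, changes)` tuples, and the undo records that a real major compaction would also have to produce are left out.

```python
def minor_compact(redo_files):
    """Merge several redo delta files into one, ordered by timestamp."""
    merged = [delta for redo_file in redo_files for delta in redo_file]
    merged.sort(key=lambda delta: delta[0])
    return merged


def major_compact(base, redo):
    """Return new base data with all redo deltas applied; `base` is untouched."""
    new_base = {key: dict(row) for key, row in base.items()}
    for _, key, changes in sorted(redo, key=lambda delta: delta[0]):
        new_base[key].update(changes)
    return new_base


base = {"k1": {"v": 1}, "k2": {"v": 10}}
redo_files = [
    [(1, "k1", {"v": 2})],
    [(2, "k2", {"v": 11}), (3, "k1", {"v": 3})],
]

merged = minor_compact(redo_files)
new_base = major_compact(base, merged)
print(len(merged), new_base)
```

Minor compaction leaves the base data alone and only reduces the number of delta files, while major compaction pays the cost of rewriting base data so that later reads need no delta application at all.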

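Because the key ranges of different DiskRowSets may overlap, a given key can fall inside more than one of them. To find the candidates quickly, Kudu maintains an interval-tree index (similar in spirit to a segment tree) over the min-max key range of each RowSet, so a lookup costs roughly O(log n) plus the number of matching RowSets instead of a scan over all of them.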
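A minimal centered interval tree illustrates the lookup. It is a sketch under assumptions: the class name and the rowset ids are invented, keys are plain strings, and the per-node list is scanned linearly, so it does not reach the asymptotic bound of a production implementation.

```python
class IntervalNode:
    """Centered interval tree over (min_key, max_key, rowset_id) triples."""

    def __init__(self, intervals):
        points = sorted(p for lo, hi, _ in intervals for p in (lo, hi))
        self.center = points[len(points) // 2]
        # Intervals containing the center stay here; the rest go left or right.
        self.overlapping = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, key):
        """Ids of all RowSets whose key range contains `key`."""
        hits = [rid for lo, hi, rid in self.overlapping if lo <= key <= hi]
        if key < self.center and self.left:
            hits += self.left.query(key)
        elif key > self.center and self.right:
            hits += self.right.query(key)
        return hits


rowsets = [("a", "f", "rs1"), ("d", "m", "rs2"), ("n", "z", "rs3")]
index = IntervalNode(rowsets)
print(sorted(index.query("e")), index.query("p"))
```

The key "e" falls in two overlapping RowSets, so both must be checked, while "p" needs only one; RowSet compaction reduces this overlap and therefore the number of RowSets a lookup has to probe.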

Conclusion

Kudu’s design combines OLTP‑style low‑latency writes with OLAP‑style columnar scans, leveraging Raft for consensus, flexible partitioning, and sophisticated storage layers (MemRowSet, DiskRowSet, CFile, Redo/Undo files) to achieve strong consistency and high performance for fast‑changing data.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
