Kudu Overview: Architecture, Features, and Use Cases
Kudu is an open‑source columnar storage engine from Cloudera that combines high‑throughput batch processing with low‑latency random reads, offering features such as C++/Java APIs, Raft‑based replication, flexible consistency, partitioning, and integration with Hadoop, Spark, Impala, and other ecosystem components.
Kudu is a columnar storage engine open-sourced by Cloudera. It is designed to bridge the gap between high‑throughput batch processing (the strength of HDFS) and low‑latency random access (the strength of HBase), so that one system can serve both OLAP‑style scans and OLTP‑style point reads and updates.
Key Features include an implementation in C++ with client APIs for Java and C++, efficient handling of OLAP workloads, integration with MapReduce, Spark, and other Hadoop components, Impala compatibility, flexible consistency models, strong performance when sequential and random workloads run at the same time, high availability via the Raft consensus protocol, and a structured data model.
Typical Use Cases involve real‑time data updates, time‑series analytics requiring massive historical scans and fast point‑lookups, and streaming analytics where periodic model updates are needed.
Architecture follows a master‑tablet server (M‑S) model. A Kudu cluster consists of one or more Master nodes managing metadata and multiple Tablet Servers storing data. Masters use Raft for metadata replication, while Tablet Servers host tablets (data partitions) with leader‑follower replication.
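Because both Masters and tablet replicas rely on Raft, availability comes down to whether a majority of replicas is still reachable. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of any Kudu API.

```python
def raft_majority(replicas):
    """Smallest number of replicas that forms a majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """Replicas that can be lost while a majority remains."""
    return replicas - raft_majority(replicas)


for n in (1, 3, 5):
    print(f"replicas={n} majority={raft_majority(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

Note that an even replica count adds no fault tolerance over the next lower odd count (4 replicas tolerate one failure, the same as 3), which is why odd replica counts are the usual choice.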
Core Concepts include tables, tablets, rowsets, and the distinction between MemRowSet (in‑memory writes) and DiskRowSet (flushed to disk). Data is stored as BaseData plus Delta files (UNDO and REDO) to support versioning and efficient updates.
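The write path described above can be illustrated with a toy model. This is a simplified sketch, not Kudu's implementation: the class and method names are hypothetical, UNDO deltas and compaction are left out, and updates to rows still in memory are applied in place rather than tracked as mutations.

```python
class ToyTablet:
    """Toy model of one tablet: a MemRowSet plus flushed DiskRowSets."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.mem_rowset = {}    # primary key -> row (dict of column -> value)
        self.disk_rowsets = []  # each: {"base": {...}, "redo": {key: [changes]}}

    def _disk_rowset_for(self, key):
        for rowset in self.disk_rowsets:
            if key in rowset["base"]:
                return rowset
        return None

    def insert(self, key, row):
        # Inserts always land in the in-memory rowset first.
        if key in self.mem_rowset or self._disk_rowset_for(key) is not None:
            raise KeyError(f"duplicate primary key: {key!r}")
        self.mem_rowset[key] = dict(row)
        if len(self.mem_rowset) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The MemRowSet becomes the immutable base data of a new DiskRowSet.
        if self.mem_rowset:
            self.disk_rowsets.append({"base": self.mem_rowset, "redo": {}})
            self.mem_rowset = {}

    def update(self, key, changes):
        if key in self.mem_rowset:
            self.mem_rowset[key].update(changes)
            return
        rowset = self._disk_rowset_for(key)
        if rowset is None:
            raise KeyError(key)
        # Flushed base data is never rewritten; the change is logged as a REDO delta.
        rowset["redo"].setdefault(key, []).append(dict(changes))

    def read(self, key):
        if key in self.mem_rowset:
            return dict(self.mem_rowset[key])
        rowset = self._disk_rowset_for(key)
        if rowset is None:
            return None
        row = dict(rowset["base"][key])
        for delta in rowset["redo"].get(key, []):  # replay deltas in order
            row.update(delta)
        return row


tablet = ToyTablet(flush_threshold=2)
tablet.insert(1, {"v": "a"})
tablet.insert(2, {"v": "b"})  # reaches the threshold and triggers a flush
tablet.update(1, {"v": "c"})  # recorded as a REDO delta, base stays unchanged
print(tablet.read(1))
```

The point of the split is that flushed base data stays immutable and columnar, while updates accumulate separately and are merged in at read time.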
Schema and Primary Keys require explicit column types and a non‑nullable primary key; primary keys are immutable, and secondary indexes are not supported. Column types support encoding (Plain, Bitshuffle, Run‑Length, Dictionary, Prefix) and compression (LZ4, Snappy, Zlib).
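To give a feel for why per-column encodings help, here is a minimal sketch of two of the listed schemes, run-length and dictionary encoding, on plain Python lists. It illustrates the general techniques only and makes no claim about Kudu's on-disk format; the function names are illustrative.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    codes_by_value = {}
    codes = []
    for value in values:
        # A new value gets the next unused code; a seen value reuses its code.
        codes.append(codes_by_value.setdefault(value, len(codes_by_value)))
    return list(codes_by_value), codes


print(run_length_encode([1, 1, 1, 0, 0]))
print(dictionary_encode(["cn", "us", "cn", "cn"]))
```

Run-length encoding pays off on columns with long runs of identical values, while dictionary encoding suits low-cardinality columns such as country codes.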
Partitioning can be range, hash, or multi‑level combinations. Range partitions are useful for time‑series data, while hash partitions help distribute write load. Proper partition design mitigates hotspots and improves query performance.
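A multi-level scheme can be sketched as a function from a row's key columns to a (hash bucket, range index) pair. The example below is a toy model under stated assumptions: `zlib.crc32` stands in for whatever hash function Kudu actually uses, the `host` and `ts` columns are hypothetical, and range bounds are treated as lower-inclusive, upper-exclusive.

```python
import bisect
import zlib


def hash_bucket(key, num_buckets):
    """Map a string key to a bucket (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_index(ts, split_points):
    """Index of the range partition holding ts; split_points must be sorted."""
    return bisect.bisect_right(split_points, ts)


def tablet_for(host, ts, num_buckets, split_points):
    """Combine hash partitioning on host with range partitioning on ts."""
    return (hash_bucket(host, num_buckets), range_index(ts, split_points))


splits = [100, 200]  # three ranges: ts < 100, 100 <= ts < 200, ts >= 200
for host in ("web-1", "web-2", "web-3"):
    print(host, tablet_for(host, 150, 4, splits))
```

With range partitioning alone, all rows for the current time window would land in one tablet. Adding the hash level spreads those writes across `num_buckets` tablets while still letting time-bounded scans skip ranges outside the window.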
Limitations include a default maximum of 300 columns per table, a 64 KB limit on the size of an individual cell, and the inability to change primary key columns or column types after a table has been created.
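Limits like these are cheap to check before a table design reaches the cluster. The following is a minimal sketch of such a pre-flight check, using the figures quoted above as constants; the helper names are hypothetical and nothing here calls Kudu itself.

```python
MAX_COLUMNS = 300            # default column limit quoted above
MAX_CELL_BYTES = 64 * 1024   # 64 KB per-cell limit quoted above


def check_schema(column_names, primary_key):
    """Return a list of problems found in a proposed table design."""
    problems = []
    if len(column_names) > MAX_COLUMNS:
        problems.append(
            f"{len(column_names)} columns exceeds the limit of {MAX_COLUMNS}")
    if not primary_key:
        problems.append("a primary key is required")
    missing = [name for name in primary_key if name not in column_names]
    if missing:
        problems.append(f"primary key columns not in schema: {missing}")
    return problems


def cell_fits(value):
    """True if an encoded cell value (bytes) is within the size limit."""
    return len(value) <= MAX_CELL_BYTES


print(check_schema(["host", "ts", "cpu"], ["host", "ts"]))
print(check_schema([f"c{i}" for i in range(301)], []))
```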
Kudu vs. HBase – Kudu stores data column‑wise with its own storage layer, whereas HBase relies on HDFS and stores data in column families. Kudu provides stronger consistency via Raft, avoids HDFS latency for small reads, and separates insert and update paths, leading to different performance characteristics.
Production Practices at NetEase illustrate Kudu’s role in real‑time data pipelines, dimension‑table joins, and as a replacement for Oracle in ETL workloads. The challenges encountered there include load imbalance, schema complexity, and the lack of secondary indexes; ongoing enhancements include Bloom filters and flexible hash bucket sizing.
Performance Tuning covers hardware recommendations (NVMe SSDs for WAL, multiple SSDs for data), OS settings (file descriptors, swap), and Kudu configuration parameters (memory limits, block cache, disk reservation, maintenance threads).
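The configuration side of this can be sketched as a small helper that renders a tablet server flag file. The mapping from the tuning areas above to specific flags (`--fs_wal_dir`, `--fs_data_dirs`, `--memory_limit_hard_bytes`, `--block_cache_capacity_mb`, `--fs_data_dirs_reserved_bytes`, `--maintenance_manager_num_threads`) is an assumption on our part, and every path and value shown is a placeholder rather than a recommendation from the source.

```python
def render_flagfile(wal_dir, data_dirs, memory_limit_bytes,
                    block_cache_mb, reserved_bytes, maintenance_threads):
    """Render tablet server settings as gflags-style flag file text."""
    flags = {
        "fs_wal_dir": wal_dir,                       # WAL on its own fast device
        "fs_data_dirs": ",".join(data_dirs),         # one entry per data disk
        "memory_limit_hard_bytes": memory_limit_bytes,
        "block_cache_capacity_mb": block_cache_mb,
        "fs_data_dirs_reserved_bytes": reserved_bytes,
        "maintenance_manager_num_threads": maintenance_threads,
    }
    return "\n".join(f"--{name}={value}" for name, value in flags.items())


# Placeholder values for illustration only.
print(render_flagfile(
    wal_dir="/nvme0/kudu/wal",
    data_dirs=["/ssd1/kudu/data", "/ssd2/kudu/data"],
    memory_limit_bytes=32 * 1024**3,
    block_cache_mb=4096,
    reserved_bytes=10 * 1024**3,
    maintenance_threads=4,
))
```

Generating the file from code keeps per-host differences, such as the number of data disks, in one place; the right values depend on the hardware and workload.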
Future Directions include adding Bloom filters, dynamic hash bucket adjustment, multi‑row transactions, and support for tables without primary keys.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
