HBase vs Kudu vs ClickHouse: Architecture, Deployment, and Operations Compared

This article provides a side‑by‑side technical comparison of HBase, Kudu, and ClickHouse, covering their installation dependencies, architectural designs, read/write workflows, query capabilities, real‑world use cases at Didi, NetEase, and Ctrip, and practical operational tips.

dbaplus Community

Jun 22, 2021

1. Introduction

The Hadoop ecosystem offers many storage options. HDFS remains the core store for raw data, while HBase is the NoSQL component that provides massive storage with random reads and writes. ClickHouse is a column-oriented OLAP DBMS that supports real-time SQL analytics. Apache Kudu, originally developed at Cloudera and released as version 1.0 in 2016, combines random reads and writes with SQL analytics, filling the gap between HDFS and HBase.

2. Installation and Deployment Comparison

Rather than detailing full installation steps, the article highlights external component dependencies:

HBase depends on HDFS for storage and on ZooKeeper for coordination and metadata.

Kudu relies on Impala for analytical queries and can optionally be managed through a CDH cluster.

ClickHouse depends on ZooKeeper, which serves as its metadata store, replication log service, and table catalog service.

3. Architecture Comparison

HBase Architecture

Kudu Architecture

ClickHouse Architecture

HBase and Kudu follow a master-slave model, whereas ClickHouse operates in a multi-master mode in which every server is equal. Both HBase and ClickHouse use ZooKeeper for auxiliary metadata, while Kudu's metadata is managed by its own master.

4. Basic Operations Comparison

Data Read/Write

HBase

Read flow: (see diagram)

Write flow: (see diagram)

Kudu

ClickHouse

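ClickHouse is an analytical database in which data is generally treated as immutable, so standard UPDATE and DELETE statements are only weakly supported. Updates and deletes are instead issued as ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE statements, known as “mutations”. Mutations are asynchronous: the server returns as soon as the mutation is queued, and the affected data parts are rewritten in the background.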

Key mutation characteristics:

Cannot update primary‑key or partition‑key columns.

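Mutations are not atomic: each data part is swapped in as soon as it has been rewritten, so a query running during a mutation can see a mix of mutated and unmutated parts.

Mutations are applied in submission order and cannot be rolled back once submitted; a stuck mutation can only be cancelled with KILL MUTATION.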

Completed mutation entries are retained based on the finished_mutations_to_keep setting.
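The mutation rules above can be made concrete with a small sketch. The helper names below are hypothetical and exist only for illustration; the generated SQL follows ClickHouse's ALTER TABLE ... UPDATE, system.mutations, and KILL MUTATION syntax. The sketch assumes trusted inputs (it formats strings directly and does no escaping), so it is not meant as production code.

```python
def build_update_mutation(table, assignments, where, key_columns):
    """Return an ALTER TABLE ... UPDATE statement.

    Rejects assignments to primary-key or partition-key columns,
    which ClickHouse does not allow in a mutation.
    """
    blocked = sorted(set(assignments) & set(key_columns))
    if blocked:
        raise ValueError("cannot update key columns: " + ", ".join(blocked))
    set_clause = ", ".join(f"{col} = {expr}" for col, expr in assignments.items())
    return f"ALTER TABLE {table} UPDATE {set_clause} WHERE {where}"


def build_pending_mutations_query(database, table):
    """Return a query listing unfinished mutations in creation order."""
    return (
        "SELECT mutation_id, command, is_done FROM system.mutations "
        f"WHERE database = '{database}' AND table = '{table}' AND is_done = 0 "
        "ORDER BY create_time"
    )


def build_kill_mutation(database, table, mutation_id):
    """Return a KILL MUTATION statement for one stuck mutation."""
    return (
        f"KILL MUTATION WHERE database = '{database}' "
        f"AND table = '{table}' AND mutation_id = '{mutation_id}'"
    )
```

Because the ALTER statement returns before the data is rewritten, a caller would typically poll system.mutations until is_done becomes 1 before relying on the updated values.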

Data Query

HBase

HBase has no native SQL interface; SQL access requires the Apache Phoenix layer. Full table scans are discouraged because of their impact on the cluster.

Kudu

Queries are executed through Impala integration.

ClickHouse

Provides excellent query performance for columnar data; queries typically aggregate over large data sets.

5. HBase Use Cases at Didi

Didi stores four main data types in HBase:

Statistical/report data (small volume, high flexibility, moderate latency).

Raw fact data such as orders, GPS traces (large volume, high consistency, low latency).

Intermediate results for model training (large volume, high throughput).

Backup data for disaster recovery.

Key scenarios include:

Real‑time order lifecycle queries for customer service.

Historical order detail queries when Redis is unavailable.

Offline order status analysis.

Workload profile: write throughput of 10 K events/s and read throughput of 1 K events/s, with latency ≤ 5 s.

RowKey design examples:

Order status table

RowKey = reverse(order_id) + (MAX_LONG - timestamp)

Order history table

RowKey = reverse(passenger_id|driver_id) + (MAX_LONG - timestamp)
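A minimal sketch of the order status RowKey formula, assuming order IDs are fixed-width decimal strings and timestamps are epoch milliseconds (neither is stated in the source). HBase sorts rows by the raw bytes of the key, so reversing the ID spreads sequential IDs across the key space, and storing MAX_LONG - timestamp makes newer records sort first within one order. The function name is hypothetical.

```python
import struct

MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def order_status_rowkey(order_id: str, timestamp_ms: int) -> bytes:
    """Build reverse(order_id) + (MAX_LONG - timestamp) as a byte string."""
    reversed_id = order_id[::-1].encode("ascii")
    # Big-endian 8-byte encoding keeps numeric order equal to byte order
    # for non-negative values.
    inverted_ts = struct.pack(">q", MAX_LONG - timestamp_ms)
    return reversed_id + inverted_ts
```

With this layout, a scan that starts at the reversed order ID returns that order's most recent status first, so the latest state can be read with a scan limited to one row.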

Geo‑hash based RowKey is used for efficient geographic queries, turning HBase into a MongoDB‑like geo‑index.
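To illustrate how a geohash can act as a RowKey prefix, here is a sketch of the standard geohash encoding (longitude and latitude bits interleaved, five bits per base-32 character). How Didi actually composes its keys is not described in the source; the function name and default precision are assumptions for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

For example, geohash_encode(57.64911, 10.40744, 5) yields "u4pru". Points in the same cell share a prefix, so a RowKey that begins with the geohash lets a prefix scan fetch everything in that cell. Points on opposite sides of a cell boundary can have different prefixes, so a real query usually scans the neighbouring cells as well.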

ETA (estimated time of arrival) service uses HBase as a key‑value cache to provide real‑time ETA calculations, reducing model training time and enabling multi‑city parallelism.

6. Kudu Real‑Time Data Warehouse at NetEase

NetEase leverages Kudu for a real‑time traffic data warehouse. Data ingestion pipeline:

Consume Kafka offsets.

Create KuduContext.

Define Kudu table schema.

Parse traffic logs into a DataFrame.

Upsert DataFrame into Kudu and commit offsets.

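import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// ssc, topics, kafkaParams, kuduMaster, schema, and processFlowLine are defined elsewhere in the job.
private val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams)
)

stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges  // offsets of this batch
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)
    // Parse each log line into a Row and drop rows whose first column is null.
    val flowDf = spark.createDataFrame(rdd.map(r => processFlowLine(r.value))
        .filter(row => row.get(0) != null), schema)
    // Upsert into Kudu first, then commit the offsets back to Kafka.
    kuduContext.upsertRows(flowDf, "impala::kaola_kudu_internal.dwd_kl_flw_app_rt")
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}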
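The ordering in the pipeline (upsert first, commit offsets afterwards) means a batch can be redelivered if the job fails between the two steps. The following toy simulation, which uses an in-memory dict in place of a Kudu table and hypothetical function names, sketches why an upsert keyed on the primary key makes such a replay harmless. It is an illustration of the general pattern, not a description of NetEase's actual failure handling.

```python
def upsert_rows(table: dict, rows: list, key: str) -> None:
    """Insert or overwrite each row by primary key, like an upsert."""
    for row in rows:
        table[row[key]] = row


def process_batch(table, batch, committed_offsets, fail_before_commit=False):
    """Write the batch, then record its end offset.

    fail_before_commit simulates a crash after the write but before the
    offset commit, in which case the batch will be delivered again.
    """
    upsert_rows(table, batch["rows"], key="id")
    if fail_before_commit:
        return False
    committed_offsets[batch["partition"]] = batch["until_offset"]
    return True
```

Replaying the same batch after a simulated failure leaves the table with one row per key, whereas a plain insert would either fail on the duplicate keys or require deduplication.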

Performance test shows 75 % of tasks finish within 1 s, with overall latency under 2 s. Increasing spark.streaming.concurrentJobs improves parallelism.

7. ClickHouse Log Analysis at Ctrip

Ctrip migrated log analysis from Elasticsearch to ClickHouse. Logs are pre‑formatted as JSON, matching ClickHouse table schemas. Key practices:

Round‑robin writes across the ClickHouse cluster to balance load.

Write in large batches at low frequency to reduce the part count and avoid the “Too many parts” error.

Write to local tables rather than distributed tables to minimize network traffic and merge overhead.

Partition by day; avoid finer, timestamp-based partitions that produce excessive parts.
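The first three practices can be combined in a small writer sketch: rows are buffered, flushed as one large batch, and each batch goes to the next server's local table in turn. The class name, the `send` callable, and the default batch size are assumptions made for illustration; the source does not describe Ctrip's writer implementation.

```python
from itertools import cycle


class RoundRobinBatchWriter:
    """Buffer rows and send each full batch to the next host in turn."""

    def __init__(self, hosts, send, batch_size=10000):
        self._hosts = cycle(hosts)      # endless round-robin over the servers
        self._send = send               # hypothetical transport: send(host, rows)
        self._batch_size = batch_size
        self._buffer = []

    def write(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered as one batch; no-op when empty."""
        if not self._buffer:
            return
        host = next(self._hosts)
        self._send(host, self._buffer)
        self._buffer = []
```

A production writer would usually also flush on a timer so that a slow log source does not hold rows in memory indefinitely.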

Query optimization includes a two-step query for Kibana Table panels: first estimate the data volume in the requested time range, then fetch the detailed rows for an adjusted (narrower) range. This cuts query time to roughly 1/60 and the data scanned to roughly 1/120 of the original.
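One way to picture the two-step approach: the first query returns cheap row counts per time bucket, and the range for the detail query is shrunk to the newest buckets that already contain enough rows for the panel. The sketch below assumes that bucketed counts are available and that the panel shows the newest rows first; both are assumptions, as is the function name.

```python
def narrow_time_range(bucket_counts, limit):
    """Pick the newest sub-range that already holds `limit` rows.

    bucket_counts: list of (bucket_start, bucket_end, row_count), newest
    first, as a first-step count grouped by time bucket might return.
    Returns (start, end) for the detail query, or None if there is no data.
    """
    if not bucket_counts:
        return None
    end = bucket_counts[0][1]
    start = bucket_counts[0][0]
    total = 0
    for bucket_start, _bucket_end, row_count in bucket_counts:
        start = bucket_start
        total += row_count
        if total >= limit:
            break  # enough rows; older buckets need not be scanned
    return start, end
```

If the requested range holds fewer rows than the limit, the function falls back to the full range, so the detail query still returns everything available.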

Operational tips for ClickHouse:

Onboarding new logs and tuning ingestion performance.

Scheduled partition cleanup for expired logs.

Monitoring via ClickHouse‑exporter, VictoriaMetrics, and Grafana.

Data migration using ClickHouse‑copier or distributed tables.

Handling slow queries with KILL QUERY and addressing “Too many parts” by adjusting merge settings, write patterns, and partition strategies.
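As an illustration of the scheduled partition cleanup, the sketch below generates DROP PARTITION statements for daily partitions that are past their retention window. It assumes partition IDs of the form YYYYMMDD (as listed in system.parts for a table partitioned by day); the function name and the retention parameter are hypothetical.

```python
from datetime import date, timedelta


def expired_partition_statements(table, partition_ids, today, retention_days):
    """Return one ALTER TABLE ... DROP PARTITION per expired daily partition.

    partition_ids: iterable of 'YYYYMMDD' strings; fixed-width dates compare
    correctly as plain strings.
    """
    cutoff = (today - timedelta(days=retention_days)).strftime("%Y%m%d")
    return [
        f"ALTER TABLE {table} DROP PARTITION ID '{pid}'"
        for pid in sorted(partition_ids)
        if pid < cutoff
    ]
```

A scheduler such as cron could run this once a day, read the current partition IDs from system.parts, and execute the returned statements.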

8. Summary

HBase and Kudu share a master-slave architecture. Kudu inherits many design aspects from HBase, offering row-level insert/update/delete APIs together with scan performance close to that of Parquet. ClickHouse excels at query speed for analytical workloads but lacks robust update/delete capabilities. The article's comparison table (image) summarizes strengths and trade-offs across architecture, data model, read/write patterns, and operational considerations.
