HBase vs Kudu vs ClickHouse: Architecture, Deployment, and Operations Compared
This article provides a side‑by‑side technical comparison of HBase, Kudu, and ClickHouse, covering their installation dependencies, architectural designs, read/write workflows, query capabilities, real‑world use cases at Didi, NetEase, and Ctrip, and practical operational tips.
1. Introduction
Hadoop’s ecosystem offers many storage options. HDFS remains the core for raw data, while HBase serves as a NoSQL component with massive storage and random read/write capabilities. ClickHouse is a column‑oriented OLAP DBMS that supports real‑time SQL analytics, and Apache Kudu, released by Cloudera in 2016, combines random read/write with SQL analytics, addressing gaps in HDFS and HBase.
2. Installation and Deployment Comparison
Rather than detailing full installation steps, the article highlights external component dependencies:
HBase depends on HDFS for storage and ZooKeeper for coordination and cluster metadata.
Kudu relies on Impala for analytical queries and optionally on a CDH cluster for management.
ClickHouse depends on ZooKeeper, which holds the replication log and the table metadata used to coordinate replicas.
3. Architecture Comparison
HBase Architecture
Kudu Architecture
ClickHouse Architecture
HBase and Kudu follow a master-slave model, whereas ClickHouse operates in a multi-master mode in which every server is equal. Both HBase and ClickHouse rely on ZooKeeper for auxiliary metadata, while Kudu's metadata is managed by its own master.
4. Basic Operations Comparison
Data Read/Write
HBase
Read flow: (see diagram)
Write flow: (see diagram)
Kudu
ClickHouse
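ClickHouse is an analytical database whose data is generally treated as immutable, so standard UPDATE and DELETE are only weakly supported. Changes are made through ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE statements, called "mutations". They are asynchronous: the server returns as soon as the mutation is queued, and the affected data parts are rewritten in the background.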
Key mutation characteristics:
Cannot update primary‑key or partition‑key columns.
Mutations are not atomic across partitions.
Execution follows submission order and cannot be cancelled except via KILL MUTATION.
Completed mutation entries are retained based on the finished_mutations_to_keep setting.
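As an illustration of the first rule above, here is a minimal Scala sketch that builds an ALTER TABLE ... UPDATE mutation and refuses to target key columns before the statement ever reaches the server. The object `MutationSketch`, the `TableKeys` type, and the table and column names in the usage note are hypothetical; the key columns are assumed to be known to the caller rather than read from ClickHouse.

```scala
// Hypothetical helper, not part of any ClickHouse client library.
object MutationSketch {
  // Key columns are supplied by the caller (an assumption of this sketch).
  final case class TableKeys(primaryKey: Set[String], partitionKey: Set[String])

  /** Builds the mutation SQL, or returns Left with a reason if it must be rejected. */
  def updateMutation(
      table: String,
      assignments: Seq[(String, String)], // column -> SQL expression
      where: String,
      keys: TableKeys
  ): Either[String, String] = {
    // Columns that belong to the primary key or the partition key cannot be updated.
    val forbidden = assignments.map(_._1).filter(c => keys.primaryKey(c) || keys.partitionKey(c))
    val forbiddenList = forbidden.mkString(", ")
    if (assignments.isEmpty) Left("no columns to update")
    else if (forbidden.nonEmpty) Left("cannot update key columns: " + forbiddenList)
    else {
      val setClause = assignments.map { case (c, e) => s"$c = $e" }.mkString(", ")
      Right(s"ALTER TABLE $table UPDATE $setClause WHERE $where")
    }
  }
}
```

For example, `MutationSketch.updateMutation("logs.app_log", Seq("level" -> "'WARN'"), "level = 'WARNING'", keys)` yields a statement that can be submitted as-is, while an assignment to a key column comes back as a `Left`. Because the mutation runs asynchronously, a successful submission says nothing about completion.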
Data Query
HBase
HBase has no native SQL interface; SQL access requires Apache Phoenix. Full table scans are discouraged because of their impact on the cluster.
Kudu
Queries are executed through Impala integration.
ClickHouse
Provides excellent query performance for columnar data; queries typically aggregate over large data sets.
5. HBase Use Cases at Didi
Didi stores four main data types in HBase:
Statistical/report data (small volume, high flexibility, moderate latency).
Raw fact data such as orders, GPS traces (large volume, high consistency, low latency).
Intermediate results for model training (large volume, high throughput).
Backup data for disaster recovery.
Key scenarios include:
Real‑time order lifecycle queries for customer service.
Historical order detail queries when Redis is unavailable.
Offline order status analysis.
Write throughput of 10 K events/s and read throughput of 1 K events/s with ≤5 s latency.
RowKey design examples:
Order status table
RowKey = reverse(order_id) + (MAX_LONG - timestamp)
Order history table
RowKey = reverse(passenger_id|driver_id) + (MAX_LONG - timestamp)
Geo-hash based RowKeys are used for efficient geographic queries, turning HBase into a MongoDB-like geo-index.
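The reverse(id) + (MAX_LONG - timestamp) pattern used by both tables can be sketched in Scala as follows. Reversing the id spreads sequentially issued ids across regions, and subtracting the timestamp from the maximum long value makes the newest row for an id sort first. The string encoding and the 19-digit zero padding are assumptions of this sketch; the source does not say how Didi encodes the key bytes.

```scala
// Hypothetical sketch of the RowKey layout; the encoding is an assumption.
object RowKeySketch {
  /** reverse(id) followed by the zero-padded value of Long.MaxValue - timestampMillis. */
  def rowKey(id: String, timestampMillis: Long): String = {
    require(timestampMillis >= 0, "timestamp must be non-negative")
    val reversedTs = Long.MaxValue - timestampMillis
    // Pad to 19 digits so that lexicographic order matches numeric order.
    id.reverse + "%019d".format(reversedTs)
  }
}
```

With this layout, a prefix scan on `id.reverse` returns the rows for one order, passenger, or driver with the most recent entry first.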
ETA (estimated time of arrival) service uses HBase as a key‑value cache to provide real‑time ETA calculations, reducing model training time and enabling multi‑city parallelism.
6. Kudu Real‑Time Data Warehouse at NetEase
NetEase leverages Kudu for a real‑time traffic data warehouse. Data ingestion pipeline:
Consume Kafka offsets.
Create KuduContext.
Define Kudu table schema.
Parse traffic logs into a DataFrame.
Upsert DataFrame into Kudu and commit offsets.
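import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
// ssc, topics, kafkaParams, kuduMaster, schema and processFlowLine are defined elsewhere.
private val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
stream.foreachRDD { rdd =>
  // Capture the offset ranges before the RDD is transformed.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)
  // Parse each log line into a Row and drop rows whose first column is null.
  val flowDf = spark.createDataFrame(
    rdd.map(r => processFlowLine(r.value)).filter(row => row.get(0) != null), schema)
  kuduContext.upsertRows(flowDf, "impala::kaola_kudu_internal.dwd_kl_flw_app_rt")
  // Commit offsets only after the upsert has returned.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
Performance tests show that 75 % of tasks finish within 1 s, with overall latency under 2 s. Increasing spark.streaming.concurrentJobs improves parallelism.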
7. ClickHouse Log Analysis at Ctrip
Ctrip migrated log analysis from Elasticsearch to ClickHouse. Logs are pre‑formatted as JSON, matching ClickHouse table schemas. Key practices:
Write to the ClickHouse nodes in round-robin fashion to balance load across the cluster.
Write in large batches at low frequency to reduce the number of parts and avoid the "Too many parts" error.
Write to local tables rather than through distributed tables to minimize network traffic and merge overhead.
Partition at a sensible granularity, such as by day; avoid partitioning on a raw timestamp, which produces an excessive number of parts.
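The first two practices can be combined in a small writer that buffers rows and hands each full batch to the next node in turn. The following Scala sketch is illustrative only: `RoundRobinBatcher` and its `send` callback are hypothetical, and the actual insert into the local table on the chosen host (over HTTP, JDBC, or the native protocol) is left to the caller.

```scala
import scala.collection.mutable.ArrayBuffer

/** Buffers rows and flushes them in large batches, rotating over the given hosts.
  * Hypothetical sketch: not thread-safe, and `send` stands in for the real insert.
  */
final class RoundRobinBatcher[A](hosts: Vector[String], batchSize: Int)(send: (String, Seq[A]) => Unit) {
  require(hosts.nonEmpty, "at least one host is required")
  require(batchSize > 0, "batchSize must be positive")

  private val buffer = ArrayBuffer.empty[A]
  private var next = 0 // index of the host that receives the next batch

  /** Adds one row and flushes automatically once the batch is full. */
  def add(row: A): Unit = {
    buffer += row
    if (buffer.size >= batchSize) flush()
  }

  /** Sends whatever is buffered to the next host; does nothing if the buffer is empty. */
  def flush(): Unit = if (buffer.nonEmpty) {
    val host = hosts(next)
    next = (next + 1) % hosts.size
    send(host, buffer.toList)
    buffer.clear()
  }
}
```

A production writer would presumably also flush on a timer so that a slow log stream is not held back indefinitely, and would retry or fail over when a host is unavailable; both are omitted here.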
Query optimization includes a two-step query for Kibana Table panels: first estimate the data volume within the requested time range, then fetch detailed rows only for the adjusted, narrower range. This cuts query time to roughly 1/60 and the data volume scanned to roughly 1/120 of the original.
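One possible reading of the two-step approach is sketched below in Scala. It assumes that step one is a cheap aggregate returning row counts per time bucket (in ClickHouse, something like a `count()` grouped by `toStartOfMinute` of the timestamp column), and that the panel only needs the newest `limit` rows. The function then walks from the newest bucket backwards until enough rows are covered and returns the narrowed range for the detail query. `TwoStepQuerySketch`, `Bucket`, and `narrowRange` are hypothetical names, and the source does not describe Ctrip's exact algorithm.

```scala
// Hypothetical sketch of the range-narrowing step between the two queries.
object TwoStepQuerySketch {
  /** One row of the step-one aggregate: bucket start (epoch seconds) and row count. */
  final case class Bucket(startEpochSec: Long, count: Long)

  /** Returns the narrowed [from, to) range in epoch seconds, or None if there are no rows. */
  def narrowRange(buckets: Seq[Bucket], bucketSec: Long, limit: Long): Option[(Long, Long)] = {
    val newestFirst = buckets.filter(_.count > 0).sortBy(-_.startEpochSec)
    newestFirst.headOption.map { newest =>
      // Running totals of row counts, starting from the newest bucket.
      val running = newestFirst.scanLeft(0L)(_ + _.count).tail
      // First bucket at which the running total reaches the limit;
      // if the limit is never reached, fall back to the oldest non-empty bucket.
      val idx = running.indexWhere(_ >= limit)
      val oldest = if (idx >= 0) newestFirst(idx) else newestFirst.last
      (oldest.startEpochSec, newest.startEpochSec + bucketSec)
    }
  }
}
```

The detail query is then issued with the returned range in its WHERE clause, so ClickHouse reads only the buckets that can contribute to the rows actually displayed.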
Operational tips for ClickHouse:
New log ingestion and performance tuning.
Scheduled partition cleanup for expired logs.
Monitoring via ClickHouse‑exporter, VictoriaMetrics, and Grafana.
Data migration using ClickHouse‑copier or distributed tables.
Handling slow queries with KILL QUERY and addressing “Too many parts” by adjusting merge settings, write patterns, and partition strategies.
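The scheduled partition cleanup mentioned above can be sketched as a function that turns a list of existing daily partitions into DROP PARTITION statements for those past the retention window. This Scala sketch assumes partitions are named in `yyyyMMdd` form (as a `toYYYYMMDD(...)` partition key would produce) and that the caller has already listed them, for instance from the `system.parts` table. The object name, the table name in the test data, and the retention policy are hypothetical.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical sketch: generates cleanup statements, does not execute them.
object PartitionCleanupSketch {
  private val fmt = DateTimeFormatter.BASIC_ISO_DATE // parses yyyyMMdd

  /** DROP PARTITION statements for daily partitions strictly older than the retention window. */
  def expiredPartitionDrops(
      table: String,
      partitions: Seq[String], // existing partition names, assumed to look like "20240101"
      today: LocalDate,
      retentionDays: Int
  ): Seq[String] = {
    val cutoff = today.minusDays(retentionDays.toLong)
    partitions
      .filter(p => LocalDate.parse(p, fmt).isBefore(cutoff))
      .sorted
      .map(p => s"ALTER TABLE $table DROP PARTITION $p")
  }
}
```

A scheduler would run this once a day per log table and execute the resulting statements on each node that holds the local table. Dropping a whole partition is far cheaper than deleting rows through a mutation, which is one reason daily partitions suit log retention.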
8. Summary
HBase and Kudu share a master‑slave architecture; Kudu inherits many design aspects from HBase but adds row‑level insert/update/delete APIs and near‑Parquet scan performance. ClickHouse excels in query speed for analytical workloads but lacks robust update/delete capabilities. The article’s comparative table (image) summarizes strengths and trade‑offs across architecture, data model, read/write patterns, and operational considerations.
dbaplus Community
