Overview of Apache Druid Architecture and Its Comparison with Other Analytics Systems
This article provides a comprehensive overview of Apache Druid's distributed column‑store architecture, detailing its node types, external dependencies, data flow, and operational mechanisms, and compares Druid's real‑time analytics capabilities with systems such as Impala, Elasticsearch, and Spark.
Druid is an open‑source, distributed, column‑store system designed for real‑time data analysis, offering fast aggregation, flexible filtering, millisecond‑level queries, and low‑latency data ingestion.
Druid is highly available; node failures do not stop the cluster, though state updates may be affected.
Components are loosely coupled, allowing the real‑time node to be omitted if not needed.
Bitmap indexing with the CONCISE compression algorithm dramatically reduces segment size and speeds up queries.
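To make the bitmap-index idea concrete, here is a minimal sketch in Python. It is not Druid's implementation: it uses plain Python integers as uncompressed bitsets and leaves out CONCISE compression entirely. The rows and column names are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(rows, dimension):
    """Map each distinct value of `dimension` to an integer used as a bitset.

    Bit i of the integer is set when row i holds that value.
    """
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[dimension]] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Expand a bitset back into an ascending list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical event rows; the column names are illustrative only.
rows = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "DE", "device": "desktop", "clicks": 1},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "US", "device": "mobile", "clicks": 2},
]

by_country = build_bitmap_index(rows, "country")
by_device = build_bitmap_index(rows, "device")

# The filter `country = US AND device = mobile` becomes one bitwise AND.
selected = by_country["US"] & by_device["mobile"]
total_clicks = sum(rows[i]["clicks"] for i in matching_rows(selected))
```

Because the filter is resolved on the bitmaps first, only the rows that survive the AND are read for aggregation, which is where the query speed-up comes from.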
Architecture
Druid clusters consist of several specialized node types, each handling a specific set of responsibilities, which isolates concerns and simplifies the overall system.
Node interactions are minimal, so communication failures have little impact on data availability.
The cluster composition and data flow are illustrated in Figure 1.
Druid includes five node types: Realtime, Historical, Coordinator, Broker, and Indexer.
Historical node: stores and queries non‑real‑time data, loading segments from Deep Storage and serving Broker queries. It can keep serving queries even if Deep Storage becomes unavailable, provided the segments are already cached locally.
Realtime node: ingests and queries real‑time data, also serving Broker queries. It periodically persists in‑memory indexes to disk and hands completed segments off to Historical nodes.
Coordinator node: the master of the cluster, managing Historical and Realtime nodes via ZooKeeper and tracking segment metadata in MySQL.
Broker node: routes external queries to the appropriate Historical or Realtime nodes, merges partial results, and returns the final answer. It also maintains an LRU cache.
Indexer node: handles data ingestion, loading batch and streaming data, and can modify stored data.
Druid relies on three external services: MySQL (metadata storage), Deep Storage (segment storage, supporting local disk, NFS, HDFS, S3, etc.), and ZooKeeper (cluster state coordination).
Realtime Node
The Realtime node buffers event data in memory, periodically persisting indexed data to disk and eventually uploading immutable segments to Deep Storage (e.g., S3 or HDFS). Queries hit both in‑memory and persisted indexes.
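The buffer, persist, and hand-off cycle described above can be sketched as a toy model. This is an illustration under stated assumptions, not Druid's code: the class name, the `max_rows_in_memory` threshold, and the use of tuples to stand in for on-disk indexes are all hypothetical.

```python
class RealtimeBuffer:
    """Toy model of a realtime node's ingest path (not Druid's actual classes)."""

    def __init__(self, max_rows_in_memory):
        self.max_rows_in_memory = max_rows_in_memory  # hypothetical threshold
        self.in_memory = []   # events not yet persisted
        self.persisted = []   # immutable batches standing in for on-disk indexes

    def ingest(self, event):
        self.in_memory.append(event)
        if len(self.in_memory) >= self.max_rows_in_memory:
            self.persist()

    def persist(self):
        """Freeze the current in-memory rows into an immutable batch."""
        if self.in_memory:
            self.persisted.append(tuple(self.in_memory))
            self.in_memory = []

    def sum_metric(self, metric):
        """Queries read both the persisted batches and the in-memory rows."""
        total = sum(e[metric] for batch in self.persisted for e in batch)
        return total + sum(e[metric] for e in self.in_memory)

    def hand_off(self):
        """Merge everything into one immutable segment bound for Deep Storage."""
        self.persist()
        segment = tuple(e for batch in self.persisted for e in batch)
        self.persisted = []
        return segment


buf = RealtimeBuffer(max_rows_in_memory=2)
for clicks in (1, 2, 3):
    buf.ingest({"clicks": clicks})

# Two events were persisted as one batch; the third is still in memory.
visible_total = buf.sum_metric("clicks")
```

The point of the sketch is that a query sees every ingested event regardless of whether it has been persisted yet, so ingestion latency does not delay visibility.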
Historical Node
Historical nodes follow a shared-nothing architecture, meaning they operate independently without a single point of failure. They load, drop, and process segments announced via ZooKeeper, caching segments locally before serving queries.
Coordinator Node
The Coordinator manages segment distribution, instructs Historical nodes to load new data, unload expired data, replicate segments, and balance load. It uses a multi‑version concurrency control protocol to track immutable segments and performs leader election for high availability.
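As a rough illustration of replication and load balancing, the sketch below places each segment on a fixed number of distinct Historical nodes, always choosing the nodes that currently hold the fewest bytes. The greedy rule, the function name, and the node and segment names are assumptions made for this example; Druid's real balancing logic is more elaborate.

```python
def assign_segments(segments, nodes, replicas):
    """Place each segment on `replicas` distinct nodes, least-loaded first.

    `segments` maps segment id -> size in bytes; `nodes` is a list of names.
    Returns a dict of node name -> sorted list of segment ids.
    """
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    load = {node: 0 for node in nodes}
    placement = {node: [] for node in nodes}
    # Place the largest segments first so that node sizes even out.
    ordered = sorted(segments.items(), key=lambda kv: (-kv[1], kv[0]))
    for seg_id, size in ordered:
        # Choose the nodes holding the fewest bytes; break ties by name.
        targets = sorted(nodes, key=lambda n: (load[n], n))[:replicas]
        for node in targets:
            placement[node].append(seg_id)
            load[node] += size
    return {node: sorted(ids) for node, ids in placement.items()}


# Hypothetical segment sizes and Historical node names.
placement = assign_segments(
    segments={"seg_a": 300, "seg_b": 200, "seg_c": 100},
    nodes=["h1", "h2", "h3"],
    replicas=2,
)
```

With two replicas per segment, losing any single node in this example still leaves one copy of every segment available for queries.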
Broker Node
The Broker routes queries to the appropriate nodes, merges partial results, and caches frequently accessed segment results. Real‑time data is never cached to ensure freshness.
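The merge-and-cache behavior can be sketched as follows. This is a simplified model, not the Broker's actual code: the `LRUCache` class, the `broker_sum` function, and the `fetch` callback that stands in for a call to the node serving a segment are all hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache built on OrderedDict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used


def broker_sum(query_key, segments, cache, fetch):
    """Sum per-segment partial results, caching only historical segments.

    `segments` is a list of (segment_id, is_realtime) pairs.
    """
    total = 0
    for segment_id, is_realtime in segments:
        key = (query_key, segment_id)
        partial = None if is_realtime else cache.get(key)
        if partial is None:
            partial = fetch(segment_id)
            if not is_realtime:
                cache.put(key, partial)
        total += partial
    return total


calls = []
partials = {"hist_1": 10, "hist_2": 20, "rt_1": 5}  # hypothetical results


def fetch(segment_id):
    calls.append(segment_id)
    return partials[segment_id]


cache = LRUCache(capacity=100)
segments = [("hist_1", False), ("hist_2", False), ("rt_1", True)]
first = broker_sum("sum_clicks", segments, cache, fetch)
partials["rt_1"] = 7  # new events arrive on the realtime node
second = broker_sum("sum_clicks", segments, cache, fetch)
```

On the second query only the realtime segment is fetched again, so the result reflects the new events while the historical partials come from the cache.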
Indexer Node
The Indexer service runs distributed, high‑availability indexing tasks that create or replace Druid segments. It follows a master/worker architecture with Overlord, middle‑manager, and Peon components, which can be co‑located or spread across nodes.
ZooKeeper
ZooKeeper is used for leader election (Coordinator and Overlord), segment publication protocols, and task management for the indexing service.
Comparison with Other Systems
Druid vs Impala/Shark
Druid is built for always‑online services, real‑time data ingestion, and slice‑and‑dice ad‑hoc queries. Its columnar storage with compressed bitmap indexes enables faster queries than Impala/Shark, which rely on HDFS‑based caching without the same level of query acceleration.
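For a sense of what a slice-and-dice query looks like, the sketch below builds a Druid native `groupBy` query as a JSON document in Python. The field names follow Druid's native JSON query format; the data source `page_events`, the columns, and the helper function are hypothetical, and the example only constructs the payload without sending it anywhere.

```python
import json


def group_by_query(data_source, interval, dimensions, metric, filters):
    """Build a native groupBy query that sums `metric` per dimension value."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],  # ISO 8601 interval, start/end
        "dimensions": dimensions,
        "filter": {
            "type": "and",
            "fields": [
                {"type": "selector", "dimension": dim, "value": value}
                for dim, value in sorted(filters.items())
            ],
        },
        "aggregations": [
            {"type": "longSum", "name": "total_" + metric, "fieldName": metric}
        ],
    }


# Hypothetical data source and columns: total clicks per device for US traffic.
query = group_by_query(
    data_source="page_events",
    interval="2015-11-01/2015-11-02",
    dimensions=["device"],
    metric="clicks",
    filters={"country": "US"},
)
payload = json.dumps(query)
```

Changing the `dimensions` list or the `filters` mapping is all it takes to slice the same data a different way, which is the ad-hoc pattern the comparison refers to.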
Druid vs Elasticsearch
Elasticsearch provides full‑text search and raw event access, while Druid focuses on OLAP workloads, offering high‑performance aggregation at lower cost and supporting structured event data search.
Druid vs Spark
Spark is a general‑purpose cluster computing framework centered on RDDs for iterative algorithms and machine learning, whereas Druid specializes in low‑latency data ingestion and fast analytical queries for web‑facing dashboards.
Source: http://my.oschina.net/betaoo/blog/530088 (author beta‑o‑.)
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
