
Druid: Architecture, Components, and Use Cases for Real‑Time Analytics

This article provides a comprehensive overview of Druid, an open‑source, distributed, column‑store system for real‑time analytics, detailing its features, limitations, typical use cases, core components, external dependencies, data flow, and high‑availability mechanisms.

Big Data Technology & Architecture

May 31, 2020


Druid is an open‑source, distributed, column‑store system designed for real‑time statistical analysis on massive datasets, providing sub‑second OLAP queries with high concurrency, low latency, and strong reliability.

Key characteristics include columnar storage, bitmap indexing (using the CONCISE algorithm), a shared‑nothing architecture, and support for both real‑time streaming data and batch ingestion.
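To make the bitmap-indexing idea concrete, here is a minimal sketch in Python. It uses plain integers as uncompressed bitmaps, so it illustrates only the indexing idea, not the CONCISE compression itself; the column names and rows are invented for illustration.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an integer whose set bits mark matching rows."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two dimension columns of a hypothetical five-row segment.
country = ["US", "CN", "US", "DE", "CN"]
device = ["ios", "ios", "android", "ios", "android"]

country_index = build_bitmap_index(country)
device_index = build_bitmap_index(device)

# WHERE country = 'US' AND device = 'ios' becomes a bitwise AND.
us_and_ios = country_index["US"] & device_index["ios"]

# WHERE country = 'CN' OR country = 'DE' becomes a bitwise OR.
cn_or_de = country_index["CN"] | country_index["DE"]

print(matching_rows(us_and_ios))
print(matching_rows(cn_or_de))
```

Because a filter reduces to bitwise operations over per-value bitmaps, only the rows that survive the filter need to be read from the metric columns.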

Limitations: segments are immutable, so changing existing data means rebuilding the affected segments, which is costly; Druid cannot handle nested data structures.

Typical use cases: ingesting cleaned records without updates, wide tables without joins, simple metric calculations, high‑precision time‑dimension queries (down to minutes), scenarios where low latency is critical, and analyses tolerant of moderate data‑quality issues.
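A query that fits this profile (one wide table, a simple sum, minute-level time buckets) might look like the following sketch, which builds a Druid native timeseries query as a Python dict. The datasource, dimension, and metric names are hypothetical, and the code only constructs the JSON body; it does not contact a cluster.

```python
import json


def build_timeseries_query(datasource, interval, metric, country):
    """Return a Druid native timeseries query as a plain dict."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        # One result row per minute of the requested interval.
        "granularity": "minute",
        "intervals": [interval],
        # Keep only rows whose "country" dimension equals the given value.
        "filter": {"type": "selector", "dimension": "country", "value": country},
        # A simple additive metric: sum the chosen column.
        "aggregations": [
            {"type": "longSum", "name": "total_" + metric, "fieldName": metric}
        ],
    }


query = build_timeseries_query(
    datasource="ad_events",  # hypothetical datasource name
    interval="2020-05-31T00:00:00Z/2020-05-31T01:00:00Z",
    metric="clicks",  # hypothetical metric column
    country="US",
)
payload = json.dumps(query, indent=2)
print(payload)
```

A body like this is normally sent as an HTTP POST to a Broker's `/druid/v2` endpoint; the sending step is left out here so the example stays self-contained.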

Druid consists of five core components:

Broker nodes: route external queries to Historical and Realtime nodes, merge partial results, and cache query results using an LRU cache (real‑time data is never cached).

Historical nodes: store and query immutable segments, operate in a shared‑nothing fashion, support tiered storage and load balancing, and can serve queries even if deep storage becomes unavailable.

Realtime nodes: ingest and index streaming events, maintain in‑memory indexes that are periodically persisted as immutable segments, and can optionally use Kafka for reliable ingestion.

Coordinator nodes: act as the cluster master, managing segment distribution, replication, and load balancing via ZooKeeper and MySQL metadata.

Indexing services (Overlord, MiddleManager, Peon): generate segments for both batch and streaming ingestion using a master‑slave architecture.
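The Broker behaviour described in the list above (fan a query out, merge partial results, cache with LRU eviction, never cache real-time data) can be sketched roughly as follows. This is a toy model, not Druid's implementation: the `Broker` and `LRUCache` classes are invented here, and caching one partial result per historical segment is an assumption of the sketch.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache built on OrderedDict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


class Broker:
    """Toy broker: sums a metric across historical and real-time segments."""

    def __init__(self, historical_segments, realtime_segments, cache_size=2):
        # Both arguments map a segment id to the metric values it holds.
        self.historical = historical_segments
        self.realtime = realtime_segments
        self.cache = LRUCache(cache_size)
        self.scans = 0  # how many segment scans were actually executed

    def _scan(self, rows):
        self.scans += 1
        return sum(rows)

    def total(self):
        result = 0
        for segment_id, rows in self.historical.items():
            partial = self.cache.get(segment_id)
            if partial is None:
                partial = self._scan(rows)
                self.cache.put(segment_id, partial)
            result += partial
        for rows in self.realtime.values():
            # Real-time data keeps changing, so it is rescanned every time.
            result += self._scan(rows)
        return result


broker = Broker(
    historical_segments={"2020-05-30": [3, 4], "2020-05-31": [5]},
    realtime_segments={"current-hour": [1]},
)
first = broker.total()
broker.realtime["current-hour"].append(2)  # a new event arrives
second = broker.total()
print(first, second, broker.scans)
```

The second call reuses both cached historical partials and rescans only the real-time segment, which is why the new event shows up immediately in the merged total.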

External dependencies include MySQL for metadata storage, Deep Storage (local disk, HDFS, S3, etc.) for persisting segments, and ZooKeeper for cluster coordination, leader election, and service discovery.

Data flow: real‑time events are indexed by Realtime nodes into segments, which are later persisted to Deep Storage; batch data is ingested via the Indexing Service directly into Deep Storage; Historical nodes load segments from Deep Storage, while Broker nodes route queries to the appropriate Historical or Realtime nodes.
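The data flow can be traced with a small in-memory simulation. All class and function names below are hypothetical stand-ins for the real services, and deep storage is modelled as a dictionary; the point is only the order of the steps.

```python
class DeepStorage:
    """Stand-in for HDFS/S3: keeps immutable copies of finished segments."""

    def __init__(self):
        self._segments = {}

    def push(self, segment_id, rows):
        self._segments[segment_id] = tuple(rows)  # tuples cannot be modified

    def pull(self, segment_id):
        return self._segments[segment_id]

    def segment_ids(self):
        return sorted(self._segments)


class RealtimeNode:
    """Buffers incoming events and hands them off as a finished segment."""

    def __init__(self):
        self.buffer = []

    def ingest(self, event):
        self.buffer.append(event)

    def hand_off(self, segment_id, deep_storage):
        deep_storage.push(segment_id, self.buffer)
        self.buffer = []


class HistoricalNode:
    """Serves queries from segments it has loaded out of deep storage."""

    def __init__(self):
        self.loaded = {}

    def load(self, segment_id, deep_storage):
        self.loaded[segment_id] = deep_storage.pull(segment_id)


def broker_count(realtime, historical):
    """Count rows across not-yet-persisted and already-loaded data."""
    loaded_rows = sum(len(rows) for rows in historical.loaded.values())
    return len(realtime.buffer) + loaded_rows


deep = DeepStorage()
realtime = RealtimeNode()
historical = HistoricalNode()

# 1. Streaming events are queryable while still on the real-time node.
for event in [{"page": "a"}, {"page": "b"}]:
    realtime.ingest(event)
before_handoff = broker_count(realtime, historical)

# 2. The real-time node persists its segment; a batch job writes another one.
realtime.hand_off("events_2020-05-31T00", deep)
deep.push("events_2020-05-30", [{"page": "c"}])

# 3. The historical node loads every segment found in deep storage.
for segment_id in deep.segment_ids():
    historical.load(segment_id, deep)

# 4. New events keep arriving while older data is served by the historical node.
realtime.ingest({"page": "d"})
after_handoff = broker_count(realtime, historical)
print(before_handoff, after_handoff)
```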

Features such as high availability, fault tolerance, replication, and tiered storage ensure that Druid can continue serving queries even when some nodes or ZooKeeper instances fail.
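Why replication keeps queries answerable can be shown with a simplified placement sketch. The least-loaded-first rule below is an assumption made for illustration, not Druid's actual balancing strategy, and the node and segment names are invented.

```python
def assign_replicas(segments, nodes, replicas=2):
    """Place each segment on `replicas` distinct nodes, least-loaded first."""
    placement = {node: set() for node in nodes}
    for segment in segments:
        # Sort by current load, then by name, so the result is deterministic.
        targets = sorted(placement, key=lambda n: (len(placement[n]), n))
        for node in targets[:replicas]:
            placement[node].add(segment)
    return placement


def queryable_segments(placement, failed_nodes):
    """Return the segments still served by at least one surviving node."""
    alive = [segs for node, segs in placement.items() if node not in failed_nodes]
    return set().union(*alive)


placement = assign_replicas(["s1", "s2", "s3"], ["h1", "h2", "h3"], replicas=2)

# With two copies of every segment, losing any single node loses no data.
after_one_failure = queryable_segments(placement, {"h1"})

# Losing two nodes at once exceeds what two replicas can absorb.
after_two_failures = queryable_segments(placement, {"h1", "h2"})

print(sorted(after_one_failure), sorted(after_two_failures))
```

With a replication factor of two, every segment survives any single node failure, while a double failure can make some segments temporarily unqueryable until they are reloaded elsewhere.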


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
