Understanding Druid: Real‑time OLAP Architecture, Features, Ingestion, and Querying
This article provides a comprehensive overview of Apache Druid, covering its real‑time OLAP design, core features, six‑component architecture, segment storage model, data ingestion pipelines (including Tranquility and Kafka), native and SQL query interfaces, and practical tuning tips with code examples.
Druid is an open‑source, high‑performance data store designed for real‑time OLAP on large data sets, often powering GUI analytics and high‑concurrency APIs.
Key Features include columnar storage, scalable distributed architecture, parallel computation, real‑time and batch ingestion, cloud‑native design, efficient indexing, time‑based partitioning, and automatic aggregation.
The most attractive features for many users are instant queryability after ingestion, automatic real‑time aggregation, and efficient index structures.
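The automatic aggregation mentioned above (rollup) can be pictured with a minimal sketch. This is not Druid's implementation; the Event and RolledUp shapes, the field names, and the one-minute granularity are assumptions made for illustration. Timestamps are truncated to a bucket, and rows sharing the same bucket and dimension values collapse into one row with summed metrics.

```scala
// Rollup sketch: truncate timestamps, then pre-aggregate rows that share
// the same (time bucket, dimension values). Types here are hypothetical.
object RollupSketch {
  final case class Event(timestampMillis: Long, page: String, country: String, clicks: Long)
  final case class RolledUp(bucketStartMillis: Long, page: String, country: String, count: Long, clicks: Long)

  val MinuteMillis: Long = 60000L

  // Assumes non-negative epoch timestamps so that % yields the offset within the bucket.
  def rollup(events: Seq[Event], granularityMillis: Long): Seq[RolledUp] =
    events
      .groupBy(e => (e.timestampMillis - e.timestampMillis % granularityMillis, e.page, e.country))
      .map { case ((bucket, page, country), rows) =>
        // One stored row per group: a row count plus the summed metric.
        RolledUp(bucket, page, country, rows.size.toLong, rows.map(_.clicks).sum)
      }
      .toSeq
      .sortBy(r => (r.bucketStartMillis, r.page, r.country))
}
```

Calling RollupSketch.rollup(events, RollupSketch.MinuteMillis) on four raw events, two of which share a minute and dimension values, yields three stored rows. The trade-off is that individual events can no longer be recovered after aggregation.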
Architecture consists of six components: Coordinator (segment distribution), Overlord (task management), Broker (query routing and result merging), Router (optional front-end), Historical (serves queries from locally cached segments), and MiddleManager (task executor). Queries flow from the Router to the Broker, which sends them to MiddleManagers for data still being ingested and to Historicals for persisted segments. Segment metadata is kept in a metadata store such as MySQL, and ZooKeeper handles cluster coordination.
Data Storage uses segments partitioned by time. Segments are persisted in Deep Storage (e.g., local disk, HDFS, S3) and cached locally by Historical nodes. Each segment contains a timestamp column, dimensions, and metrics. Dimension values are dictionary-encoded as integer IDs and indexed with bitmap (inverted) indexes, so a filter such as Page='Justin Bieber' AND Username='Boxer' becomes a fast bitmap intersection, and grouping works on the compact IDs.
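The dictionary encoding and bitmap index described above can be sketched as follows. This is an illustrative model, not Druid's segment format: DimensionColumn, build, and rowsMatching are hypothetical names, and scala.collection.immutable.BitSet stands in for the compressed bitmaps a real segment would use.

```scala
import scala.collection.immutable.BitSet

// Sketch of one dimension column: dictionary (value -> ID), the encoded
// column, and one bitmap of row numbers per ID. All names are hypothetical.
object BitmapIndexSketch {
  final case class DimensionColumn(
      dictionary: Map[String, Int],
      encoded: Vector[Int],
      bitmaps: Map[Int, BitSet]
  ) {
    // Rows whose value equals `value`; empty if the value never occurs.
    def rowsMatching(value: String): BitSet =
      dictionary.get(value).flatMap(bitmaps.get).getOrElse(BitSet.empty)
  }

  def build(values: Seq[String]): DimensionColumn = {
    // Sorted distinct values get consecutive integer IDs.
    val dictionary = values.distinct.sorted.zipWithIndex.toMap
    val encoded = values.map(dictionary).toVector
    // For each ID, collect the row numbers where it appears.
    val bitmaps = encoded.zipWithIndex
      .groupBy { case (id, _) => id }
      .map { case (id, pairs) =>
        id -> pairs.foldLeft(BitSet.empty)((acc, p) => acc + p._2)
      }
    DimensionColumn(dictionary, encoded, bitmaps)
  }
}
```

With a page column and a username column built this way, the filter Page='Justin Bieber' AND Username='Boxer' is page.rowsMatching("Justin Bieber") & user.rowsMatching("Boxer"), and an OR filter uses | instead. Only the matching row numbers are then read from the metric columns.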
Ingestion supports real‑time (Tranquility) and batch (Hadoop) modes, forming a Lambda architecture. The article shares Scala code for defining event schemas, a BeamFactory, and Spark Streaming integration, illustrating how to configure Zookeeper, dimensions, aggregators, rollup, and tuning parameters.
case class MetricEvent(jsonString: String) { ... }
class MetricEventBeamFactory extends BeamFactory[MetricEvent] { ... }
loghubStream.foreachRDD(rdd => rdd.map(x => MetricEvent(new String(x))).propagate(new MetricEventBeamFactory))
Query Interfaces include native JSON queries sent via HTTP POST to the Broker and a SQL layer built on Apache Calcite. Example native query JSON and a SQL aggregation using REGEXP_EXTRACT are provided.
{ "queryType": "timeseries", "dataSource": "sample_datasource", ... }
Various query types are supported: Timeseries, TopN, GroupBy, Metadata queries (Time Boundary, Segment Metadata, Datasource Metadata), Search, Scan, and Select (deprecated).
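To make the native query path concrete, here is a minimal sketch that builds a timeseries query and wraps it in an HTTP POST for the Broker. The host, the interval, the hourly granularity, and the count aggregator are assumptions for illustration; the article's own query body is elided and is not reproduced here. The sketch uses the JDK 11+ java.net.http client and the Broker's /druid/v2/ native query endpoint on its default port 8082.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.nio.charset.StandardCharsets

object NativeQuerySketch {
  // Assumed Broker address; adjust host and port for a real cluster.
  val BrokerUrl = "http://localhost:8082/druid/v2/"

  // Builds a small timeseries query. Inputs are not JSON-escaped, so this
  // is only safe for trusted, simple names (an assumption of the sketch).
  def timeseriesQuery(dataSource: String, interval: String): String =
    s"""{
       |  "queryType": "timeseries",
       |  "dataSource": "$dataSource",
       |  "granularity": "hour",
       |  "intervals": ["$interval"],
       |  "aggregations": [{ "type": "count", "name": "rows" }]
       |}""".stripMargin

  // Wraps the JSON in a POST request; nothing is sent until a client executes it.
  def buildRequest(queryJson: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(BrokerUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(queryJson, StandardCharsets.UTF_8))
      .build()
}
```

The request can then be executed with java.net.http.HttpClient.newHttpClient().send(request, java.net.http.HttpResponse.BodyHandlers.ofString()). SQL queries are posted the same way to /druid/v2/sql/, with the statement carried in a JSON field named query.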
Segment Management involves Deep Storage for cold data and segment cache for hot data, with configurable TTL and rules (Load, Drop, Broadcast) that can be period‑based, interval‑based, or forever, managed by the Coordinator.
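The rule mechanism can be sketched as an ordered list in which the first matching rule decides a segment's fate. This is a simplified model, not the Coordinator's code: the type names are hypothetical, Broadcast rules are omitted, a java.time.Duration stands in for the ISO 8601 periods Druid uses, and every rule here matches on plain interval overlap, which is coarser than Druid's per-rule semantics.

```scala
import java.time.{Duration, Instant}

// First-match rule evaluation over Load/Drop rules. Names are hypothetical.
object RetentionRuleSketch {
  sealed trait Action
  case object Load extends Action
  case object Drop extends Action

  sealed trait Rule {
    def action: Action
    def matches(segStart: Instant, segEnd: Instant, now: Instant): Boolean
  }

  // Half-open interval overlap: [aStart, aEnd) intersects [bStart, bEnd).
  private def overlaps(aStart: Instant, aEnd: Instant, bStart: Instant, bEnd: Instant): Boolean =
    aStart.isBefore(bEnd) && bStart.isBefore(aEnd)

  final case class ForeverRule(action: Action) extends Rule {
    def matches(segStart: Instant, segEnd: Instant, now: Instant): Boolean = true
  }

  // Matches segments overlapping the window [now - period, now).
  final case class PeriodRule(action: Action, period: Duration) extends Rule {
    def matches(segStart: Instant, segEnd: Instant, now: Instant): Boolean =
      overlaps(segStart, segEnd, now.minus(period), now)
  }

  final case class IntervalRule(action: Action, start: Instant, end: Instant) extends Rule {
    def matches(segStart: Instant, segEnd: Instant, now: Instant): Boolean =
      overlaps(segStart, segEnd, start, end)
  }

  // Rules are checked in order; the first match wins. None means no rule applied.
  def decide(rules: Seq[Rule], segStart: Instant, segEnd: Instant, now: Instant): Option[Action] =
    rules.find(_.matches(segStart, segEnd, now)).map(_.action)
}
```

A typical chain under this model is Seq(PeriodRule(Load, Duration.ofDays(30)), ForeverRule(Drop)): segments from the last 30 days stay loaded on Historicals, and everything older is dropped from the cache while remaining in Deep Storage. Order matters, since placing the forever rule first would shadow the period rule.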
Realtime Query Tuning suggestions include adjusting druid.processing.numThreads, druid.processing.buffer.sizeBytes, druid.peon.xmx.gb, and druid.indexer.runner.javaOpts to avoid bottlenecks.
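When adjusting druid.processing.numThreads and druid.processing.buffer.sizeBytes together, a quick sanity check is the direct-memory rule of thumb from the Druid documentation: buffer size multiplied by (processing threads + merge buffers + 1). The sketch below encodes that arithmetic. The merge buffer count (druid.processing.numMergeBuffers) is not mentioned in this article and is included as an assumption, and the helper names are hypothetical.

```scala
// Direct-memory estimate for a Druid process, following the documented
// rule of thumb: sizeBytes * (numThreads + numMergeBuffers + 1).
object DirectMemorySketch {
  def requiredDirectMemoryBytes(bufferSizeBytes: Long, numThreads: Int, numMergeBuffers: Int): Long =
    bufferSizeBytes * (numThreads.toLong + numMergeBuffers.toLong + 1L)

  // True if the estimate fits within the JVM's direct memory limit
  // (the value given to -XX:MaxDirectMemorySize, e.g. via javaOpts).
  def fits(bufferSizeBytes: Long, numThreads: Int, numMergeBuffers: Int, maxDirectMemoryBytes: Long): Boolean =
    requiredDirectMemoryBytes(bufferSizeBytes, numThreads, numMergeBuffers) <= maxDirectMemoryBytes
}
```

For example, 500000000-byte buffers with 7 processing threads and 2 merge buffers need about 5 GB of direct memory per process. Raising numThreads without raising the direct memory limit in druid.indexer.runner.javaOpts is a common way to end up with tasks that fail at startup.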
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
