Deep Dive into Apache Druid V1 Data Storage Format and Architecture

This article provides an in‑depth analysis of Apache Druid V1’s column‑oriented storage format, covering its dictionary, encoded dimension values, bitmap inverted index, array handling, and how these structures are used during query execution, illustrated with diagrams and code examples.

Big Data Technology & Architecture

Jan 1, 2021

Apache Druid is a high‑performance OLAP engine whose storage format is a core component for achieving sub‑second queries on massive datasets. The article examines Druid V1’s custom data format, focusing on index structures and on‑disk storage.

Dimension Data Structure

Druid stores data column‑wise, separating dimension columns (which have indexes) from metric columns (which store raw row values). Using a sample advertising‑effect dataset, the article shows how dimensions are independently stored.

Dictionary

The dictionary de‑duplicates all values of a column, sorts them, and assigns each a numeric code equal to its array index. This enables compact, fixed‑length integer encoding and reduces storage overhead.
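The dictionary step can be sketched in a few lines of Python. This is an illustration of the idea only, not Druid's code; the function name and the sample city values are hypothetical.

```python
def build_dictionary(values):
    """Return (sorted unique values, mapping from value to its code)."""
    dictionary = sorted(set(values))           # de-duplicate, then sort
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, code_of

# Hypothetical "city" column from an advertising dataset.
city_column = ["Shanghai", "Beijing", "Shenzhen", "Beijing", "Shanghai"]

dictionary, code_of = build_dictionary(city_column)
encoded = [code_of[value] for value in city_column]

print(dictionary)  # ['Beijing', 'Shanghai', 'Shenzhen']
print(encoded)     # [1, 0, 2, 0, 1]
```

Because each code equals the value's position in the sorted array, decoding is a plain array lookup: dictionary[code].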

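The article illustrates both the logical and the physical dictionary structure. Logically, the dictionary is a linear array of values. Physically, it consists of an index section that stores the offset of each value and a data section that stores each value's length together with the value itself.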
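A minimal sketch of such a two-section layout is shown below. The 4-byte big-endian integers, the use of end offsets, and the function names are assumptions made for illustration; they are not taken from Druid's source.

```python
import struct

def serialize_dictionary(sorted_values):
    """Build an index section (end offsets) and a data section (length + bytes)."""
    index = bytearray()
    data = bytearray()
    for value in sorted_values:
        raw = value.encode("utf-8")
        data += struct.pack(">i", len(raw)) + raw   # length prefix, then the value
        index += struct.pack(">i", len(data))       # end offset of this entry
    return bytes(index), bytes(data)

def read_value(index, data, code):
    """Random access: find the entry's start from the previous end offset."""
    start = 0 if code == 0 else struct.unpack_from(">i", index, 4 * (code - 1))[0]
    (length,) = struct.unpack_from(">i", data, start)
    return data[start + 4 : start + 4 + length].decode("utf-8")

index, data = serialize_dictionary(["Beijing", "Shanghai", "Shenzhen"])
print(read_value(index, data, 1))  # Shanghai
```

The index section is what makes variable-length strings addressable by code without scanning the data section.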

Encoded Dimension Values

Even after aggregation, the same dimension value usually appears in many rows, so Druid stores the per-row codes separately from the dictionary. Every code in a column occupies the same number of bytes, and that width is chosen from the cardinality of the dimension (e.g., 1 byte when the column has at most 2^8 - 1 distinct values).

1 to 2^8 - 1 distinct values => 1 byte
2^8 to 2^16 - 1 distinct values => 2 bytes
2^16 to 2^24 - 1 distinct values => 3 bytes
2^24 to 2^32 - 1 distinct values => 4 bytes
2^32 to 2^40 - 1 distinct values => 5 bytes
...

For the "city" dimension in the example (3 unique values) each code occupies 1 byte.
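The width rule in the table above can be written as a small helper. This is a sketch that follows the table literally; the function names and the big-endian packing are assumptions.

```python
def bytes_per_code(cardinality):
    """Smallest width n such that cardinality <= 2^(8n) - 1, as in the table."""
    if cardinality < 1:
        raise ValueError("cardinality must be at least 1")
    width = 1
    while cardinality > (1 << (8 * width)) - 1:
        width += 1
    return width

def pack_codes(codes, width):
    """Concatenate codes, each stored in exactly `width` bytes."""
    return b"".join(code.to_bytes(width, "big") for code in codes)

print(bytes_per_code(3))      # 1  (the "city" example)
print(bytes_per_code(256))    # 2
print(bytes_per_code(65536))  # 3
print(pack_codes([1, 0, 2, 0, 1], bytes_per_code(3)).hex())  # 0100020001
```

With a fixed width per column, the code for row i sits at byte offset i * width, so no per-row index is needed before compression.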

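The physical layout groups the encoded integers into blocks of at most 64 KB and compresses each block. Because compressed blocks differ in size, the layout also stores each block's offset and length so that a reader can locate any block directly.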
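The block layout might look roughly like the following sketch. zlib is used here only as a stand-in because it ships with Python; it is not meant to represent the codec Druid uses, and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # maximum uncompressed bytes per block

def compress_codes(codes, width, block_size=BLOCK_SIZE):
    """Pack fixed-width codes into blocks, compress each, record (offset, length)."""
    per_block = block_size // width          # whole codes only, none straddle blocks
    payload, offsets = bytearray(), []
    for start in range(0, len(codes), per_block):
        raw = b"".join(c.to_bytes(width, "big") for c in codes[start:start + per_block])
        compressed = zlib.compress(raw)
        offsets.append((len(payload), len(compressed)))
        payload += compressed
    return bytes(payload), offsets

def read_code(payload, offsets, row, width, block_size=BLOCK_SIZE):
    """Locate the block holding `row`, decompress it, and slice out one code."""
    per_block = block_size // width
    block_no, within = divmod(row, per_block)
    offset, length = offsets[block_no]
    block = zlib.decompress(payload[offset:offset + length])
    return int.from_bytes(block[within * width:(within + 1) * width], "big")

codes = [i % 3 for i in range(200_000)]
payload, offsets = compress_codes(codes, width=1)
print(len(offsets))                                 # 4 blocks
print(read_code(payload, offsets, 150_001, width=1))  # 1
```

Only the block that contains the requested row is decompressed, which is the point of keeping the offset and length of every block.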

Bitmap Inverted Index

For every dictionary entry, Druid creates a bitmap with one bit per row; a set bit indicates that the corresponding row contains that value. The rows in question are the aggregated rows stored by Druid, not the raw input rows. The bitmaps are compressed to save space.
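As a rough illustration, the inverted index can be built from the encoded column as follows. Plain Python integers serve as uncompressed bitmaps here; compression is deliberately left out, and the function names are hypothetical.

```python
def build_bitmaps(encoded_rows, cardinality):
    """One bitmap per dictionary code; bit r is set when row r holds that code."""
    bitmaps = [0] * cardinality
    for row, code in enumerate(encoded_rows):
        bitmaps[code] |= 1 << row
    return bitmaps

def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Encoded "city" column: 0 = Beijing, 1 = Shanghai, 2 = Shenzhen (hypothetical data).
bitmaps = build_bitmaps([1, 0, 2, 0, 1], cardinality=3)
print(rows_of(bitmaps[0]))  # [1, 3]
print(rows_of(bitmaps[1]))  # [0, 4]
```

Filters then become bitwise operations: OR for alternatives, AND for conjunctions.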

Array Dimensions

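Array-type dimensions follow the same dictionary and bitmap design, but their encoded values form a two-level structure: the outer list holds one entry per row, and these entries vary in length because rows can contain different numbers of values; each entry is an inner list of fixed-width codes.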
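One way to picture the two-level structure is a flat run of fixed-width codes plus one end offset per row. This is a simplified sketch under that assumption; the tag values and function names are hypothetical, and Druid's actual byte layout may differ.

```python
def encode_array_column(rows):
    """Dictionary-encode a multi-value column into (dictionary, offsets, flat codes)."""
    dictionary = sorted({value for row in rows for value in row})
    code_of = {value: code for code, value in enumerate(dictionary)}
    offsets, flat = [], []
    for row in rows:
        flat.extend(code_of[value] for value in row)  # inner list of codes
        offsets.append(len(flat))                     # end offset of this row
    return dictionary, offsets, flat

def row_codes(offsets, flat, row):
    """Return the inner list of codes for one row."""
    start = offsets[row - 1] if row else 0
    return flat[start:offsets[row]]

rows = [["sports", "news"], ["news"], [], ["tech", "sports", "news"]]
dictionary, offsets, flat = encode_array_column(rows)
print(dictionary)                   # ['news', 'sports', 'tech']
print(offsets)                      # [2, 3, 3, 6]
print(row_codes(offsets, flat, 3))  # [2, 1, 0]
```

In the bitmap index, a row with several values simply sets its bit in the bitmap of each value it contains.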

Storage Structure Summary

The article summarizes fixed‑length vs. variable‑length storage patterns and lists the metadata fields used by Druid (version, allowReverseLookup, numBytesUsed, numElements).

version: 1 byte
allowReverseLookup: 1 byte
numBytesUsed: 4 bytes
numElements: 4 bytes
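The four fields above add up to a 10-byte header. The sketch below packs and parses such a header with Python's struct module; the big-endian byte order, the signedness of the integers, and the function names are assumptions for illustration.

```python
import struct

# version (1 byte), allowReverseLookup (1 byte), numBytesUsed (4), numElements (4)
HEADER = struct.Struct(">BBii")

def write_header(version, allow_reverse_lookup, num_bytes_used, num_elements):
    return HEADER.pack(version, int(allow_reverse_lookup), num_bytes_used, num_elements)

def read_header(buffer):
    version, allow, num_bytes_used, num_elements = HEADER.unpack_from(buffer)
    return {
        "version": version,
        "allowReverseLookup": bool(allow),
        "numBytesUsed": num_bytes_used,
        "numElements": num_elements,
    }

header = write_header(1, True, 35, 3)
print(HEADER.size)          # 10
print(read_header(header))
```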

How It Is Used in Queries

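A simple SQL example (SELECT city, SUM(click_cnt) FROM table_t WHERE category=0 OR category=1 GROUP BY city) demonstrates the query flow. The dictionary of the category column is consulted first to translate the filter values 0 and 1 into codes. The bitmaps for those codes are then combined with OR to obtain the set of matching rows. Finally, the encoded values of the city column are read for those rows to form the groups, and click_cnt is summed within each group.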
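The same flow can be traced end to end on a tiny in-memory table. This is a conceptual sketch, not Druid's query engine: the sample rows are invented, a Python dict stands in for the dictionary lookup, and bitmaps are uncompressed integers.

```python
from collections import defaultdict

# Hypothetical rows of table_t, stored column-wise.
category = [0, 1, 2, 0, 1, 2]
city = ["Beijing", "Shanghai", "Beijing", "Shenzhen", "Beijing", "Shanghai"]
click_cnt = [10, 20, 30, 40, 50, 60]

def encode(column):
    """Build the dictionary, per-row codes, and one bitmap per code."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in column]
    bitmaps = [0] * len(dictionary)
    for row, code in enumerate(codes):
        bitmaps[code] |= 1 << row
    return dictionary, code_of, codes, bitmaps

cat_dict, cat_code_of, _, cat_bitmaps = encode(category)
city_dict, _, city_codes, _ = encode(city)

def run_query(wanted_categories):
    # 1. Dictionary: translate filter values into codes.
    # 2. Bitmap index: OR the bitmaps of those codes.
    matching = 0
    for value in wanted_categories:
        code = cat_code_of.get(value)
        if code is not None:
            matching |= cat_bitmaps[code]
    # 3. Encoded values: read city codes of matching rows and aggregate.
    totals = defaultdict(int)
    for row in range(len(click_cnt)):
        if (matching >> row) & 1:
            totals[city_dict[city_codes[row]]] += click_cnt[row]
    return dict(totals)

result = run_query([0, 1])
print(result)  # {'Beijing': 60, 'Shanghai': 20, 'Shenzhen': 40}
```

Rows with category 2 are never touched in step 3, which is where the bitmap index saves work.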

Overall, the article provides a comprehensive technical walkthrough of Druid’s storage mechanisms, illustrating how columnar layout, dictionaries, variable‑length encoding, and compressed bitmap indexes enable fast, low‑latency analytics on big data.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
