How Dremel Encodes Nested Data: Definition & Repetition Levels Explained
This article breaks down Dremel's columnar encoding for nested data, detailing the definition‑level and repetition‑level concepts, showing step‑by‑step examples of encoding and reconstructing JSON‑like schemas, and explaining the limits of single‑column reconstruction.
Introduction
Efficient storage and query of nested data (e.g., JSON, Protocol Buffers) is a long‑standing challenge for big‑data systems. Traditional relational databases require complex joins or denormalization, which increase redundancy and slow queries. Google’s Dremel paper introduced a columnar approach that directly supports nested structures, influencing formats such as Apache Parquet, Apache Arrow, and Lance.
Core Concepts: Definition Level (DL) and Repetition Level (RL)
DL indicates how many optional levels of a field are defined (i.e., the depth of uncertainty). RL records how many repeated levels are reused compared with the previous row. Together they allow a tree‑shaped schema to be flattened into a highly compressible columnar representation while preserving the ability to reconstruct the original hierarchy without loss.
Encoding Example: Order Schema
message Order {
required string order_id;
optional group user {
required string name;
optional int age;
}
repeated group items {
required string item_id;
optional group price {
required double amount;
optional string currency;
}
}
}Maximum DL is 3 because the deepest optional path items.price.currency contains three non‑required fields. The only repeated field is items, so the maximum RL is 1.
Encoding the field items.price.currency yields:
currency RL DL
USD 0 3
NULL 1 2Explanation of the two rows:
Row 1 (USD): items exists, price exists, currency exists → DL=3, RL=0 (new order).
Row 2 (NULL): items exists, price exists, currency missing → DL=2, RL=1 (same items as previous row).
Reconstruction Example: Document Schema with Nested Repeated Fields
message Document {
required string doc_id;
repeated group paragraphs {
repeated group words {
required string word;
}
}
}For the path paragraphs.words.word, DL = 2 (both paragraphs and words are repeated, word is required) and RL = 2 (two repeated levels).
Encoding the sample data produces:
word RL DL
hello 0 2
world 2 2
good 1 2Row explanations: hello: new document → RL=0, DL=2 (both repeated levels present). world: same paragraphs and words as previous row → RL=2, DL=2. good: same paragraphs but a new words array → RL=1, DL=2.
Reconstruction Rules
Rule 1: RL determines how many levels of the current path stack are reused.
Rule 2: DL determines whether a value is present for the current row (i.e., whether to insert).
Rule 3: Always maintain a "current path stack", e.g., [Document, paragraph, words].
Using these rules, the article walks through step‑by‑step reconstruction of the original JSON structures from the columnar rows, showing how to handle RL=0 (create new repeated containers), RL>0 (reuse containers), and DL values (insert or skip). It also highlights a limitation: a single column (e.g., word) cannot fully reconstruct records that contain empty arrays or structures; additional columns or schema metadata are required.
Key Takeaways
Dremel’s DL/RL mechanism provides a compact, lossless columnar encoding for arbitrarily nested data.
Understanding the maximum DL and RL for a schema is essential before encoding.
Reconstruction relies on both DL/RL values and the presence of all relevant columns; otherwise empty repeated structures cannot be recovered.
Big Data Technology Tribe
Focused on computer science and cutting‑edge tech, we distill complex knowledge into clear, actionable insights. We track tech evolution, share industry trends and deep analysis, helping you keep learning, boost your technical edge, and ride the digital wave forward.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
