Big Data 9 min read

How Dremel Encodes Nested Data: Definition & Repetition Levels Explained

This article breaks down Dremel's columnar encoding for nested data, detailing the definition‑level and repetition‑level concepts, showing step‑by‑step examples of encoding and reconstructing JSON‑like schemas, and explaining the limits of single‑column reconstruction.

Big Data Technology Tribe

Mar 22, 2026

How Dremel Encodes Nested Data: Definition & Repetition Levels Explained

Introduction

Efficient storage and query of nested data (e.g., JSON, Protocol Buffers) is a long‑standing challenge for big‑data systems. Traditional relational databases require complex joins or denormalization, which increase redundancy and slow queries. Google’s Dremel paper introduced a columnar approach that directly supports nested structures, influencing formats such as Apache Parquet, Apache Arrow, and Lance.

Core Concepts: Definition Level (DL) and Repetition Level (RL)

DL indicates how many optional levels of a field are defined (i.e., the depth of uncertainty). RL records how many repeated levels are reused compared with the previous row. Together they allow a tree‑shaped schema to be flattened into a highly compressible columnar representation while preserving the ability to reconstruct the original hierarchy without loss.

Encoding Example: Order Schema

message Order {
  required string order_id;
  optional group user {
    required string name;
    optional int age;
  }
  repeated group items {
    required string item_id;
    optional group price {
      required double amount;
      optional string currency;
    }
  }
}

Maximum DL is 3 because the deepest optional path items.price.currency contains three non‑required fields. The only repeated field is items, so the maximum RL is 1.

Encoding the field items.price.currency yields:

currency   RL  DL
USD        0   3
NULL       1   2

Explanation of the two rows:

Row 1 (USD): items exists, price exists, currency exists → DL=3, RL=0 (new order).

Row 2 (NULL): items exists, price exists, currency missing → DL=2, RL=1 (same items as previous row).

Reconstruction Example: Document Schema with Nested Repeated Fields

message Document {
  required string doc_id;
  repeated group paragraphs {
    repeated group words {
      required string word;
    }
  }
}

For the path paragraphs.words.word, DL = 2 (both paragraphs and words are repeated, word is required) and RL = 2 (two repeated levels).

Encoding the sample data produces:

word   RL  DL
hello  0   2
world  2   2
good   1   2

Row explanations: hello: new document → RL=0, DL=2 (both repeated levels present). world: same paragraphs and words as previous row → RL=2, DL=2. good: same paragraphs but a new words array → RL=1, DL=2.

Reconstruction Rules

Rule 1: RL determines how many levels of the current path stack are reused.

Rule 2: DL determines whether a value is present for the current row (i.e., whether to insert).

Rule 3: Always maintain a "current path stack", e.g., [Document, paragraph, words].

Using these rules, the article walks through step‑by‑step reconstruction of the original JSON structures from the columnar rows, showing how to handle RL=0 (create new repeated containers), RL>0 (reuse containers), and DL values (insert or skip). It also highlights a limitation: a single column (e.g., word) cannot fully reconstruct records that contain empty arrays or structures; additional columns or schema metadata are required.

Key Takeaways

Dremel’s DL/RL mechanism provides a compact, lossless columnar encoding for arbitrarily nested data.

Understanding the maximum DL and RL for a schema is essential before encoding.

Reconstruction relies on both DL/RL values and the presence of all relevant columns; otherwise empty repeated structures cannot be recovered.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

columnar storage parquet Apache Arrow nested data definition level Dremel repetition level

Written by

Big Data Technology Tribe

Focused on computer science and cutting‑edge tech, we distill complex knowledge into clear, actionable insights. We track tech evolution, share industry trends and deep analysis, helping you keep learning, boost your technical edge, and ride the digital wave forward.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.