Understanding Apache Parquet: Architecture, Data Model, and Performance
This article provides a comprehensive overview of Apache Parquet, covering its modular architecture, its nested data model, the striping/assembly algorithm with repetition and definition levels, file‑format details, push‑down optimizations, performance characteristics, and the project's evolution within the big‑data ecosystem.
Overview
Parquet is a language‑independent columnar storage format that can be used with a wide range of query engines (Hive, Impala, Presto, etc.) and processing frameworks (MapReduce, Spark, etc.). It stores data in a self‑describing binary file, making it suitable for OLAP workloads.
Project Composition
The Parquet project consists of several sub‑projects:
parquet‑format: Defines the format specification and all metadata objects, which are serialized with Apache Thrift and stored in the file footer.
parquet‑compatibility: Test code for cross‑language (Java/C++) read/write compatibility.
parquet‑cpp: C++ library for reading and writing Parquet files.
Data Model
Parquet supports nested schemas similar to Protocol Buffers. Each field has a repetition (required, optional, repeated) and a type (group or primitive). An example schema is:
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
The schema can be visualised as a tree where leaf nodes correspond to primitive columns.
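As a sketch (plain Python, not the Parquet API; SchemaField and leaf_columns are illustrative names), the schema can be modeled as a tree and its leaf columns enumerated:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SchemaField:
    name: str
    repetition: str  # "required" | "optional" | "repeated"
    children: List["SchemaField"] = field(default_factory=list)

def leaf_columns(fields, prefix=""):
    """Yield the dotted path of every primitive (leaf) field."""
    for f in fields:
        if f.children:
            yield from leaf_columns(f.children, prefix + f.name + ".")
        else:
            yield prefix + f.name

document = SchemaField("Document", "required", [
    SchemaField("DocId", "required"),
    SchemaField("Links", "optional", [
        SchemaField("Backward", "repeated"),
        SchemaField("Forward", "repeated"),
    ]),
    SchemaField("Name", "repeated", [
        SchemaField("Language", "repeated", [
            SchemaField("Code", "required"),
            SchemaField("Country", "optional"),
        ]),
        SchemaField("Url", "optional"),
    ]),
])

columns = list(leaf_columns(document.children))
# → ['DocId', 'Links.Backward', 'Links.Forward',
#    'Name.Language.Code', 'Name.Language.Country', 'Name.Url']
```

Each of these six paths becomes a separately stored column, which is what makes projection push‑down possible later on.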
Striping/Assembly Algorithm
To reconstruct a record from columnar storage, each value is stored with three components: the value itself, a repetition level, and a definition level. The repetition level indicates at which repeated field in the value's path a new list instance begins, while the definition level records how many of the optional or repeated fields along that path are actually present, which is what allows NULL and missing values to be reconstructed.
Example of repetition level calculation for a nested schema:
message nested {
repeated group level1 {
repeated string level2;
}
}
r1: [[a,b,c], [d,e,f,g]]
r2: [[h], [i,j]]
Values are assigned repetition levels based on shared ancestry, enabling compact encoding.
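The assignment can be sketched in Python for the single column level1.level2 (a minimal illustration of record shredding and assembly, not the Parquet implementation; it ignores NULLs and empty lists, which additionally require definition levels):

```python
def shred(records):
    """Flatten records of shape [[str, ...], ...] into (value, repetition
    level) pairs for the level1.level2 column (max repetition level = 2)."""
    pairs = []
    for record in records:
        for i, level1 in enumerate(record):
            for j, value in enumerate(level1):
                if i == 0 and j == 0:
                    r = 0  # first value of a new record
                elif j == 0:
                    r = 1  # a new level1 group starts here
                else:
                    r = 2  # continuing the current level2 list
                pairs.append((value, r))
    return pairs

def assemble(pairs):
    """Rebuild nested records from (value, repetition level) pairs."""
    records = []
    for value, r in pairs:
        if r == 0:
            records.append([[value]])      # new record, fresh level1 group
        elif r == 1:
            records[-1].append([value])    # new level1 group in this record
        else:
            records[-1][-1].append(value)  # same level2 list
    return records

r1 = [["a", "b", "c"], ["d", "e", "f", "g"]]
r2 = [["h"], ["i", "j"]]
pairs = shred([r1, r2])
# → [('a', 0), ('b', 2), ('c', 2), ('d', 1), ('e', 2), ('f', 2), ('g', 2),
#    ('h', 0), ('i', 1), ('j', 2)]
assert assemble(pairs) == [r1, r2]
```

Note how only the first value of each record gets level 0, so record boundaries never need to be stored explicitly.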
Definition Levels
Definition levels are used to represent missing optional values. For a schema with optional groups, the definition level records the deepest level where a value becomes undefined.
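As a sketch (assuming records are plain Python dicts, not the Parquet API), the definition level for a column a.b.c of nested optional fields, as in the ExampleDefinitionLevel schema below, is simply the count of fields along the path that are present:

```python
def definition_level(record):
    """Definition level for column a.b.c: the number of optional fields
    along the path that are actually defined (0..3)."""
    level = 0
    node = record
    for key in ("a", "b", "c"):
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            break
        level += 1
    return level

assert definition_level({}) == 0                        # a is NULL
assert definition_level({"a": {}}) == 1                 # b is NULL
assert definition_level({"a": {"b": {}}}) == 2          # c is NULL
assert definition_level({"a": {"b": {"c": "x"}}}) == 3  # fully defined
```

Because the maximum level (3 here) means "value present", only levels below the maximum represent NULLs, and the level pinpoints exactly which ancestor was missing.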
message ExampleDefinitionLevel {
optional group a {
optional group b {
optional string c;
}
}
}
File Format
A Parquet file is composed of:
Magic number ("PAR1", 4 bytes) at the beginning and end of the file for validation.
Row groups, each containing column chunks.
Column chunks, which are further divided into pages (data, dictionary, index).
Footer with metadata, including schema and per‑row‑group statistics.
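This layout can be sanity‑checked with a few lines of Python (a sketch of the byte layout only; a real footer contains Thrift‑serialized FileMetaData that would still need decoding, and read_footer is a hypothetical helper, not part of any Parquet library):

```python
import struct

MAGIC = b"PAR1"

def read_footer(buf: bytes) -> bytes:
    """Validate the magic numbers and return the raw footer metadata bytes.

    A Parquet file ends with: <footer bytes> <4-byte little-endian footer
    length> <"PAR1">, so the footer is located by reading backwards.
    """
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - footer_len:-8]

# Build a toy byte layout with a fake (non-Thrift) footer payload:
footer = b"fake-thrift-metadata"
fake_file = (MAGIC + b"row-group-bytes"
             + footer + struct.pack("<I", len(footer)) + MAGIC)
assert read_footer(fake_file) == footer
```

Reading the footer first is why Parquet readers can plan which row groups and columns to fetch before touching any data pages.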
Push‑down optimizations include:
Column (Projection) Push‑Down: only the columns referenced by the query are read from disk.
Predicate Push‑Down: entire row groups are skipped when the min/max statistics stored with their column chunks show that no value can satisfy the filter.
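Row‑group pruning under an equality predicate can be sketched as follows (hypothetical structures; real engines evaluate richer predicates against the Thrift‑encoded statistics):

```python
def prune_row_groups(row_groups, column, value):
    """Keep only row groups whose [min, max] range for `column` could
    contain `value`; the rest are skipped without being read."""
    kept = []
    for rg in row_groups:
        lo, hi = rg["stats"][column]
        if lo <= value <= hi:
            kept.append(rg)
    return kept

groups = [
    {"id": 0, "stats": {"price": (1, 50)}},
    {"id": 1, "stats": {"price": (60, 90)}},
    {"id": 2, "stats": {"price": (40, 70)}},
]
survivors = [rg["id"] for rg in prune_row_groups(groups, "price", 55)]
# → [2]
```

The statistics can only prove absence, never presence, so surviving row groups must still be scanned and filtered row by row.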
Performance
Benchmarks on TPC‑H and TPC‑DS datasets show that Parquet achieves higher compression ratios and lower I/O than row‑oriented formats, leading to significant query speedups in engines such as Impala and Hive.
Project Evolution
Started in 2012 by Twitter and Cloudera, Parquet has grown with contributions from Criteo and many other open‑source projects. Format version 2.0 introduced a new page format and additional logical types such as Decimal and Timestamp, alongside work toward richer statistics such as Bloom filters.
Conclusion
Parquet provides an efficient columnar storage solution for nested data, offering compression, encoding, and push‑down capabilities that improve OLAP query performance, and it continues to evolve to meet growing data‑analytics demands.