Understanding Apache Parquet: Architecture, Data Model, and Performance
This article provides a comprehensive overview of Apache Parquet, covering its modular architecture, its nested data model, the striping/assembly algorithm with repetition and definition levels, file‑format details, push‑down optimizations, performance characteristics, and the project's evolution within the big‑data ecosystem.
Overview
Parquet is a language‑independent columnar storage format that can be used with a wide range of query engines (Hive, Impala, Presto, etc.) and processing frameworks (MapReduce, Spark, etc.). It stores data in a self‑describing binary file, making it suitable for OLAP workloads.
Project Composition
The Parquet project consists of several sub‑projects:
parquet‑format: Defines the format specification and all metadata objects, which are serialized with Apache Thrift and stored in the file footer.
parquet‑compatibility: Test code for cross‑language (Java/C++) read/write compatibility.
parquet‑cpp: C++ library for reading and writing Parquet files.
Data Model
Parquet supports nested schemas similar to Protocol Buffers. Each field has a repetition (required, optional, repeated) and a type (group or primitive). An example schema is:
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country;
}
optional string Url;
}
}
The schema can be visualised as a tree where leaf nodes correspond to primitive columns.
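As a sketch (plain Python, not the Parquet API; SchemaField and leaf_columns are illustrative names), the schema can be modeled as a tree and its leaf columns enumerated:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SchemaField:
    name: str
    repetition: str  # "required" | "optional" | "repeated"
    children: List["SchemaField"] = field(default_factory=list)

def leaf_columns(fields, prefix=""):
    """Yield the dotted path of every primitive (leaf) field."""
    for f in fields:
        if f.children:
            yield from leaf_columns(f.children, prefix + f.name + ".")
        else:
            yield prefix + f.name

document = SchemaField("Document", "required", [
    SchemaField("DocId", "required"),
    SchemaField("Links", "optional", [
        SchemaField("Backward", "repeated"),
        SchemaField("Forward", "repeated"),
    ]),
    SchemaField("Name", "repeated", [
        SchemaField("Language", "repeated", [
            SchemaField("Code", "required"),
            SchemaField("Country", "optional"),
        ]),
        SchemaField("Url", "optional"),
    ]),
])

columns = list(leaf_columns(document.children))
# → ['DocId', 'Links.Backward', 'Links.Forward',
#    'Name.Language.Code', 'Name.Language.Country', 'Name.Url']
```

Each of these six paths becomes a separately stored column, which is what makes projection push‑down possible later on.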
Striping/Assembly Algorithm
To reconstruct a record from columnar storage, each value is stored with three components: the value itself, a repetition level, and a definition level. The repetition level indicates at which repeated field in the value's path a new list instance begins, while the definition level records how many of the optional or repeated fields along that path are actually present, which is what allows NULL and missing values to be reconstructed.
Example of repetition level calculation for a nested schema:
message nested {
repeated group level1 {
repeated string level2;
}
}
r1: [[a,b,c], [d,e,f,g]]
r2: [[h], [i,j]]
Values are assigned repetition levels based on shared ancestry, enabling compact encoding.
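The assignment can be sketched in Python for the single column level1.level2 (a minimal illustration of record shredding and assembly, not the Parquet implementation; it ignores NULLs and empty lists, which additionally require definition levels):

```python
def shred(records):
    """Flatten records of shape [[str, ...], ...] into (value, repetition
    level) pairs for the level1.level2 column (max repetition level = 2)."""
    pairs = []
    for record in records:
        for i, level1 in enumerate(record):
            for j, value in enumerate(level1):
                if i == 0 and j == 0:
                    r = 0  # first value of a new record
                elif j == 0:
                    r = 1  # a new level1 group starts here
                else:
                    r = 2  # continuing the current level2 list
                pairs.append((value, r))
    return pairs

def assemble(pairs):
    """Rebuild nested records from (value, repetition level) pairs."""
    records = []
    for value, r in pairs:
        if r == 0:
            records.append([[value]])      # new record, fresh level1 group
        elif r == 1:
            records[-1].append([value])    # new level1 group in this record
        else:
            records[-1][-1].append(value)  # same level2 list
    return records

r1 = [["a", "b", "c"], ["d", "e", "f", "g"]]
r2 = [["h"], ["i", "j"]]
pairs = shred([r1, r2])
# → [('a', 0), ('b', 2), ('c', 2), ('d', 1), ('e', 2), ('f', 2), ('g', 2),
#    ('h', 0), ('i', 1), ('j', 2)]
assert assemble(pairs) == [r1, r2]
```

Note how only the first value of each record gets level 0, so record boundaries never need to be stored explicitly.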
Definition Levels
Definition levels are used to represent missing optional values. For a schema with optional groups, the definition level records the deepest level where a value becomes undefined.
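As a sketch (assuming records are plain Python dicts, not the Parquet API), the definition level for a column a.b.c of nested optional fields, as in the ExampleDefinitionLevel schema below, is simply the count of fields along the path that are present:

```python
def definition_level(record):
    """Definition level for column a.b.c: the number of optional fields
    along the path that are actually defined (0..3)."""
    level = 0
    node = record
    for key in ("a", "b", "c"):
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            break
        level += 1
    return level

assert definition_level({}) == 0                        # a is NULL
assert definition_level({"a": {}}) == 1                 # b is NULL
assert definition_level({"a": {"b": {}}}) == 2          # c is NULL
assert definition_level({"a": {"b": {"c": "x"}}}) == 3  # fully defined
```

Because the maximum level (3 here) means "value present", only levels below the maximum represent NULLs, and the level pinpoints exactly which ancestor was missing.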
message ExampleDefinitionLevel {
optional group a {
optional group b {
optional string c;
}
}
}
File Format
A Parquet file is composed of:
Magic number ("PAR1", 4 bytes) at the beginning and end of the file for validation.
Row groups, each containing column chunks.
Column chunks, which are further divided into pages (data, dictionary, index).
Footer with metadata, including schema and per‑row‑group statistics.
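This layout can be sanity‑checked with a few lines of Python (a sketch of the byte layout only; a real footer contains Thrift‑serialized FileMetaData that would still need decoding, and read_footer is a hypothetical helper, not part of any Parquet library):

```python
import struct

MAGIC = b"PAR1"

def read_footer(buf: bytes) -> bytes:
    """Validate the magic numbers and return the raw footer metadata bytes.

    A Parquet file ends with: <footer bytes> <4-byte little-endian footer
    length> <"PAR1">, so the footer is located by reading backwards.
    """
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - footer_len:-8]

# Build a toy byte layout with a fake (non-Thrift) footer payload:
footer = b"fake-thrift-metadata"
fake_file = (MAGIC + b"row-group-bytes"
             + footer + struct.pack("<I", len(footer)) + MAGIC)
assert read_footer(fake_file) == footer
```

Reading the footer first is why Parquet readers can plan which row groups and columns to fetch before touching any data pages.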
Push‑down optimizations include:
Column (Projection) Push‑Down: only the columns referenced by the query are read from disk.
Predicate Push‑Down: entire row groups are skipped when the min/max statistics stored with their column chunks show that no value can satisfy the filter.
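Row‑group pruning under an equality predicate can be sketched as follows (hypothetical structures; real engines evaluate richer predicates against the Thrift‑encoded statistics):

```python
def prune_row_groups(row_groups, column, value):
    """Keep only row groups whose [min, max] range for `column` could
    contain `value`; the rest are skipped without being read."""
    kept = []
    for rg in row_groups:
        lo, hi = rg["stats"][column]
        if lo <= value <= hi:
            kept.append(rg)
    return kept

groups = [
    {"id": 0, "stats": {"price": (1, 50)}},
    {"id": 1, "stats": {"price": (60, 90)}},
    {"id": 2, "stats": {"price": (40, 70)}},
]
survivors = [rg["id"] for rg in prune_row_groups(groups, "price", 55)]
# → [2]
```

The statistics can only prove absence, never presence, so surviving row groups must still be scanned and filtered row by row.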
Performance
Benchmarks on TPC‑H and TPC‑DS datasets show that Parquet achieves higher compression ratios and lower I/O than row‑oriented formats, leading to significant query speedups in engines such as Impala and Hive.
Project Evolution
Started in 2012 by Twitter and Cloudera, Parquet has grown with contributions from Criteo and many other open‑source projects. Format version 2.0 introduced a new page format and additional logical types such as Decimal and Timestamp, alongside work toward richer statistics such as Bloom filters.
Conclusion
Parquet provides an efficient columnar storage solution for nested data, offering compression, encoding, and push‑down capabilities that improve OLAP query performance, and it continues to evolve to meet growing data‑analytics demands.