An Overview of Apache Parquet: Architecture, Storage Model, and Comparison with ORC
This article provides a comprehensive introduction to Apache Parquet, covering its origins, columnar storage advantages, nested schema support, internal architecture, storage model components, comparison with ORC, and practical tools for inspecting Parquet files.
Overall Introduction
Parquet is a mainstream columnar storage format in the Hadoop ecosystem, originally co‑developed by Twitter and Cloudera and graduated to a top‑level Apache project in May 2015. It is widely regarded as the de facto storage‑format standard for big‑data workloads.
Parquet supports nested structures, making it well‑suited for OLAP scenarios where column‑wise storage and scanning are advantageous.
Key Advantages
1. Higher compression ratio: The columnar layout stores values of the same type together, so each column can be compressed and encoded efficiently, reducing disk usage (with codecs such as gzip and Snappy, reported ratios on typical datasets range roughly from 11× to 27×, depending on codec and data).
2. Reduced I/O: Column pruning and predicate pushdown let a reader fetch only the columns a query needs and skip row groups whose statistics show they cannot satisfy the filter, which significantly lowers unnecessary data scans, especially for wide tables.
Project Overview
Parquet is language‑agnostic and not tied to any specific processing framework. It integrates with many query engines (Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL) and compute frameworks (MapReduce, Spark, Cascading, Crunch, Scalding, Kite), and works with data models such as Avro, Thrift, Protocol Buffers, and POJOs.
The project is organized into three layers:
Storage layer: Defines the Parquet file format in the parquet-format project, including primitive types, page types, encodings, and compression codecs.
Object conversion layer: Implemented in parquet-mr, this layer maps external object models to Parquet's internal representation using the record shredding (striping) and assembly algorithm.
Object model layer: Provides adapters for Avro, Thrift, Protocol Buffers, Hive SerDe, etc., and includes the org.apache.parquet.example package for converting between Java objects and Parquet files.
Nested Data Model
Parquet supports nested structures, allowing efficient columnar storage of complex objects such as Protobuf, Thrift, or JSON. The schema is expressed with a message keyword, where each field has a repetition attribute (required/repeated/optional), a type, and a name.
message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}
This schema describes an address book with a single owner, zero‑or‑more phone numbers, and zero‑or‑more contacts, each contact having a mandatory name and an optional phone number.
Storage Model
A Parquet file consists of Row Groups, Column Chunks, and Pages. A Row Group is a horizontal slice of the data, typically sized to align with an HDFS block so it can be processed by a single mapper. Each Column Chunk stores all values of one column within a Row Group and may use its own compression codec. Pages are the smallest encoding units within a Column Chunk, so different pages may use different encodings.
In addition, a Parquet file begins with a magic number and ends with a footer that stores the file schema, row‑group and column‑chunk metadata (including statistics), and other information readers need; page headers can also carry checksums.
Parquet vs ORC
Both Parquet and ORC are columnar formats, but they differ in several aspects:
Nested structure support: Parquet handles nested data efficiently, whereas ORC’s support is limited and incurs higher overhead.
ACID and update support: ORC provides ACID‑compatible updates; Parquet does not.
Compression and query performance: Both achieve comparable compression and query speed, with ORC sometimes slightly ahead.
Engine compatibility: Parquet enjoys broader support across Hive, Impala, Presto, and others, while ORC is tightly coupled with Hive and only recently gained experimental Impala support.
Choosing between them should be based on specific requirements; if ORC‑specific features are not needed, Parquet is generally recommended.
Parquet Tools
The open‑source parquet‑tools utility allows inspection of Parquet file metadata and schemas.
# Run from Hadoop
hadoop jar ./parquet-tools-<version>.jar --help
hadoop jar ./parquet-tools-<version>.jar <command> my_parquet_file.parq
# Run locally
java -jar ./parquet-tools-<version>.jar --help
java -jar ./parquet-tools-<version>.jar <command> my_parquet_file.parq
Example usage to display a schema:
$ hadoop jar parquet-tools-1.8.0.jar schema 20200515160701.parquet
message t_staff_info_partition {
  optional int64 age;
  optional binary dt (UTF8);
  optional int64 id;
  optional binary name (UTF8);
  optional binary updated_time (UTF8);
}
The tool can be obtained from Maven Central: https://mvnrepository.com/artifact/org.apache.parquet/parquet-tools
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies