
An Overview of Apache Parquet: Architecture, Features, and Comparison with ORC

Apache Parquet is a language‑agnostic, columnar storage format for the Hadoop ecosystem. This overview covers its high compression ratios, the I/O savings from column and predicate push‑down, its support for nested structures, its three‑layer architecture, a comparison with ORC, and tooling for inspecting file schemas.


Parquet is a mainstream column‑oriented storage format in the Hadoop ecosystem, originally co‑developed by Twitter and Cloudera and graduated to an Apache top‑level project in May 2015.

It supports nested data structures and is especially suited for OLAP workloads, where columnar storage and scanning provide performance benefits.

The format offers two main advantages: high compression ratios (roughly 11× with gzip, 27× with snappy, and about 19× even with no compression codec, thanks to columnar encoding) and reduced I/O through column‑level and predicate push‑down, so only the required columns are read.

Column‑level push‑down means that only the needed columns are scanned during retrieval, while predicate push‑down filters out rows as early as possible; together they minimize unnecessary data scans and improve performance, especially on wide tables with many columns.
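As a toy illustration of these two optimizations (a conceptual sketch only, not Parquet's actual implementation; the table and `scan` function are invented for this example), consider a column-per-list layout in Python:

```python
# Toy columnar layout: each column is a separate list, as in Parquet's
# column chunks. This is a conceptual sketch, not the real on-disk format.
table = {
    "name": ["alice", "bob", "carol"],
    "age":  [34, 28, 45],
    "city": ["NYC", "SF", "LA"],
}

def scan(table, columns, predicate=None):
    """Read only the requested columns; apply the predicate during the scan."""
    n = len(next(iter(table.values())))
    out = []
    for i in range(n):
        row = {c: table[c][i] for c in columns}
        if predicate is None or predicate(row):
            out.append(row)
    return out

# Only 'name' and 'age' are touched; 'city' is never read (column push-down),
# and rows failing age > 30 are dropped while scanning (predicate push-down).
result = scan(table, ["name", "age"], lambda r: r["age"] > 30)
# → [{'name': 'alice', 'age': 34}, {'name': 'carol', 'age': 45}]
```

In a real Parquet reader the predicate is additionally checked against per-chunk min/max statistics, so whole row groups can be skipped without decoding them.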

Parquet is language‑agnostic and not tied to any specific processing framework. It integrates with many query engines (Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL) and compute frameworks (MapReduce, Spark, Cascading, Crunch, Scalding, Kite), and works with various data models such as Avro, Thrift, Protocol Buffers, and POJOs.

The project is organized into three layers: the data‑storage layer (parquet‑format defines metadata, primitive types, page types, encoding, compression, etc.), the object‑conversion layer (parquet‑mr maps external object models to Parquet’s internal model using striping and assembly algorithms), and the object‑model layer (adapters for Avro, Thrift, Protocol Buffer, Hive SerDe, etc.). For example, the parquet‑pig module serializes Pig Tuples into Parquet columns and deserializes them back.

Parquet’s schema is expressed with a message definition where each field has a repetition attribute (required/repeated/optional), a primitive or group type, and a name. An example schema is shown below:

message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}
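To make the "striping" idea from the three‑layer architecture concrete, here is a simplified Python sketch that shreds AddressBook‑style records into per‑column value lists. Real Parquet additionally records repetition and definition levels to reconstruct nesting, which are omitted here, and the sample records are invented for illustration:

```python
# Simplified "striping": shred nested records into flat per-column lists.
# Real Parquet also stores repetition/definition levels (omitted here).
records = [
    {"owner": "Julien", "ownerPhoneNumbers": ["555-1234"],
     "contacts": [{"name": "Dmitriy", "phoneNumber": "555-9999"},
                  {"name": "Chris"}]},
]

columns = {"owner": [], "ownerPhoneNumbers": [],
           "contacts.name": [], "contacts.phoneNumber": []}

for rec in records:
    columns["owner"].append(rec["owner"])
    columns["ownerPhoneNumbers"].extend(rec.get("ownerPhoneNumbers", []))
    for contact in rec.get("contacts", []):
        columns["contacts.name"].append(contact["name"])
        # optional field: None when absent (a definition level of 0 in Parquet)
        columns["contacts.phoneNumber"].append(contact.get("phoneNumber"))
```

The "assembly" direction reverses this process, rebuilding nested records from the column lists and their levels.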

The storage model consists of Row Groups, Column Chunks, and Pages. Row Groups align with HDFS block sizes and are processed by a single mapper. Each Column Chunk stores data of a single column and may use different compression. Pages are the smallest encoding units within a column chunk, and each can use a distinct encoding.
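A rough sketch of this hierarchy in Python (class and field names are illustrative, not Parquet's actual metadata structures):

```python
# Illustrative model of Parquet's physical hierarchy:
# RowGroup -> ColumnChunk -> Page. Names are invented for this sketch.
from dataclasses import dataclass, field

@dataclass
class Page:                       # smallest unit of encoding
    encoding: str                 # each page may use a distinct encoding
    values: list

@dataclass
class ColumnChunk:                # all data for one column in a row group
    column: str
    compression: str              # each chunk may pick its own codec
    pages: list = field(default_factory=list)

@dataclass
class RowGroup:                   # horizontal slice, sized to an HDFS block
    chunks: list = field(default_factory=list)

rg = RowGroup(chunks=[
    ColumnChunk("age", "snappy",
                [Page("PLAIN", [34, 28]), Page("RLE_DICTIONARY", [45])]),
])
```

Because one row group maps to one HDFS block, a single mapper can process it without remote reads; because encoding is chosen per page, a writer can switch encodings mid‑column when the data distribution changes.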

Parquet files begin and end with the 4‑byte magic number PAR1; the footer holds the file metadata, including the schema and the locations of row groups, followed by the footer length and the trailing magic number.
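A minimal Python sketch can verify this framing by checking the magic bytes at both ends and reading the 4‑byte little‑endian footer‑length field (the function name is invented; this validates only the framing, not the Thrift‑encoded metadata itself):

```python
# Check the PAR1 magic at both ends of a file and read the footer length.
# File tail layout: <footer bytes> <4-byte LE footer length> "PAR1".
import struct

def check_parquet_framing(path):
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, 2)                         # last 8 bytes of the file
        footer_len = struct.unpack("<I", f.read(4))[0]
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1", footer_len
```

Tools like parquet-tools read the footer length first, then seek backwards to parse the metadata block it describes.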

When compared with ORC, Parquet offers better nested‑structure support, while ORC provides ACID and update capabilities. Compression and query performance are comparable, though ORC may have a slight edge. Query‑engine support differs: Parquet works with Hive, Impala, Presto, etc., whereas ORC is tightly coupled with Hive and only recently gained experimental support in Impala.

For inspecting Parquet files, the open‑source parquet‑tools utility can display metadata and schema. Example usage:

# Run from Hadoop
hadoop jar ./parquet-tools-<version>.jar --help
hadoop jar ./parquet-tools-<version>.jar <command> my_parquet_file.parq

# Run locally
java -jar ./parquet-tools-<version>.jar --help
java -jar ./parquet-tools-<version>.jar <command> my_parquet_file.parq

Running hadoop jar parquet-tools-1.8.0.jar schema 20200515160701.parquet yields a schema such as:

message t_staff_info_partition {
  optional int64 age;
  optional binary dt (UTF8);
  optional int64 id;
  optional binary name (UTF8);
  optional binary updated_time (UTF8);
}

Further details can be found in the official Parquet documentation and related articles.

Tags: Big Data, columnar storage, parquet, Data Formats, ORC Comparison, Apache Hadoop
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
