Big Data 22 min read

Mastering Apache Parquet: Columnar Storage, Nested Data, and Performance Gains

This article explains Apache Parquet’s columnar storage format, its support for nested data models, the underlying striping/assembly algorithm, file structure, push‑down optimizations, and performance advantages within the Hadoop ecosystem, providing a comprehensive guide for big‑data practitioners.

dbaplus Community

May 26, 2016

Mastering Apache Parquet: Columnar Storage, Nested Data, and Performance Gains

What Is Apache Parquet?

Apache Parquet is a column‑oriented storage format designed for the Hadoop ecosystem. It works with most compute frameworks (Hadoop, Spark, Hive, Impala, etc.) and query engines, and is language‑ and platform‑agnostic. Originating from a collaboration between Twitter and Cloudera, Parquet graduated to a top‑level Apache project in May 2015, with the latest stable release being 1.8.1.

Inspiration and Nested Data Model

Parquet’s design is inspired by Google’s 2010 Dremel paper, which introduced a columnar format capable of handling nested structures. Modern big‑data workloads often produce deeply nested logs (e.g., Twitter logs with seven‑level nesting), making native support for structs, maps, and arrays essential for efficient querying.

Columnar Storage Benefits

Storing data by column rather than by row enables several optimizations:

Uniform column types allow specialized encoding and compression, reducing storage space.

Projection push‑down reads only the required columns, cutting I/O and enabling predicate push‑down.

CPU‑friendly encodings improve cache utilization.

Parquet Project Components

Parquet consists of multiple sub‑projects:

parquet-format : Java implementation defining metadata objects, serialized with Apache Thrift.

parquet-mr : Java library providing read/write support and Hadoop Input/Output formats, Hive SerDe, Pig loaders, etc.

parquet-compatibility : Test code for cross‑language compatibility (Java and C/C++).

parquet-cpp : C++ library for reading and writing Parquet files.

Data Model and Schema

Parquet schemas consist of fields that can be required, optional, or repeated. Types are either primitive or group. The Dremel‑style schema enables representation of complex nested data without explicit map or array types, using repeated groups to model such structures.

Striping/Assembly Algorithm

Parquet reconstructs records from columnar storage using repetition and definition levels:

Repetition Level : Indicates the depth at which a value is not shared with the previous value, enabling the creation of new repeated nodes during read.

Definition Level : Marks the depth where a value becomes undefined (NULL), allowing placeholders for missing optional or repeated fields.

These levels keep encoding compact and support efficient storage of sparse or repeated data.

Parquet File Structure

Parquet files are binary and self‑describing. Key components include:

HDFS Block : Smallest replication unit on HDFS.

Row Group : A logical division of rows stored together; typically aligns with a mapper’s input split.

Column Chunk : All values of a column within a row group, stored contiguously.

Page : Smallest unit of encoding inside a column chunk (data page, dictionary page, index page).

Push‑Down Optimizations

Parquet natively supports:

Projection Push‑Down : Only the columns required by a query are read, reducing I/O.

Predicate Push‑Down : Column statistics (min, max, null count) are stored in metadata, allowing the engine to skip entire row groups that cannot satisfy filter predicates.

Additional future enhancements include Bloom filters and richer statistics to further improve predicate push‑down.

Performance Characteristics

Compared with row‑oriented formats, Parquet offers higher compression ratios and lower I/O due to columnar encoding and push‑down capabilities. Benchmarks on TPC‑H and TPC‑DS datasets show Parquet achieving smaller file sizes and faster query times than formats like RC or ORC, especially when queries target a subset of columns.

Project Evolution

Since its 2012 inception by Twitter and Cloudera, Parquet has attracted contributors such as Criteo and many open‑source query engines. The project is now advancing to version 2.0, introducing a new page format, additional primitive types (Decimal, Timestamp), and richer metadata like Bloom filters.

Conclusion

Parquet provides a robust columnar storage solution for big‑data OLAP workloads, offering efficient encoding, compression, and push‑down optimizations that translate into tangible performance gains. Its native support for nested schemas makes it especially suitable for modern, complex data pipelines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization Big Data data compression Hadoop nested data Apache Parquet

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.