
An Introduction to Apache Parquet: Architecture, Data Model, File Format, and Basic Operations

This article provides a comprehensive overview of Apache Parquet, covering its purpose, architectural components, nested data model, file structure, practical Hive commands for creating and inspecting Parquet tables, and a brief introduction to the TPC‑DS benchmark for performance testing.

Big Data Technology Architecture

Parquet is an analytical, column‑oriented storage format compatible with frameworks such as Spark, Hive, and Impala, and supports data models like Avro, Thrift, and Protocol Buffers. It has become the de‑facto standard for HDFS storage in offline data warehouses and OLAP scenarios.

Parquet Introduction – The article explains the motivation behind using Parquet, highlighting the advantages of columnar storage over row‑based storage, such as reduced I/O through column pruning and predicate push‑down, as well as more efficient compression and encoding.

Architecture Overview – Parquet is an Apache top‑level project consisting of five main modules: parquet-format, parquet-mr, parquet-cpp, parquet-rs, and parquet-compatibility. These modules define the format specification, provide read/write implementations for Hadoop, Spark, and other ecosystems, and ensure cross‑language compatibility.

Data Model – Parquet supports a nested schema similar to Protocol Buffers. Each schema contains fields with three attributes: repetition (required, repeated, optional), type (group or primitive), and name. An example schema is shown below:

message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}

File Format – A Parquet file consists of file metadata, row groups, column chunks, and data pages. A row group holds a horizontal slice of the rows; within it, each column chunk stores the values of one column; and each data page, the smallest unit of encoding and compression, holds part of a column chunk. The file begins and ends with a magic number used for validation.

Basic Operations

1. Creating Parquet tables in Hive – Example DDL: create table t1 (id int) stored as parquet;

2. Converting TextFile tables to Parquet – Example commands to drop, create, and set compression (Snappy) and block size.

3. Viewing Parquet file schema – The parquet-tools utility can display schema and other metadata. Example usage:

# Run via Hadoop
hadoop jar ./parquet-tools-<version>.jar --help
hadoop jar ./parquet-tools-<version>.jar schema my_parquet_file.par

# Run locally
java -jar ./parquet-tools-<version>.jar --help
java -jar ./parquet-tools-<version>.jar schema my_parquet_file.par

TPC‑DS Benchmark Introduction – TPC‑DS is a standard big‑data benchmark that models star and snowflake schemas with 24 tables (7 fact tables, 17 dimension tables). The article shows how to generate a 10 GB sample dataset using dsdgen and briefly mentions performance comparison between TextFile and Parquet formats.

$ cd ~/training/tpcds/v2.3.0/tools
$ nohup ./dsdgen -scale 10 -dir ~/data_10g &

Overall, the article serves as a practical guide for engineers working with Parquet in data‑warehouse environments.

Tags: Big Data, Data Modeling, Hive, columnar storage, Spark SQL, Parquet, TPC-DS
Written by Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies).