
Understanding ORC File Format in Hive: Structure, Storage, Indexes, Compression, and Configuration

This article explains the ORC (Optimized Row Columnar) file format used in Hive, covering its architecture, stripe and column storage, handling of complex data types, indexing mechanisms, compression streams, memory management, and key configuration parameters.

Big Data Technology & Architecture

Jan 13, 2020


1. ORC File Format

ORC stands for Optimized Row Columnar. Using the ORC file format can improve Hive's performance when reading, writing, and processing data. Compared with RCFile, ORC offers several advantages:

Type-specific serializers and deserializers let the ORC writer encode each column according to its data type.

Multiple indexes, absent in RCFile, enable the ORC reader to quickly locate needed data and skip irrelevant data.

Because the writer knows the data type, ORC supports complex structures such as maps.

Additional implementation benefits include larger default stripe size and a memory manager for the writer.

2. ORC Data Storage Method

In an ORC-formatted Hive table, records are first split horizontally into multiple stripes. Within each stripe, data is stored column by column, and all columns of a stripe live in the same file. In Hive 0.13 the default stripe size is 256 MB, far larger than RCFile's 4 MB row groups, which allows larger sequential reads and makes reading more efficient.
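The layout described above can be illustrated with a minimal Python sketch. It is not how the ORC writer is implemented: real stripes are cut by size in bytes, whereas this toy uses a hypothetical `rows_per_stripe` row count, and the helper name `to_stripes` is made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def to_stripes(rows: Sequence[Tuple], column_names: Sequence[str],
               rows_per_stripe: int) -> List[Dict[str, list]]:
    """Split rows horizontally into stripes, then pivot each stripe into columns."""
    stripes = []
    for start in range(0, len(rows), rows_per_stripe):
        chunk = rows[start:start + rows_per_stripe]
        # Inside a stripe, the values of each column are stored together.
        stripe = {name: [row[i] for row in chunk]
                  for i, name in enumerate(column_names)}
        stripes.append(stripe)
    return stripes


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
stripes = to_stripes(rows, ["id", "name"], rows_per_stripe=2)
for number, stripe in enumerate(stripes):
    print(number, stripe)
```

A query that only needs `id` can read the `id` list of each stripe and ignore `name` entirely, which is the main benefit of the column-wise layout.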

For complex types such as Map, ORC decomposes a field into multiple sub-fields (child columns). The table below shows how each complex type is expanded:

| Data type | Child columns |
| --- | --- |
| Array | A single sub-field containing all array elements |
| Map | Two sub-fields: one for keys, one for values |
| Struct | Each attribute becomes a sub-field |
| Union | Each possible member type becomes a sub-field |

After this decomposition, the field types form a tree. Only the leaf nodes hold actual table data, and these make up the data streams (see the diagram). Metadata for each non-leaf node, such as the length of an array, is stored in a metadata stream.
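As a rough illustration of the field tree, the sketch below flattens a nested type description into nodes and marks which ones are leaves. The tuple-based type notation and the child names `elem`, `key`, `value`, and `variantN` are assumptions made for this example, not ORC's internal naming.

```python
def child_columns(type_spec):
    """Return (name, type) pairs for the children of a type; primitives have none."""
    if isinstance(type_spec, str):
        return []
    kind = type_spec[0]
    if kind == "array":
        return [("elem", type_spec[1])]
    if kind == "map":
        return [("key", type_spec[1]), ("value", type_spec[2])]
    if kind == "struct":
        return list(type_spec[1].items())
    if kind == "union":
        return [(f"variant{i}", t) for i, t in enumerate(type_spec[1])]
    raise ValueError(f"unknown type: {kind}")


def flatten(name, type_spec, path=""):
    """Pre-order walk of the type tree, yielding (path, kind, is_leaf) per node."""
    full = f"{path}.{name}" if path else name
    children = child_columns(type_spec)
    kind = type_spec if isinstance(type_spec, str) else type_spec[0]
    nodes = [(full, kind, not children)]
    for child_name, child_type in children:
        nodes.extend(flatten(child_name, child_type, full))
    return nodes


schema = ("struct", {
    "id": "int",
    "tags": ("array", "string"),
    "attrs": ("map", "string", "int"),
})
nodes = flatten("row", schema)
leaves = [path for path, _, is_leaf in nodes if is_leaf]
print(leaves)
```

In this toy model only the entries in `leaves` would carry data streams, while `row`, `row.tags`, and `row.attrs` would carry metadata only.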

As of Hive 0.13, the ORC reader can read a chosen subset of columns, but it cannot read only part of a complex-typed column (for example, a single attribute of a struct).

When using ORC, each HDFS block can store one stripe. The stripe size should be smaller than the HDFS block size; otherwise a stripe may span multiple blocks, causing remote reads. If a stripe does not fit in the remaining space of a block, the writer continues in the next block.
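The block-alignment rule can be expressed as simple arithmetic. The following sketch assumes the behaviour described above (start the stripe in the next block when it does not fit in the current one); the function name `place_stripe` and the byte offsets are illustrative only.

```python
MB = 1024 * 1024


def place_stripe(block_size: int, offset: int, stripe_size: int) -> int:
    """Return the file offset at which the next stripe should start."""
    used = offset % block_size
    remaining = block_size - used
    # At a block boundary, or when the stripe fits, write in place.
    if used == 0 or stripe_size <= remaining:
        return offset
    # Otherwise skip the tail of the current block and start in the next one.
    return offset + remaining


# 200 MB already written into a 256 MB block; a 64 MB stripe no longer fits.
print(place_stripe(256 * MB, 200 * MB, 64 * MB) // MB)
```

A stripe larger than the block size cannot avoid spanning blocks, which is why the stripe size should stay below the HDFS block size.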

3. Indexes

ORC adds sparse indexes to speed up data retrieval from HDFS. Two main indexes are used:

3.1 Data Statistics

The writer builds these statistics as it writes the file, recording the row count, minimum, maximum, and sum for each column, plus the length for text and binary columns. The reader uses them to skip data that cannot match the query. For complex types (Array, Map, Struct, Union), statistics are kept for each sub-field.

Data statistics exist at three levels:

File level: Stored in the file footer, summarizing column statistics for the whole file. Useful for query optimization and for answering simple aggregations.

Stripe level: Per-stripe statistics let the reader decide, based on the query predicates, which stripes need to be read.

Index group level: Within a stripe, the rows of each column are divided into index groups (10,000 rows by default). Statistics are recorded for each group, enabling finer-grained filtering. The group size can be tuned through hive.exec.orc.default.row.index.stride.
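To show how min/max statistics allow data to be skipped, here is a small sketch at the index group level. It is a simplification under stated assumptions: a single integer column, no nulls, and a range predicate; `GroupStats`, `build_group_stats`, and `groups_to_read` are hypothetical names, not ORC APIs.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GroupStats:
    row_count: int
    minimum: int
    maximum: int


def build_group_stats(values: Sequence[int], stride: int = 10_000) -> List[GroupStats]:
    """Writer side: record statistics for every `stride` rows of a column."""
    stats = []
    for start in range(0, len(values), stride):
        group = values[start:start + stride]
        stats.append(GroupStats(len(group), min(group), max(group)))
    return stats


def groups_to_read(stats: Sequence[GroupStats], lower: int, upper: int) -> List[int]:
    """Reader side: keep only groups whose [min, max] overlaps [lower, upper]."""
    return [i for i, s in enumerate(stats)
            if s.maximum >= lower and s.minimum <= upper]


values = list(range(30_000))          # three groups of 10,000 rows each
stats = build_group_stats(values)
selected = groups_to_read(stats, 12_000, 13_000)
print(selected)
```

The same overlap test applies at the stripe and file levels; only the granularity of the statistics changes.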

3.2 Position Pointers

When reading an ORC file, the reader needs two positions:

Start of each group's metadata and data streams within a stripe.

Start of each stripe, stored in the file footer.

4. File Compression

ORC applies a two‑stage compression: first a stream encoder, then an optional compressor (ZLIB, Snappy, or LZO). A column may be stored in one or more streams of four types:

Byte Stream: Raw bytes.

Run Length Byte Stream: Stores repeated byte sequences with length.

Integer Stream: Stores integers, possibly with delta encoding.

Bit Field Stream: Stores booleans; implemented using Run Length Byte Stream.
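The idea behind the run-length byte stream and the bit field stream built on top of it can be sketched as follows. This is a deliberately simplified (count, value) scheme, not ORC's actual wire format, and the most-significant-bit-first packing in `pack_bits` is an assumption made for the example.

```python
def rle_encode(data: bytes, max_run: int = 255) -> bytes:
    """Encode bytes as (run length, value) pairs; a toy run-length scheme."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < max_run:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode by expanding each (run length, value) pair."""
    out = bytearray()
    for j in range(0, len(encoded), 2):
        count, value = encoded[j], encoded[j + 1]
        out += bytes([value]) * count
    return bytes(out)


def pack_bits(flags) -> bytes:
    """Pack booleans into bytes, eight per byte, first flag in the highest bit."""
    out = bytearray((len(flags) + 7) // 8)
    for idx, flag in enumerate(flags):
        if flag:
            out[idx // 8] |= 0x80 >> (idx % 8)
    return bytes(out)


# A column with no nulls: 24 "present" flags collapse to a single run.
present = pack_bits([True] * 24)
print(rle_encode(present))
```

Long runs of identical flags, such as a column with no nulls, shrink to a couple of bytes, which is why booleans are layered on the run-length byte stream.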

Examples:

Integer: Uses a bit field stream to mark nulls and an integer stream for the values.

String: If the number of distinct values is at most 80% of the non-null rows, dictionary encoding is used, involving a bit field stream, a byte stream, and two integer streams. Otherwise, a plain byte stream holds the values and an integer stream holds their lengths.
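The choice between dictionary and direct encoding for a string column can be sketched like this. The stream names in the returned dictionary (`present`, `dictionary_data`, `row_ids`, and so on) are labels invented for this example; only the 0.8 threshold and the number of streams come from the description above.

```python
def encode_strings(values, threshold=0.8):
    """Pick dictionary or direct encoding based on the distinct-value ratio."""
    present = [v is not None for v in values]          # bit field stream
    non_null = [v for v in values if v is not None]
    distinct = sorted(set(non_null))
    if non_null and len(distinct) <= threshold * len(non_null):
        index = {s: i for i, s in enumerate(distinct)}
        return {
            "encoding": "dictionary",
            "present": present,
            # Byte stream: all dictionary entries concatenated.
            "dictionary_data": "".join(distinct).encode("utf-8"),
            # Integer stream: length of each dictionary entry.
            "dictionary_lengths": [len(s.encode("utf-8")) for s in distinct],
            # Integer stream: dictionary position of each non-null row.
            "row_ids": [index[v] for v in non_null],
        }
    return {
        "encoding": "direct",
        "present": present,
        "data": "".join(non_null).encode("utf-8"),          # byte stream
        "lengths": [len(v.encode("utf-8")) for v in non_null],  # integer stream
    }


repetitive = encode_strings(["a", "b", "a", None, "a", "b"])
unique = encode_strings(["x", "y", "z"])
print(repetitive["encoding"], unique["encoding"])
```

With two distinct values across five non-null rows the first column falls under the threshold, while the all-unique column does not.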

Compression units default to 256 KB.

5. Memory Management

The ORC writer keeps an entire stripe in memory. When many writers run concurrently, memory pressure can arise. ORC introduces a memory manager that sets a threshold; writers register their stripe size, and if total registration exceeds the threshold, each stripe is proportionally shrunk. When a writer finishes, its registration is released.
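A minimal sketch of this proportional scaling is shown below. The class and method names are hypothetical and the sizes are abstract units rather than bytes; it only models the registration bookkeeping described above, not the real writer.

```python
class MemoryManager:
    """Toy model: scale every writer's stripe when the pool is oversubscribed."""

    def __init__(self, pool_size: int):
        self.pool_size = pool_size
        self.registered = {}

    def add_writer(self, name: str, stripe_size: int) -> None:
        self.registered[name] = stripe_size

    def remove_writer(self, name: str) -> None:
        # A finished writer releases its registration.
        self.registered.pop(name, None)

    def effective_stripe_size(self, name: str) -> int:
        total = sum(self.registered.values())
        requested = self.registered[name]
        if total <= self.pool_size:
            return requested
        # Shrink proportionally so that all stripes together fit the pool.
        return requested * self.pool_size // total


manager = MemoryManager(pool_size=100)
manager.add_writer("a", 60)
manager.add_writer("b", 60)
manager.add_writer("c", 80)
print(manager.effective_stripe_size("a"), manager.effective_stripe_size("c"))
manager.remove_writer("c")
print(manager.effective_stripe_size("a"))
```

Integer arithmetic is used for the scaling so that the result does not depend on floating-point rounding.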

6. Configuration Parameters

| Parameter | Default Value | Description |
| --- | --- | --- |
| hive.exec.orc.default.stripe.size | 256*1024*1024 | Default stripe size (Hive 0.13 default; Hive 0.14 lowered it to 64*1024*1024) |
| hive.exec.orc.default.block.size | 256*1024*1024 | Default HDFS block size for ORC files (since Hive 0.14) |
| hive.exec.orc.dictionary.key.size.threshold | 0.8 | Threshold for using dictionary encoding on string columns |
| hive.exec.orc.default.row.index.stride | 10000 | Number of rows per index group within a stripe |
| hive.exec.orc.default.compress | ZLIB | Default compression codec for ORC files |
| hive.exec.orc.skip.corrupt.data | false | Whether to skip corrupt records (true) or throw an exception (false) |

For more parameters, see the official Hive documentation.
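As a convenience, the parameters above can be rendered as Hive `SET` statements for a session. The helper below is a hypothetical sketch, and the chosen values are illustrative rather than recommendations.

```python
def render_set_statements(settings: dict) -> list:
    """Turn a mapping of Hive parameters into `SET key=value;` statements."""
    return [f"SET {key}={value};" for key, value in settings.items()]


# Example values only; tune them for the actual workload.
settings = {
    "hive.exec.orc.default.stripe.size": 64 * 1024 * 1024,
    "hive.exec.orc.default.row.index.stride": 10000,
    "hive.exec.orc.default.compress": "SNAPPY",
}
statements = render_set_statements(settings)
print("\n".join(statements))
```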


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
