Understanding ORC File Format in Hive: Structure, Storage, Indexes, Compression, and Configuration
This article explains the ORC (Optimized Row Columnar) file format used in Hive, covering its architecture, stripe and column storage, handling of complex data types, indexing mechanisms, compression streams, memory management, and key configuration parameters.
1. ORC File Format
ORC stands for Optimized Row Columnar. Using the ORC file format can improve Hive's performance when reading, writing, and processing data. Compared with RCFile, ORC offers several advantages:
Type-specific serialization and deserialization let the ORC writer encode each column according to its data type.
Multiple lightweight indexes, absent in RCFile, let the ORC reader quickly locate the data it needs and skip irrelevant data.
Because the writer knows the data type of every column, ORC supports complex structures such as maps.
Additional implementation benefits include a larger default stripe size and a memory manager for the writer.
2. ORC Data Storage Method
In an ORC-formatted Hive table, records are first split horizontally into multiple stripes. Within each stripe, data is stored column by column, and all columns share the same file. The default stripe size is 256 MB in Hive 0.13 (Hive 0.14 lowered the default to 64 MB), far larger than RCFile's 4 MB row groups, which allows larger and more efficient sequential reads.
For complex types such as Map, ORC decomposes a field into multiple sub-fields. The table below shows how the different complex types are expanded:

| Data type | Child columns |
|-----------|---------------|
| Array | A single sub-field containing all array elements |
| Map | Two sub-fields: one for keys, one for values |
| Struct | One sub-field for each attribute |
| Union | One sub-field for each possible (variant) type |
After parsing, the field types form a field tree; only leaf nodes store actual table data, which constitute the data stream (see the diagram). Metadata for each non‑leaf node is stored in the meta stream, e.g., an array’s length.
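The expansion described above can be sketched in Python. This is a minimal illustration rather than ORC's actual implementation: `TypeNode` and `flatten` are hypothetical names, and the sketch numbers the nodes in pre-order (parent before children), which is how ORC assigns column IDs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TypeNode:
    """One node of the field tree (hypothetical helper, not an ORC class)."""
    name: str
    kind: str  # e.g. "struct", "map", "array", "int", "string"
    children: List["TypeNode"] = field(default_factory=list)


def flatten(node: TypeNode,
            out: Optional[List[Tuple[int, str, str, bool]]] = None):
    """Pre-order walk returning (column_id, name, kind, is_leaf) tuples."""
    if out is None:
        out = []
    out.append((len(out), node.name, node.kind, not node.children))
    for child in node.children:
        flatten(child, out)
    return out


# Example schema: struct<id:int, tags:map<string,int>, scores:array<double>>
schema = TypeNode("root", "struct", [
    TypeNode("id", "int"),
    TypeNode("tags", "map", [TypeNode("key", "string"), TypeNode("value", "int")]),
    TypeNode("scores", "array", [TypeNode("element", "double")]),
])

columns = flatten(schema)
for col_id, name, kind, is_leaf in columns:
    print(col_id, name, kind, "leaf" if is_leaf else "non-leaf")
```

In this example only `id`, `key`, `value`, and `element` are leaves, so only those four columns would carry table data; `root`, `tags`, and `scores` would carry metadata only.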
As of Hive 0.13, ORC can read a chosen subset of columns, but it cannot read only part of a complex-typed column (for example, a single attribute of a struct).
When using ORC, each stripe should fit entirely within one HDFS block. The stripe size should therefore not exceed the HDFS block size; otherwise a stripe spans multiple blocks and part of it may have to be read remotely. If a stripe does not fit in the space remaining in the current block, the writer starts it in the next block.
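The placement rule can be illustrated with a small Python sketch. It is a simplified model, not the writer's real code: `place_stripes` is a hypothetical function, and it assumes the space left over in a block is simply skipped (padded) when the next stripe does not fit.

```python
def place_stripes(stripe_sizes, block_size):
    """Return the start offset of each stripe within the file.

    Assumption: a stripe that does not fit in the space left in the
    current block starts at the next block boundary instead.
    """
    offsets = []
    pos = 0
    for size in stripe_sizes:
        if size > block_size:
            raise ValueError("stripe larger than block: it would span blocks")
        remaining = block_size - (pos % block_size)
        if size > remaining:
            pos += remaining  # skip to the next block boundary
        offsets.append(pos)
        pos += size
    return offsets


MB = 1024 * 1024
offsets = place_stripes([200 * MB, 100 * MB, 50 * MB], block_size=256 * MB)
print([o // MB for o in offsets])  # [0, 256, 356]
```

The second stripe (100 MB) does not fit in the 56 MB left in the first block, so it starts at the 256 MB boundary; the third stripe fits behind it in the same block.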
3. Indexes
ORC adds sparse indexes to speed up data retrieval from HDFS. Two main indexes are used:
3.1 Data Statistics
The ORC reader uses this index to skip unnecessary data. The writer creates it, recording row count, max, min, sum, and for text/binary fields also length. For complex types (Array, Map, Struct, Union), statistics are kept for each sub‑field.
Data statistics exist at three levels:
File level: Stored in the file footer, summarizing column statistics for the whole file; useful for query optimization and for answering simple aggregations.
Stripe level: Per-stripe statistics let the reader decide, based on the query predicates, which stripes need to be read.
Index group level: Within a stripe, the rows of each column are divided into index groups (10,000 rows by default). Statistics are recorded for each group, enabling finer-grained filtering. The group size can be tuned.
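How a reader might use these statistics to skip data can be sketched as follows. This is an illustrative model only: `GroupStats` and `groups_to_read` are hypothetical names, and the statistics values are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupStats:
    """Min/max of one column within one index group (hypothetical structure)."""
    first_row: int
    row_count: int
    minimum: int
    maximum: int


def groups_to_read(stats: List[GroupStats], low: int, high: int) -> List[GroupStats]:
    """Keep only groups whose [min, max] range can contain low <= x <= high."""
    return [g for g in stats if g.maximum >= low and g.minimum <= high]


stats = [
    GroupStats(0, 10_000, minimum=1, maximum=950),
    GroupStats(10_000, 10_000, minimum=900, maximum=2_100),
    GroupStats(20_000, 10_000, minimum=2_050, maximum=3_000),
]

# Predicate: WHERE x BETWEEN 1000 AND 2000
selected = groups_to_read(stats, 1000, 2000)
print([g.first_row for g in selected])  # [10000]
```

Only the middle group can contain matching rows, so the other two groups are skipped without being decoded. The same overlap test applies at stripe and file level, just with coarser statistics.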
3.2 Position Pointers
When reading an ORC file, the reader needs two positions:
Start of each group's metadata and data streams within a stripe.
Start of each stripe, stored in the file footer.
4. File Compression
ORC applies a two‑stage compression: first a stream encoder, then an optional compressor (ZLIB, Snappy, or LZO). A column may be stored in one or more streams of four types:
Byte Stream: Stores raw bytes without any encoding.
Run Length Byte Stream: Stores a run of identical bytes as the byte value plus the run length.
Integer Stream: Stores integers, using run-length encoding, possibly combined with delta encoding.
Bit Field Stream: Stores booleans, one bit per value; implemented on top of the Run Length Byte Stream.
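The idea behind the Run Length Byte Stream can be shown with a short Python sketch. Note that this is a simplified run-length scheme using `(value, length)` pairs, not ORC's actual on-disk byte layout; `rle_encode` and `rle_decode` are hypothetical helper names.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(data: bytes) -> List[Tuple[int, int]]:
    """Collapse runs of identical bytes into (byte_value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(data)]


def rle_decode(pairs: List[Tuple[int, int]]) -> bytes:
    """Inverse of rle_encode."""
    return b"".join(bytes([value]) * length for value, length in pairs)


raw = b"\x00" * 6 + b"\x01" * 2 + b"\x07"
encoded = rle_encode(raw)
print(encoded)                     # [(0, 6), (1, 2), (7, 1)]
print(rle_decode(encoded) == raw)  # True
```

Nine input bytes collapse into three pairs here. Data with few repeats gains little from this encoding, which is one reason a general-purpose compressor is applied as a second stage.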
Examples:
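Integer: Uses a bit stream to mark nulls and an integer stream for the values.
String: If the number of distinct values is at most 80% of the number of non-null rows, dictionary encoding is used, involving a bit stream (null markers), a byte stream (the dictionary contents), and two integer streams (the dictionary entry lengths and the per-row dictionary indexes). Otherwise, a plain byte stream holds the values and an integer stream holds their lengths.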
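The dictionary decision for string columns can be sketched as below. This is a simplified model under stated assumptions: `use_dictionary` and `dictionary_encode` are hypothetical names, `None` stands for a null, and the fallback for an all-null column is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def use_dictionary(values: List[Optional[str]], threshold: float = 0.8) -> bool:
    """True when distinct / non-null <= threshold."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False  # assumption: nothing to encode, so no dictionary
    return len(set(non_null)) / len(non_null) <= threshold


def dictionary_encode(values: List[Optional[str]]) -> Tuple[List[str], List[Optional[int]]]:
    """Return (sorted dictionary, per-row dictionary index or None for nulls)."""
    dictionary = sorted({v for v in values if v is not None})
    position = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [None if v is None else position[v] for v in values]


cities = ["Paris", "Oslo", None, "Paris", "Paris", "Oslo"]
print(use_dictionary(cities))                      # True  (2 distinct / 5 non-null = 0.4)
print(dictionary_encode(cities))                   # (['Oslo', 'Paris'], [1, 0, None, 1, 1, 0])
print(use_dictionary(["a", "b", "c", "d", "e"]))   # False (5 / 5 = 1.0)
```

With dictionary encoding each distinct string is stored once and rows carry only small integer indexes, which pays off only when values repeat often enough.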
Compression units default to 256 KB.
5. Memory Management
The ORC writer keeps an entire stripe in memory. When many writers run concurrently, memory pressure can arise. ORC introduces a memory manager that sets a threshold; writers register their stripe size, and if total registration exceeds the threshold, each stripe is proportionally shrunk. When a writer finishes, its registration is released.
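The proportional shrinking can be modelled in a few lines of Python. This is a toy model, not Hive's implementation: the `MemoryManager` class, its method names, and the 512 MB threshold are assumptions made for the example.

```python
class MemoryManager:
    """Toy model of the writer memory manager (hypothetical API)."""

    def __init__(self, threshold_bytes: int):
        self.threshold = threshold_bytes
        self.registered = {}  # writer name -> requested stripe size in bytes

    def register(self, writer: str, stripe_size: int) -> None:
        self.registered[writer] = stripe_size

    def release(self, writer: str) -> None:
        self.registered.pop(writer, None)

    def scale(self) -> float:
        """1.0 while under the threshold, otherwise threshold / total."""
        total = sum(self.registered.values())
        return 1.0 if total <= self.threshold else self.threshold / total

    def effective_stripe_size(self, writer: str) -> int:
        return int(self.registered[writer] * self.scale())


MB = 1024 * 1024
mm = MemoryManager(threshold_bytes=512 * MB)
for name in ("w1", "w2", "w3", "w4"):
    mm.register(name, 256 * MB)

shrunk = mm.effective_stripe_size("w1")    # 4 x 256 MB requested, 512 MB allowed
mm.release("w3")
mm.release("w4")
restored = mm.effective_stripe_size("w1")  # back under the threshold

print(shrunk // MB, restored // MB)  # 128 256
```

With four writers each requesting 256 MB against a 512 MB threshold, every stripe is halved to 128 MB; once two writers finish, the remaining ones return to their full 256 MB.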
6. Configuration Parameters
| Parameter | Default Value | Description |
|-----------|---------------|-------------|
| hive.exec.orc.default.stripe.size | 256*1024*1024 | Default stripe size (this is the Hive 0.13 default; from Hive 0.14 it is 64*1024*1024) |
| hive.exec.orc.default.block.size | 256*1024*1024 | Default HDFS block size for ORC files (since Hive 0.14) |
| hive.exec.orc.dictionary.key.size.threshold | 0.8 | Threshold for using dictionary encoding on string columns |
| hive.exec.orc.default.row.index.stride | 10000 | Number of rows per index group within a stripe |
| hive.exec.orc.default.compress | ZLIB | Default compression codec for ORC files |
| hive.exec.orc.skip.corrupt.data | false | Whether to skip corrupt records (true) or throw an exception (false) |
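As a usage illustration, the parameters above can be overridden per session with HiveQL `SET` statements. The Python sketch below only builds those statements as text; the chosen values are illustrative rather than recommendations, and `to_set_statements` is a hypothetical helper.

```python
# Illustrative overrides; keys are the Hive parameters from the table above.
orc_settings = {
    "hive.exec.orc.default.stripe.size": 64 * 1024 * 1024,
    "hive.exec.orc.default.row.index.stride": 10000,
    "hive.exec.orc.default.compress": "SNAPPY",
    "hive.exec.orc.skip.corrupt.data": False,
}


def to_set_statements(settings):
    """Render a settings dict as HiveQL SET statements."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = str(value).lower()  # Hive expects true/false
        lines.append(f"SET {key}={value};")
    return lines


statements = to_set_statements(orc_settings)
print("\n".join(statements))
```

The printed statements can be pasted at the top of a Hive script so that they take effect before any ORC table is written.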
For more parameters, see the official Hive documentation.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
