Improving Hive Storage Efficiency: From RCFile to ORCFile at Facebook
Facebook’s data warehouse, storing over 300 PB and growing by 600 TB daily, transitioned from the RCFile format to an optimized ORCFile implementation, achieving 5‑8× better compression and up to three‑fold faster write performance while maintaining high read efficiency.
RCFile
Facebook initially stored Hive tables using its own Record‑Columnar File Format (RCFile), a hybrid layout that allows row‑wise queries while providing column‑wise compression efficiency. Data is first split into row groups, then each group is vertically partitioned by column, resulting in contiguous column blocks on disk.
When all columns of a row group are flushed, RCFile compresses each column with algorithms similar to zlib/lzo. Lazy decompression skips columns that are not needed by a query, delivering up to a 5× compression ratio in Facebook’s tests.
Beyond RCFile: What Comes Next?
To further improve compression as data volumes grew, engineers explored column‑level encoding techniques such as run‑length encoding, dictionary encoding, and frame‑of‑reference encoding, as well as new column types like structured JSON. Experiments showed that proper column‑level encoding could significantly boost RCFile’s compression ratio.
Hortonworks was also developing a similar approach with ORCFile, providing a solid foundation for Facebook’s new storage format.
ORCFile Details
When Hive writes data in ORCFile format, the data is divided into 256 MB stripes (analogous to RCFile row groups). Within each stripe, columns are encoded first, then the entire stripe is compressed using a zlib‑like algorithm. String columns use dictionary encoding across the whole stripe, and each stripe stores an index for every 10,000 rows, recording min/max values per column to enable predicate pruning.
Unlike RCFile, ORCFile records column and row offsets instead of using delimiter bytes, eliminating the need for special ASCII separators and allowing the query engine to leverage stripe‑level metadata for faster scans.
Adaptive Column Encoding
Testing revealed that a uniform dictionary encoding for all string columns could cause data bloat when column entropy is high. Facebook therefore adopted a dynamic approach: either statically specify encoding via column metadata or decide at runtime based on sampled column values. The latter was chosen for compatibility with the large existing table set.
For each stripe, a threshold on the number of distinct values determines whether dictionary encoding is applied. If the character set of a column is small, generic compressors like zlib already achieve good ratios, making dictionary encoding unnecessary or even harmful.
For large integer columns, run‑length encoding or dictionary encoding is used; dictionary encoding excels when the column contains few distinct values. Applying these heuristics to both string and numeric columns yields substantial compression gains for ORCFile.
Additional experiments included adaptive run‑length encoding (only used when it improves compression) and evaluating stripe size impact. Increasing stripe size beyond 256 MB did not significantly improve compression because larger stripes increase dictionary size, offsetting benefits.
Write Performance
To keep write throughput high at Facebook’s scale, the open‑source ORCFile write path was optimized by eliminating redundant operations and improving memory usage.
The most critical improvement was replacing the red‑black tree used for ordered dictionaries with a hash map, sorting only when necessary. This reduced dictionary memory by 30 % and increased write speed by 1.4×. Further gains came from using the Airlift library’s Slice class for efficient memory copies, adding another 20‑30 % improvement.
Column‑wise dictionary encoding is computationally expensive; Facebook therefore decides per‑stripe whether to apply it, reusing the decision for subsequent stripes when appropriate. The optimized ORCFile also allows lowering the zlib compression level while still achieving better overall compression, yielding an additional 20 % write‑performance boost.
Read Performance
Lazy column decompression is crucial for queries that filter on a subset of columns. By leveraging ORCFile’s index strides, Facebook’s read path only decompresses and decodes the columns involved in the filter, skipping the rest.
This lazy strategy made simple select queries three times faster than the open‑source version and outperformed RCFile in the same workload.
Summary
Applying these enhancements, Facebook’s ORCFile achieves a 5‑8× compression improvement over RCFile and up to three‑fold faster write performance on typical warehouse workloads.
The new format has already been deployed on dozens of petabytes of tables, freeing tens of petabytes of storage capacity and delivering both storage efficiency and read/write speed gains. Facebook’s ORCFile code is open‑source and contributions are being merged into Apache Hive.
What’s Next?
Future work includes supporting newer compression codecs such as LZ4HC, using different codecs per column, storing richer statistics for query engines, and bringing predicate push‑down from the open‑source community into Facebook’s stack. Additional ideas involve reducing logical redundancy between source and derived tables, sampling cold datasets, and adding native Hive data types for commonly stored string data.
The effort was a collaboration between Facebook’s Analytics Infrastructure team (authors Pamela Vagata, Kevin Wilfong, and Sambavi Muthukrishnan) and colleagues at Hortonworks.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Art of Distributed System Architecture Design
Introductions to large-scale distributed system architectures; insights and knowledge sharing on large-scale internet system architecture; front-end web architecture overviews; practical tips and experiences with PHP, JavaScript, Erlang, C/C++ and other languages in large-scale internet system development.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
