Understanding ORC File Format and Its Use in Hive and Java
This article explains the ORC (Optimized Row Columnar) file format, its advantages, internal structure, data model, compression mechanisms, and how to create Hive tables and write ORC files using Java, providing practical code examples and reference resources.
ORC (Optimized Row Columnar) is a columnar storage format introduced in Apache Hive in 2013 to reduce storage space on Hadoop and speed up Hive queries. Like Parquet, it is not purely columnar: rows are first split horizontally into groups (called stripes in ORC), and within each group the data is stored column by column.
Key advantages of ORC include high compression ratios, split‑ability for parallel processing, multiple indexing options (row group index, bloom filter), and support for complex data types such as MAP, LIST, and STRUCT.
Columnar storage stores each column’s values sequentially, allowing queries to read only the required columns, dramatically lowering I/O, enabling column‑level statistics for predicate push‑down, and improving CPU cache utilization.
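The I/O argument above can be illustrated with a toy model. The sketch below is not ORC's on-disk format; the class and field names (ColumnarLayoutSketch, Row, ColumnStore, price) are hypothetical and chosen only for illustration. It contrasts a row layout, where summing one field still touches every whole record, with a column layout, where the same query scans a single contiguous array.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy comparison of row-oriented and column-oriented layouts (not ORC's real format). */
class ColumnarLayoutSketch {

    /** Row layout: each record keeps all of its fields together. */
    static final class Row {
        final int id;
        final int price;
        final String comment;

        Row(int id, int price, String comment) {
            this.id = id;
            this.price = price;
            this.comment = comment;
        }
    }

    /** Column layout: one contiguous array per field. */
    static final class ColumnStore {
        final int[] id;
        final int[] price;
        final String[] comment;

        ColumnStore(List<Row> rows) {
            int n = rows.size();
            id = new int[n];
            price = new int[n];
            comment = new String[n];
            for (int i = 0; i < n; i++) {
                Row r = rows.get(i);
                id[i] = r.id;
                price[i] = r.price;
                comment[i] = r.comment;
            }
        }
    }

    /** SUM(price) over the row layout: every whole record is visited. */
    static long sumPriceRowWise(List<Row> rows) {
        long sum = 0;
        for (Row r : rows) {
            sum += r.price;
        }
        return sum;
    }

    /** SUM(price) over the column layout: only the price array is scanned. */
    static long sumPriceColumnWise(ColumnStore store) {
        long sum = 0;
        for (int p : store.price) {
            sum += p;
        }
        return sum;
    }

    /** Builds n sample rows with id = i and price = i * 10. */
    static List<Row> sampleRows(int n) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add(new Row(i, i * 10, "row " + i));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> rows = sampleRows(4);
        ColumnStore store = new ColumnStore(rows);
        System.out.println(sumPriceRowWise(rows));
        System.out.println(sumPriceColumnWise(store));
    }
}
```

Both methods return the same sum; the difference is how much data each has to touch. In a real file the id and comment columns would simply never be read from disk.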
An ORC file is organized hierarchically: the file itself, its stripes, the row groups within each stripe, and the streams that hold each column's data (PRESENT, DATA, LENGTH, DICTIONARY_DATA, SECONDARY, ROW_INDEX). Metadata is kept at the file, stripe, and row-group levels. Each stripe contains multiple row groups and stores every column in its own set of streams, and the file footer holds the stripe offsets and the schema information.
Three levels of statistics (file, stripe, row group) allow the query engine to skip irrelevant data based on SearchArguments, further reducing I/O.
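The skipping idea can be sketched with min/max statistics alone. The code below is a simplified model, not ORC's reader: the names RowGroupSkippingSketch, RowGroup, and mightContain are hypothetical, and a real engine would evaluate a SearchArgument against the statistics stored in the file. Here each row group records its minimum and maximum, and an equality predicate reads only the groups whose range could contain the target.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of statistics-based skipping for an equality predicate. */
class RowGroupSkippingSketch {

    /** A row group for one int column, with min/max computed when it is built. */
    static final class RowGroup {
        final int[] values;
        final int min;
        final int max;

        RowGroup(int[] values) {
            this.values = values;
            int lo = Integer.MAX_VALUE;
            int hi = Integer.MIN_VALUE;
            for (int v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }

        /** False means the group definitely has no match and can be skipped. */
        boolean mightContain(int target) {
            return target >= min && target <= max;
        }
    }

    /** Splits the values 0..totalRows-1 into consecutive row groups. */
    static List<RowGroup> buildGroups(int totalRows, int rowsPerGroup) {
        List<RowGroup> groups = new ArrayList<>();
        for (int start = 0; start < totalRows; start += rowsPerGroup) {
            int len = Math.min(rowsPerGroup, totalRows - start);
            int[] values = new int[len];
            for (int i = 0; i < len; i++) {
                values[i] = start + i;
            }
            groups.add(new RowGroup(values));
        }
        return groups;
    }

    /** Number of groups whose statistics cannot rule out the target. */
    static int countGroupsToRead(List<RowGroup> groups, int target) {
        int count = 0;
        for (RowGroup g : groups) {
            if (g.mightContain(target)) {
                count++;
            }
        }
        return count;
    }

    /** Counts matching rows, scanning only the groups that survive the statistics check. */
    static int countMatches(List<RowGroup> groups, int target) {
        int matches = 0;
        for (RowGroup g : groups) {
            if (!g.mightContain(target)) {
                continue; // skipped without reading its values
            }
            for (int v : g.values) {
                if (v == target) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = buildGroups(100, 10);
        System.out.println("groups to read: " + countGroupsToRead(groups, 42));
        System.out.println("matches: " + countMatches(groups, 42));
    }
}
```

Skipping works best when the data is sorted or clustered on the filtered column, as in this example. With randomly distributed values most min/max ranges overlap the target and few groups can be ruled out.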
Compression in ORC uses a two‑stage mechanism: stream‑level encoding followed by optional compression (ZLIB, Snappy, LZO). Different stream types (Byte Stream, Run Length Byte Stream, Integer Stream, Bit Field Stream) are used depending on column data types.
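The two stages can be sketched with the JDK alone. This is an illustration under stated assumptions, not ORC's actual encoders: the run-length scheme below (fixed 8-byte pairs of run length and value) is far simpler than ORC's integer encodings, and the class and method names are hypothetical. Stage one encodes the values; stage two applies general-purpose ZLIB compression through java.util.zip.Deflater.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Toy two-stage pipeline: run-length encoding, then ZLIB compression. */
class TwoStageCompressionSketch {

    /** Stage 1: encode ints as (runLength, value) pairs, 8 bytes per run. */
    static byte[] runLengthEncode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < values.length) {
            int run = 1;
            while (i + run < values.length && values[i + run] == values[i]) {
                run++;
            }
            byte[] pair = ByteBuffer.allocate(8).putInt(run).putInt(values[i]).array();
            out.write(pair, 0, pair.length);
            i += run;
        }
        return out.toByteArray();
    }

    /** Inverse of runLengthEncode. */
    static int[] runLengthDecode(byte[] encoded) {
        ByteBuffer in = ByteBuffer.wrap(encoded);
        int total = 0;
        while (in.remaining() >= 8) { // first pass: total number of values
            total += in.getInt();
            in.getInt();
        }
        in.rewind();
        int[] values = new int[total];
        int pos = 0;
        while (in.remaining() >= 8) { // second pass: expand each run
            int run = in.getInt();
            int value = in.getInt();
            Arrays.fill(values, pos, pos + run, value);
            pos += run;
        }
        return values;
    }

    /** Stage 2: general-purpose ZLIB compression of the encoded bytes. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inverse of deflate. */
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] column = new int[10_000]; // 10,000 identical values: a single run
        Arrays.fill(column, 7);
        byte[] encoded = runLengthEncode(column);
        byte[] compressed = deflate(encoded);
        System.out.println("raw bytes: " + column.length * 4);
        System.out.println("after encoding: " + encoded.length);
        System.out.println("after compression: " + compressed.length);
    }
}
```

The encoding stage exploits what is known about the column (here, repeated values), while the compression stage is type-agnostic. On input that is already this compact the second stage gains little, which is one reason it is optional.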
When creating Hive tables, the storage format can be specified as ORC:
CREATE TABLE ... STORED AS ORC;
ALTER TABLE ... SET FILEFORMAT ORC;
SET hive.default.fileformat=ORC;
Java code can also be used to write ORC files directly, for example:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class TestORCWriter {
    public static void main(String[] args) throws Exception {
        Path testFilePath = new Path("/tmp/test.orc");
        Configuration conf = new Configuration();

        // Schema: a struct with three int columns.
        TypeDescription schema = TypeDescription.fromString("struct<field1:int,field2:int,field3:int>");
        Writer writer = OrcFile.createWriter(testFilePath,
                OrcFile.writerOptions(conf).setSchema(schema).compress(CompressionKind.SNAPPY));

        // Rows are buffered in a batch; int columns are backed by LongColumnVector.
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector first = (LongColumnVector) batch.cols[0];
        LongColumnVector second = (LongColumnVector) batch.cols[1];
        LongColumnVector third = (LongColumnVector) batch.cols[2];
        final int BATCH_SIZE = batch.getMaxSize();

        for (int r = 0; r < 15000000; ++r) {
            int row = batch.size++;
            first.vector[row] = r;
            second.vector[row] = r * 3;
            third.vector[row] = r * 6;
            // Flush the batch to the writer once it is full.
            if (row == BATCH_SIZE - 1) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }

        // Write any rows left in a partially filled batch.
        if (batch.size != 0) {
            writer.addRowBatch(batch);
            batch.reset();
        }
        writer.close();
    }
}

In most cases, converting existing text files to ORC within Hive is the recommended approach; generating ORC files directly from Java is a special-case scenario.
References and further reading are provided at the end of the original article.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.