How a Student Built an ORC Chunk Writer for StarRocks: Insights from Open Source Summer

In this interview, graduate student Sun Yinzhen shares how he selected, designed, and implemented an ORC Chunk Writer for the StarRocks database during the Open Source Summer program, detailing the technical challenges, learning outcomes, and his perspective on open‑source contributions for computer science students.

StarRocks

Feb 29, 2024

Project Overview

The ORC Chunk Writer adds support for writing StarRocks Chunk data into ORC files, a columnar storage format widely used in the Hadoop ecosystem. The writer serializes each Chunk (a batch of tuples) into ORC stripes through the ORC write API, so StarRocks data can be stored compactly in a format that Hive can query directly.

Key Technical Concepts

Chunk abstraction: In StarRocks, a Chunk is the minimal unit processed by operators. It stores columnar tuples in memory and is the input to the writer.
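To make the idea concrete, here is a minimal Python sketch of a columnar batch. The `Chunk` class, its fields, and its methods are hypothetical stand-ins for illustration; the real StarRocks Chunk is a C++ class with a different interface.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a StarRocks Chunk: one list per column."""

    columns: Dict[str, List[Any]] = field(default_factory=dict)

    def num_rows(self) -> int:
        # Every column holds the same number of rows; an empty chunk has zero.
        return len(next(iter(self.columns.values()), []))

    def append_row(self, row: Dict[str, Any]) -> None:
        # Scatter one tuple across the column vectors.
        for name, value in row.items():
            self.columns.setdefault(name, []).append(value)

    def row(self, i: int) -> Dict[str, Any]:
        # Gather one tuple back from the column vectors.
        return {name: values[i] for name, values in self.columns.items()}


chunk = Chunk()
chunk.append_row({"id": 1, "name": "a"})
chunk.append_row({"id": 2, "name": "b"})
```

The point of the layout is that a writer for a columnar format can read `chunk.columns["id"]` as one contiguous vector instead of reassembling it from individual tuples.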

Vectorized execution engine: Operators execute in a vectorized pipeline, which improves CPU cache utilization and reduces virtual function overhead.
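The shape of a column-at-a-time operator can be sketched in a few lines of Python. The function names here are invented for illustration, and plain Python lists will not show the cache or dispatch benefits themselves; the sketch only shows that a filter touches one column to build a selection vector and then applies it to the others.

```python
from typing import List


def select_greater_than(column: List[int], threshold: int) -> List[int]:
    # One pass over a single column yields a selection vector of row indexes.
    return [i for i, value in enumerate(column) if value > threshold]


def take(column: list, selection: List[int]) -> list:
    # Apply the selection vector to any column of the same batch.
    return [column[i] for i in selection]


prices = [5, 20, 7, 42]
names = ["a", "b", "c", "d"]

selection = select_greater_than(prices, 10)
filtered_names = take(names, selection)
```

A row-at-a-time engine would instead call the predicate once per tuple, paying the per-call overhead for every row rather than once per batch.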

ORC file format: ORC organizes data into stripes, each containing a batch of rows stored column by column, so a columnar Chunk maps naturally onto the rows of a stripe.
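In the ORC libraries, the writer decides where a stripe ends based on a configured stripe size, so appended batches and stripes need not align one to one. The toy model below illustrates that buffering behavior. `StripeBuffer` is a hypothetical class, and it cuts stripes by row count as a simplification, whereas a real writer works from buffered bytes.

```python
from typing import Dict, List


class StripeBuffer:
    """Toy model: buffers column values and cuts a stripe every N rows."""

    def __init__(self, rows_per_stripe: int) -> None:
        self.rows_per_stripe = rows_per_stripe
        self.pending: Dict[str, list] = {}
        self.stripes: List[Dict[str, list]] = []

    def add_batch(self, batch: Dict[str, list]) -> None:
        # Append the batch column by column, then cut any full stripes.
        for name, values in batch.items():
            self.pending.setdefault(name, []).extend(values)
        while self._pending_rows() >= self.rows_per_stripe:
            self._flush(self.rows_per_stripe)

    def close(self) -> None:
        # The last stripe may be shorter than the configured size.
        if self._pending_rows() > 0:
            self._flush(self._pending_rows())

    def _pending_rows(self) -> int:
        return len(next(iter(self.pending.values()), []))

    def _flush(self, n: int) -> None:
        self.stripes.append({k: v[:n] for k, v in self.pending.items()})
        self.pending = {k: v[n:] for k, v in self.pending.items()}


buf = StripeBuffer(rows_per_stripe=3)
buf.add_batch({"id": [1, 2], "name": ["a", "b"]})
buf.add_batch({"id": [3, 4], "name": ["c", "d"]})
buf.close()
```

Here two batches of two rows each end up as one stripe of three rows and a final stripe of one row.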

Asynchronous I/O model: StarRocks pipelines delegate I/O to a global thread pool. The writer must submit asynchronous write tasks rather than performing blocking I/O.
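The submit-and-return pattern can be sketched with Python's standard `concurrent.futures` module. `IO_POOL`, `write_bytes`, and `submit_write` are names assumed for this sketch; StarRocks' own thread pool and task types are C++ and are not shown here.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for a process-wide I/O pool; the name and size are assumptions.
IO_POOL = ThreadPoolExecutor(max_workers=2)


def write_bytes(path: str, payload: bytes) -> int:
    # Blocking write, executed on an I/O thread rather than the caller's thread.
    with open(path, "ab") as f:
        return f.write(payload)


def submit_write(path: str, payload: bytes) -> Future:
    # Returns immediately; the caller inspects the Future later.
    return IO_POOL.submit(write_bytes, path, payload)


path = os.path.join(tempfile.mkdtemp(), "demo.bin")
future = submit_write(path, b"stripe-bytes")

# A real pipeline would poll the future or register a callback instead of
# blocking here; result() is used only to keep the sketch deterministic.
written = future.result()
```

The calling thread stays free to process the next batch while the write runs, which is the property the pipeline depends on.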

Development Challenges and Solutions

Understanding ORC read/write flow and Chunk APIs: The developer was initially unfamiliar with both, and used the existing Parquet writer implementation as a reference for the required ORC APIs and Chunk manipulation functions.

StarRocks testing procedures: With the project mentor's help, the developer clarified the distinction between unit tests (functional correctness) and benchmark tests (performance evaluation).

Integrating with the pipeline: The developer learned how coroutine scheduling and the global I/O thread pool work in order to enqueue asynchronous ORC write tasks correctly.

Implementation Details

Read the Chunk data structure and extract column vectors.

Map each column vector to the corresponding ORC column type.

Create an ORC Writer instance with appropriate schema and compression settings.

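For each Chunk, copy its rows into an ORC row batch and hand it to the writer with Writer::addRowBatch (or the equivalent API; the ORC C++ library names it Writer::add), which groups the rows into stripes.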

Submit the write operation to the StarRocks I/O thread pool to achieve non‑blocking persistence.
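The steps above can be sketched end to end in Python. Everything here is an assumption made for illustration: the type-mapping table is not StarRocks' actual mapping, `ToyOrcWriter` is a hypothetical stand-in that only records what it receives, and the loop runs synchronously rather than through an I/O thread pool. Only the `struct<name:type,...>` schema string follows the syntax ORC itself uses for type descriptions.

```python
from typing import Dict, List, Tuple

# Illustrative mapping from SQL-style column types to ORC type names.
# This is an assumption for the sketch, not StarRocks' actual mapping table.
ORC_TYPE_NAMES: Dict[str, str] = {
    "BOOLEAN": "boolean",
    "INT": "int",
    "BIGINT": "bigint",
    "DOUBLE": "double",
    "VARCHAR": "string",
}


def build_orc_schema(columns: List[Tuple[str, str]]) -> str:
    # Walk the columns and map each type; fail loudly on unsupported types.
    fields = []
    for name, sql_type in columns:
        if sql_type not in ORC_TYPE_NAMES:
            raise ValueError(f"unsupported column type: {sql_type}")
        fields.append(f"{name}:{ORC_TYPE_NAMES[sql_type]}")
    return "struct<" + ",".join(fields) + ">"


class ToyOrcWriter:
    """Hypothetical stand-in for an ORC writer; it only records its input."""

    def __init__(self, schema: str, compression: str = "zlib") -> None:
        self.schema = schema
        self.compression = compression
        self.batches: List[Dict[str, list]] = []
        self.closed = False

    def add_row_batch(self, batch: Dict[str, list]) -> None:
        if self.closed:
            raise RuntimeError("writer is already closed")
        self.batches.append(batch)

    def close(self) -> None:
        self.closed = True


schema = build_orc_schema([("id", "BIGINT"), ("name", "VARCHAR")])
writer = ToyOrcWriter(schema, compression="zlib")
for chunk in [{"id": [1, 2], "name": ["a", "b"]}, {"id": [3], "name": ["c"]}]:
    writer.add_row_batch(chunk)
writer.close()
```

Rejecting an unmapped type while building the schema, rather than when the first batch arrives, keeps a half-written file from being produced for an unsupported table.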

Testing and Validation

Two test categories were used:

Unit tests: Verify that the writer correctly serializes column data, respects schema, and produces a valid ORC file that can be read back.

Benchmark tests: Measure write throughput and latency against the existing Parquet writer, to check that asynchronous I/O does not become a bottleneck.
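The difference between the two categories can be illustrated with a small Python sketch. The `encode` and `decode` functions use JSON purely as a placeholder codec standing in for the writer and reader under test, and all names are hypothetical; the actual StarRocks tests are written in C++ against real ORC files.

```python
import json
import time
from typing import Callable, Dict

Columns = Dict[str, list]


def encode(columns: Columns) -> bytes:
    # Placeholder codec (JSON), standing in for the writer under test.
    return json.dumps(columns, sort_keys=True).encode("utf-8")


def decode(data: bytes) -> Columns:
    return json.loads(data.decode("utf-8"))


def check_round_trip(columns: Columns) -> bool:
    # Unit-test style check: what is read back must equal what was written.
    return decode(encode(columns)) == columns


def rows_per_second(write: Callable[[Columns], bytes],
                    columns: Columns, repeats: int) -> float:
    # Benchmark-style measurement: rows written divided by wall-clock time.
    rows = len(next(iter(columns.values()), [])) * repeats
    start = time.perf_counter()
    for _ in range(repeats):
        write(columns)
    elapsed = time.perf_counter() - start
    return rows / elapsed if elapsed > 0 else float("inf")


sample = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
round_trip_ok = check_round_trip(sample)
throughput = rows_per_second(encode, sample, repeats=100)
```

The round-trip check has a single correct answer, while the throughput number is only meaningful relative to a baseline measured the same way, such as the Parquet writer on the same data.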

Technical Takeaways

Deep familiarity with StarRocks’ vectorized engine and Chunk abstraction is essential for extending storage capabilities.

Understanding columnar file formats (ORC, Parquet) simplifies mapping in‑memory structures to on‑disk representations.

Proper use of StarRocks’ coroutine and global I/O thread pool is required for efficient asynchronous writes.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
