
Building Data Systems with Apache Arrow: Architecture, Memory Format, and Execution

This article explains how Apache Arrow's columnar, cross-language in-memory format supports high-performance, interoperable data systems as an alternative to traditional row-oriented databases. It covers dynamic schemas, zero-copy data exchange, indexing, Acero-based query execution, and Flight/ADBC connectivity, and it closes with practical guidance and known challenges.

Sohu Tech Products

This article introduces the motivation for building a new data system to address performance and interoperability challenges in large-scale data processing. It explains why traditional "one size fits all" databases (e.g., Oracle, MySQL) cannot serve every workload equally well when requirements span OLTP, OLAP, stream processing, and NoSQL.

The core technology presented is Apache Arrow, an open-source, cross-language, columnar in-memory format that enables zero-copy data exchange between systems. Because every Arrow implementation agrees on the same memory layout, data can cross language boundaries without being serialized and deserialized.
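To make the zero-copy idea concrete, here is a minimal sketch that uses only the Python standard library. It does not use Arrow's API: the `array` buffer merely stands in for a fixed-width Arrow column, and the variable names are illustrative assumptions. It contrasts handing a consumer a view of the same memory with serializing the column into an independent copy.

```python
from array import array

# One "column" of 64-bit integers in a single contiguous buffer,
# loosely analogous to the value buffer of a fixed-width Arrow column.
timestamps = array("q", [1700000000, 1700000060, 1700000120, 1700000180])

# Zero-copy hand-off: the consumer gets a view of the same memory.
view = memoryview(timestamps)
window = view[1:3]  # slicing a memoryview does not copy either

# A write through the original buffer is visible through the view,
# which shows that both refer to the same bytes.
timestamps[1] = 42
sees_write = window[0] == 42

# Serialization path: the bytes are copied into a new, independent buffer.
copied = array("q")
copied.frombytes(timestamps.tobytes())

# Later writes to the original no longer reach the copy.
timestamps[2] = 7
```

The view costs no extra memory proportional to the data, while the serialized copy doubles it and goes stale as soon as the source changes. Arrow's buffers are normally treated as immutable, so the write here is only a way to demonstrate sharing.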

Key topics covered:

1. Memory Data Formats: Comparison of row-oriented and column-oriented storage, with diagrams showing how the same three-column table (session_id, timestamp, ip) is laid out in memory for each format. Row storage suits transactional workloads that read and write whole records, while column storage reduces I/O for analytical queries that scan only a few columns.

2. Advantages of Apache Arrow: Multi-language support (C++, Rust, Python, etc.), broad ecosystem adoption (PyTorch, Spark, ClickHouse, DuckDB), good performance, extensibility (custom types, user-defined functions), and a vibrant open-source community.

3. Dynamic Schema (Read-time Modeling): Instead of pre-defining schemas (write-time modeling), Arrow enables storing raw logs and extracting fields on demand, which is essential for heterogeneous log formats (Nginx, Apache, IIS) and evolving data structures.

4. Data System Architecture: The system stores events (timestamp, raw payload, metadata) in Parquet files, which Arrow libraries can read and write natively. Record batches represent in-memory data; schemas can evolve between batches, allowing flexible data ingestion.

5. Indexing and Query Execution: Timestamp and inverted indexes are built on top of Arrow. SQL parsing is performed with external tools (ANTLR, Calcite) to generate abstract syntax trees, which are then transformed into logical and physical plans. Arrow's push-based Acero engine executes the physical plans and supports extensible execution nodes.

6. Extensions for Dynamic Schemas: Custom Acero nodes (e.g., Schemaless SinkNode) and delayed output schema generation enable processing of data without a fixed schema. Additional work adds dynamic support to aggregation and scalar functions.

7. Data Transfer: Arrow Flight and Flight SQL provide columnar data exchange over gRPC/REST, avoiding row-column conversion overhead. Going forward, Arrow ADBC aims to offer JDBC/ODBC-like connectivity.

8. Practical Tips and Pitfalls: Covers the frequent pace of community updates, the three-layer architecture of Arrow (Core, Compute, Acero), stability considerations, and a recommendation to contribute improvements upstream. For new data products, the advice is to prefer Arrow Rust and the DataFusion engine.
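Read-time modeling and per-batch schemas (topics 3 and 4 above) can be sketched as follows. This is an illustration in plain Python rather than the system's actual implementation or Arrow's API: the sample events, the two extraction patterns, and the helper names `extract` and `to_batch` are all assumptions made for the example. Raw payloads are stored untouched, fields are extracted only when needed, and each resulting column-oriented batch carries whatever schema its rows happen to produce.

```python
import re

# Raw events are stored as-is: a timestamp plus an unparsed payload.
events = [
    {"timestamp": 1700000000, "raw": '10.0.0.1 - - "GET /index.html" 200'},
    {"timestamp": 1700000060, "raw": '10.0.0.2 - - "POST /login" 401'},
    {"timestamp": 1700000120, "raw": "level=error service=auth msg=timeout"},
]

# Hypothetical extraction rules, applied at read time rather than at ingest.
ACCESS_LOG = re.compile(
    r'^(?P<ip>\S+) .* "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d{3})$'
)
KEY_VALUE = re.compile(r"(\w+)=(\S+)")


def extract(raw):
    """Return the fields found in one raw payload as a dict."""
    match = ACCESS_LOG.match(raw)
    if match:
        return match.groupdict()
    return dict(KEY_VALUE.findall(raw))


def to_batch(rows):
    """Pivot row dicts into a column-oriented batch: {column name: values}."""
    names = list(dict.fromkeys(key for row in rows for key in row))
    return {name: [row.get(name) for row in rows] for name in names}


parsed = [{"timestamp": e["timestamp"], **extract(e["raw"])} for e in events]

# Two batches from the same stream end up with different schemas.
batch_a = to_batch(parsed[:2])
batch_b = to_batch(parsed[2:])
schema_a = list(batch_a)
schema_b = list(batch_b)

# A query that needs one field touches only that column.
statuses = batch_a["status"]
```

Here `schema_a` contains the access-log fields and `schema_b` the key-value fields, with `timestamp` common to both. A real system would additionally need typed columns and a policy for reconciling schemas across batches, which this sketch leaves out.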

The presentation concludes with a summary of the benefits and challenges of using Apache Arrow as the foundation for modern data systems.

Tags: Big Data, DataFusion, Columnar Storage, Apache Arrow, Data Systems, Memory Format, Query Engine