
Why DataFusion is Revolutionizing Big Data Queries with Rust and Arrow

This article introduces DataFusion, a high‑performance, Rust‑based query engine that leverages Apache Arrow’s columnar memory format for fast, extensible data processing across multiple storage formats and cloud sources. It explains DataFusion’s architecture and execution model, and provides practical Rust code examples for custom extensions.

360 Zhihui Cloud Developer

Introduction

DataFusion is a query engine that does not store data itself. Because it is decoupled from the underlying storage format, it remains flexible and extensible. It natively supports CSV, Parquet, Avro, and JSON, and can read from local files as well as object stores such as AWS S3, Azure Blob Storage, and Google Cloud Storage. It also provides rich extension interfaces for custom data formats and sources.

Key Features

High performance: built in Rust, with no garbage collector, development productivity comparable to Java or Go, and runtime performance close to C++; uses the Arrow columnar memory model for vectorized computation.

Easy integration: part of the Apache Arrow ecosystem (Arrow, Parquet, Flight) and works well with other big‑data components.

Simple integration and customization: users can extend scalar/aggregate/window functions, data sources, SQL, other query languages, custom plans, execution nodes, optimizer processes, etc.

Purpose of Introducing DataFusion in Qilin Data Warehouse

Leverage Rust’s performance and Arrow’s columnar in‑memory format to make Qilin the preferred query engine for databases, data‑frame libraries, and machine‑learning systems.

Use DataFusion’s efficient, flexible user interface to customize data sources, implement inverted indexes, and add custom query functions for acceleration.

Benefit from DataFusion’s vectorized execution to improve overall performance.

Why Rust?

Rust has ranked among the most loved languages in developer surveys for several consecutive years. It offers memory safety without a garbage collector, performance comparable to C++, and direct interoperability with C/C++ code.

Rust’s safety comes from its ownership system and borrow checker, which enforce memory safety at compile time, and from Option&lt;T&gt;, which eliminates null‑pointer dereferences by making absence an explicit, typed value. Because these guarantees are checked at compile time, they are “zero‑cost abstractions” with no runtime overhead.
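As a self‑contained illustration (not DataFusion code; the function name is hypothetical), the sketch below shows how Option&lt;T&gt; makes absence explicit instead of using a null pointer:

```rust
// Illustrative sketch: Option<T> encodes "may be absent" in the type,
// so the compiler forces callers to handle the missing case.
fn lookup_port(service: &str) -> Option<u16> {
    match service {
        "http" => Some(80),
        "https" => Some(443),
        _ => None, // absence is an explicit, typed value, not a null pointer
    }
}

// Callers must pattern-match (or use combinators like unwrap_or)
// before touching the value, so the "forgot the null check" bug
// class cannot even compile.
fn describe(service: &str) -> String {
    match lookup_port(service) {
        Some(p) => format!("{} -> {}", service, p),
        None => format!("{} -> unknown", service),
    }
}
```

Trying to use the `u16` inside the `Option` without unwrapping it is a compile‑time error, which is exactly the guarantee a null‑pointer‑based API cannot give.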

Apache Arrow Overview

Apache Arrow, launched by Wes McKinney in 2016, addresses the lack of a unified in‑memory data format, inefficient use of modern hardware, and excessive memory copies in big‑data pipelines.

Arrow provides a columnar, in‑memory data structure that enables zero‑copy shared memory, efficient file format reading/writing (CSV, ORC, Parquet), and high‑performance analytics.

Traditional row‑oriented memory layouts waste CPU cache and hinder SIMD acceleration, while Arrow’s columnar layout reduces cache misses and enables vectorized processing.
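The layout difference can be sketched in plain Rust (a toy model, not Arrow’s actual representation): the same records stored as an array of structs (row layout) versus a struct of arrays (columnar layout).

```rust
// Toy model of the two memory layouts described above.
#[allow(dead_code)]
struct RowBatch {
    // Row layout: id, price, in_stock are interleaved per record.
    rows: Vec<(u64, f64, bool)>,
}

#[allow(dead_code)]
struct ColumnBatch {
    // Columnar layout: each field is one contiguous array.
    ids: Vec<u64>,
    prices: Vec<f64>,
    in_stock: Vec<bool>,
}

// Row layout: summing one field still drags the other, unneeded
// fields through the CPU cache, because they are interleaved.
fn sum_prices_rows(b: &RowBatch) -> f64 {
    b.rows.iter().map(|r| r.1).sum()
}

// Columnar layout: `prices` is contiguous, so the scan touches only
// the bytes it needs and the tight loop is friendly to SIMD.
fn sum_prices_columns(b: &ColumnBatch) -> f64 {
    b.prices.iter().sum()
}
```

Both functions compute the same answer; the columnar version simply reads far less memory per row scanned, which is the effect Arrow exploits at scale.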

DataFusion Architecture

DataFusion consists of:

SQL parsing (using sqlparser) to produce an abstract syntax tree, which is transformed into logical plans and expressions.

Logical plan optimization (AnalyzerRules, OptimizerRules) for projection push‑down, filter push‑down, etc.

Physical planning that converts logical plans into execution plans, applying physical optimizer rules such as sort order and join selection.

Execution using the Apache Arrow memory format; execution plans produce one or more partitions as streams (e.g., SendableRecordBatchStream), with parallelism managed by Tokio.
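To make the planning stages concrete, here is a deliberately tiny, hypothetical logical‑plan type with one optimizer rule, projection push‑down; DataFusion’s real LogicalPlan and OptimizerRule types are far richer, but the shape of the rewrite is the same:

```rust
// Toy logical plan, loosely mirroring the stages described above.
#[derive(Debug, PartialEq)]
enum Plan {
    // Scan a table, reading the named columns.
    Scan { table: String, columns: Vec<String> },
    // Keep only the named columns from the child plan.
    Projection { columns: Vec<String>, input: Box<Plan> },
}

// Miniature optimizer rule: push a projection down into the scan so the
// data source reads only the columns the query actually needs.
fn pushdown_projection(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { columns, input } => match *input {
            // Projection directly over a scan: fold it into the scan.
            Plan::Scan { table, .. } => Plan::Scan { table, columns },
            // Otherwise keep the projection and recurse into the child.
            other => Plan::Projection {
                columns,
                input: Box::new(pushdown_projection(other)),
            },
        },
        other => other,
    }
}
```

After the rewrite, `Projection(cols, Scan(all))` becomes `Scan(cols)`, so the storage layer never materializes the unused columns; DataFusion’s OptimizerRules apply the same idea (plus filter push‑down, and more) to its real plans.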

DataFusion Execution Engine Features

Streaming execution: operators output Arrow arrays as incremental RecordBatches.

Parallel execution: each ExecutionPlan runs on multiple streams, coordinated via hash joins, repartitioning, and Tokio tasks.

Thread scheduling with Tokio async runtime.

Memory management via MemoryPool (GreedyPool or FairPool) to track and share memory across concurrent queries.

Cache management for object‑store listings and file metadata.

Extensibility: custom data sources, catalogs, schemas, query languages, user‑defined functions, custom optimizer and planner rules, and custom file formats or object stores.
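The streaming and parallel execution ideas above can be sketched in plain Rust, using OS threads in place of Tokio tasks and `Vec` chunks in place of Arrow RecordBatches (a toy model, not DataFusion’s engine):

```rust
use std::thread;

// Each "partition" runs on its own thread (standing in for a Tokio task)
// and consumes its input batch by batch (standing in for a stream of
// RecordBatches); the driver merges the per-partition partial results.
fn parallel_sum(partitions: Vec<Vec<i64>>, batch_size: usize) -> i64 {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|part| {
            thread::spawn(move || {
                // Stream the partition in fixed-size batches, keeping only
                // a running aggregate in memory, never the whole input.
                part.chunks(batch_size)
                    .map(|batch| batch.iter().sum::<i64>())
                    .sum::<i64>()
            })
        })
        .collect();
    // Merge phase: combine the partial aggregates from all partitions.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

The per‑batch aggregation is what keeps memory bounded during streaming execution, and the per‑partition threads are the analogue of DataFusion running one stream per partition on the Tokio runtime.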

Practical Usage

DataFusion CLI

An interactive command‑line tool that runs SQL queries against any supported data file (CSV, Parquet, JSON, Arrow, Avro) from local paths or remote locations such as S3.

Rust Programming Example

Add dependencies:

```toml
[dependencies]
datafusion = "40.0.0"
tokio = { version = "1.0", features = ["rt-multi-thread"] }
```

1. Execute a SQL query directly:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // register the table
    let ctx = SessionContext::new();
    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;

    // create a plan to run a SQL query
    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;

    // execute and print results
    df.show().await?;
    Ok(())
}
```

2. Use the DataFrame API:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
    let df = df.filter(col("a").lt_eq(col("b")))?
               .aggregate(vec![col("a")], vec![min(col("b"))])?
               .limit(0, Some(100))?;
    df.show().await?;
    Ok(())
}
```

DataFrames represent a set of rows with named columns, similar to Pandas or Spark DataFrames, and are built via SessionContext methods such as read_csv, then transformed with operations like filter, select, aggregate, and limit.

Custom Data Source Implementation

To add a custom data source, implement the TableProvider trait, in particular the scan method, which returns an ExecutionPlan. The execution plan must be able to produce a stream of RecordBatch objects, and the provider can optionally override supports_filters_pushdown to accept filter push‑down.

```rust
/// Trait that a custom data source must implement.
/// (In DataFusion the trait is annotated with #[async_trait]
/// so that `scan` can be an async method.)
#[async_trait]
pub trait TableProvider: Sync + Send {
    // ... other methods ...
    async fn scan(
        &self,
        state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>>;
    // ...
}

impl ExecutionPlan for CustomExec {
    // ... other required methods ...
    fn execute(
        &self,
        _partition: usize,
        _context: Arc<TaskContext>,
    ) -> Result<SendableRecordBatchStream> {
        // ... implementation: return a stream of RecordBatches ...
    }
}
```
Tags: Big Data, Rust, Apache Arrow, Query Engine, DataFusion
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
