
Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

This article presents a Spark acceleration framework that replaces Java-based task operators with a ClickHouse native library. Execution plans are serialized with Protobuf and handed to the native code through JNI, and the library uses columnar storage, SIMD, and JIT compilation to reach an average 2.3× speed-up on TPC-DS, with some queries exceeding 3×. Fallback mechanisms return unsupported work to Spark so that performance does not regress.

Baidu Geek Talk

This article is based on a presentation from the DataFunSummit 2024 OLAP Architecture Summit, focusing on accelerating the Spark compute engine using native technologies and ClickHouse.

1. Spark Performance Optimization Background

Spark originated from UC Berkeley AMPLab and has become a leading open‑source project for offline big‑data processing, data science, machine learning, and streaming. Its execution model consists of a driver that generates logical and physical plans, splits the physical plan into stages, and dispatches thousands of tasks to executors.

Two key observations from the execution flow are:

The driver performs complex planning with negligible resource consumption.

Tasks consume the majority of CPU and memory resources, and each task executes a series of SQL operators (e.g., Scan, Agg, Shuffle) implemented in Java on the JVM.

Because JVM-based operators cannot fully exploit SIMD instructions, other low-level CPU features, or efficient native code generation, the talk reports that Java implementations are often more than 10 times slower than equivalent C++ code.

2. ClickHouse Performance Advantages

ClickHouse is a mature OLAP engine that stores data column‑wise, enabling effective CPU pipeline utilization, SIMD vectorization, and aggressive loop unrolling. Columnar storage allows the CPU to prefetch data efficiently, reducing cache pressure and improving instruction‑level parallelism.

Example SQL demonstrating columnar benefits:

select f1 + f2, f3, f4, f5, f6 * f7 from tbl

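When each column is stored contiguously, the CPU can load several adjacent values of f1 and f2 into vector registers and add them with a single SIMD instruction. With four values per instruction, for example, the addition can run up to four times faster than adding one pair of values at a time.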
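To make the layout difference concrete, here is a minimal sketch in plain Python. It is not ClickHouse code: the sample rows, the `add_columns` helper, and the lane width of four are assumptions chosen for illustration, and each chunk of the loop merely stands in for one SIMD add.

```python
from array import array

# Row-oriented layout: each record keeps its seven fields (f1..f7) together.
rows = [(i, 2 * i, 0, 0, 0, 3, i) for i in range(8)]

# Column-oriented layout: each field lives in its own contiguous array.
f1 = array("q", (r[0] for r in rows))
f2 = array("q", (r[1] for r in rows))

LANES = 4  # assumed SIMD width: four 64-bit values per instruction


def add_columns(a, b, lanes=LANES):
    """Add two columns chunk by chunk; each chunk stands in for one SIMD add."""
    out = array("q")
    simd_ops = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        simd_ops += 1
    return out, simd_ops


result, simd_ops = add_columns(f1, f2)
scalar_ops = len(rows)  # a row-at-a-time engine issues one add per row
```

In this toy setup the eight rows need eight scalar adds but only two four-lane adds, which is where the idealized four-fold figure comes from. Real gains depend on the data type, the vector width of the CPU, and memory bandwidth.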

3. Spark Native Acceleration Design and Implementation

The proposed solution replaces the Java‑based task logic with a ClickHouse‑based native library:

In the driver, the physical plan is transformed into a ClickHouse‑compatible execution plan.

The plan is serialized with Protobuf, shipped to the executors, and passed to the native library through JNI.

Executors load the ClickHouse dynamic library, reconstruct the corresponding C++ operators, and perform data processing.
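The three steps above can be sketched as a serialize-then-rebuild round trip. This is only an illustration under stated assumptions: JSON stands in for Protobuf, plain Python functions stand in for the C++ operators behind JNI, and the plan shape, the `serialize_plan` and `execute_plan` helpers, and the sample table are all hypothetical.

```python
import json

# Hypothetical physical plan as the driver might see it (first operator runs first).
physical_plan = [
    {"op": "Scan", "columns": ["f1", "f2", "f3"]},
    {"op": "Filter", "column": "f3", "min": 10},
    {"op": "Project", "sum_of": ["f1", "f2"], "alias": "total"},
]


def serialize_plan(plan):
    """Driver side: encode the plan as bytes (JSON here, Protobuf in the article)."""
    return json.dumps(plan).encode("utf-8")


def scan(node, batch):
    # Keep only the requested columns.
    return {name: batch[name] for name in node["columns"]}


def filter_rows(node, batch):
    # Keep the row positions whose filter column is at least the threshold.
    keep = [i for i, v in enumerate(batch[node["column"]]) if v >= node["min"]]
    return {name: [col[i] for i in keep] for name, col in batch.items()}


def project(node, batch):
    # Add two columns element-wise and emit the result under the alias.
    left, right = (batch[name] for name in node["sum_of"])
    return {node["alias"]: [a + b for a, b in zip(left, right)]}


OPERATORS = {"Scan": scan, "Filter": filter_rows, "Project": project}


def execute_plan(payload, table):
    """Executor side: decode the bytes, rebuild the operators, run them in order."""
    batch = table
    for node in json.loads(payload.decode("utf-8")):
        batch = OPERATORS[node["op"]](node, batch)
    return batch


table = {"f1": [1, 2, 3], "f2": [10, 20, 30], "f3": [5, 10, 15], "f4": [0, 0, 0]}
payload = serialize_plan(physical_plan)
output = execute_plan(payload, table)
```

The point of the design is that only the plan description crosses the language boundary; the operators themselves are rebuilt on the native side from that description.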

Key design principles:

Maintain full SQL‑semantic compatibility with Spark.

Leverage ClickHouse’s columnar execution, SIMD, and JIT compilation to maximize performance.

Additional components include custom SparkShuffleSink and SparkShuffleSource operators to bridge Spark’s shuffle framework with ClickHouse.

4. Fallback Mechanism

Because not all Spark operators are yet supported in ClickHouse, a fallback strategy is employed:

Operator‑level fallback (e.g., Scan in Spark, Filter in ClickHouse).

Stage‑level fallback (switching between Spark and ClickHouse between stages).

Driver-level fallback (the entire query runs on vanilla Spark when ClickHouse cannot execute it).

The system prefers ClickHouse for supported operators while guaranteeing that performance never degrades compared to pure Spark.
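A rough sketch of how such a decision could be made is shown below. The supported-operator set, the function names, and in particular the `max_transitions` threshold are assumptions for illustration; the article does not describe the actual rule used to decide when a whole query returns to Spark.

```python
# Hypothetical set of operators the native library can run; not the real support list.
NATIVE_SUPPORTED = {"Filter", "Project", "Agg", "Shuffle"}


def assign_engines(operators, supported=NATIVE_SUPPORTED):
    """Operator-level fallback: pick an engine for each operator."""
    return [(op, "clickhouse" if op in supported else "spark") for op in operators]


def count_transitions(assignment):
    """Count engine switches; each one implies a data hand-off between runtimes."""
    engines = [engine for _, engine in assignment]
    return sum(1 for a, b in zip(engines, engines[1:]) if a != b)


def plan_query(operators, max_transitions=2):
    """Run the whole query on Spark when native execution looks unprofitable."""
    assignment = assign_engines(operators)
    no_native = all(engine == "spark" for _, engine in assignment)
    if no_native or count_transitions(assignment) > max_transitions:
        return [(op, "spark") for op in operators]
    return assignment


mixed = plan_query(["Scan", "Filter", "Agg"])
choppy = plan_query(["Scan", "Filter", "Window", "Agg", "Sort", "Project"])
```

Here `mixed` keeps Scan on Spark and runs Filter and Agg natively, while `choppy` alternates engines so often that the sketch sends every operator back to Spark. Bounding the number of hand-offs is one plausible way to keep conversion overhead from erasing the native speed-up.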

5. Performance Evaluation

Using a 10 TB TPC‑DS benchmark, the Spark‑Native solution achieved an average 2.3× speed‑up over vanilla Spark. Certain queries (e.g., Q23) saw >3× improvement, while others (e.g., Q72) fell back to Spark due to complex join patterns that are not yet optimized in ClickHouse.

Memory consumption was also reduced, further lowering operational costs.

6. Practical Usage

Users can enable the acceleration in Baidu Intelligent Cloud BMR by setting the configuration spark.sql.execution.clickhouse=true. Currently, Parquet and ORC file formats are supported.
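As a small illustration, the snippet below assembles a spark-submit command line that carries the property through the standard `--conf key=value` option. It assumes the job is launched with spark-submit and that BMR accepts the property this way; the `build_submit_command` helper and the `etl_job.py` script name are hypothetical.

```python
import shlex


def build_submit_command(app, extra_conf=None):
    """Build a spark-submit argument list with the acceleration switch enabled."""
    conf = {"spark.sql.execution.clickhouse": "true"}  # property named in the article
    conf.update(extra_conf or {})
    command = ["spark-submit"]
    for key, value in sorted(conf.items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app)  # hypothetical application script
    return command


command = build_submit_command("etl_job.py")
print(shlex.join(command))
```

The same property could also be set wherever the cluster keeps its Spark defaults, if the platform exposes them.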
