
Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

This article presents a Spark acceleration framework that replaces Java-based task operators with a ClickHouse native library. Plans are converted via Protobuf and passed across JNI, and the native side uses columnar storage, SIMD, and JIT to achieve an average 2.3× speed-up on TPC-DS workloads (more than 3× on some queries), with fallback mechanisms intended to prevent performance loss.

Baidu Geek Talk

Jun 24, 2024

This article is based on a presentation from the DataFunSummit 2024 OLAP Architecture Summit, focusing on accelerating the Spark compute engine using native technologies and ClickHouse.

1. Spark Performance Optimization Background

Spark originated from UC Berkeley AMPLab and has become a leading open‑source project for offline big‑data processing, data science, machine learning, and streaming. Its execution model consists of a driver that generates logical and physical plans, splits the physical plan into stages, and dispatches thousands of tasks to executors.

Two key observations from the execution flow are:

The driver performs complex planning with negligible resource consumption.

Tasks consume the majority of CPU and memory resources, and each task executes a series of SQL operators (e.g., Scan, Agg, Shuffle) implemented in Java on the JVM.

Because JVM-based tasks cannot fully exploit SIMD instructions, low-level CPU features, or efficient code generation, the Java operator implementations are often ten or more times slower than equivalent C++ code.

2. ClickHouse Performance Advantages

ClickHouse is a mature OLAP engine that stores data column‑wise, enabling effective CPU pipeline utilization, SIMD vectorization, and aggressive loop unrolling. Columnar storage allows the CPU to prefetch data efficiently, reducing cache pressure and improving instruction‑level parallelism.

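Example SQL demonstrating columnar benefits:

select f1 + f2, f3, f4, f5, f6 * f7 from tbl

When each column is stored contiguously, the CPU can load several adjacent values of f1 and f2 into vector registers and add them with a single SIMD instruction. With four values per instruction, for example, the addition runs up to four times faster than a value-by-value loop.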
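The layout difference behind this query can be illustrated with a small sketch. It is not ClickHouse code: pure Python does not issue SIMD instructions, so each chunk of `LANES` values only stands in for one vector add. The sample rows and the lane width of four are assumptions chosen for illustration.

```python
from array import array

# Row-oriented layout: each record keeps all seven fields together,
# so reading f1 and f2 also pulls f3..f7 through the cache.
rows = [
    (1, 10, 0, 0, 0, 2, 3),
    (2, 20, 0, 0, 0, 2, 3),
    (3, 30, 0, 0, 0, 2, 3),
    (4, 40, 0, 0, 0, 2, 3),
    (5, 50, 0, 0, 0, 2, 3),
]

# Column-oriented layout: one contiguous typed buffer per column.
names = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"]
columns = {
    name: array("q", (row[i] for row in rows))
    for i, name in enumerate(names)
}

LANES = 4  # assumed vector width: four 64-bit values per instruction


def add_columns(a, b, lanes=LANES):
    """Add two columns chunk by chunk; one chunk models one SIMD add."""
    out = array("q")
    vector_ops = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        vector_ops += 1
    return out, vector_ops


# f1 + f2 touches only two contiguous buffers, never f3..f7.
f1_plus_f2, vector_ops = add_columns(columns["f1"], columns["f2"])
print(list(f1_plus_f2), vector_ops)
```

Five values with four lanes need two modeled vector operations instead of five scalar additions, which is where the "up to four-fold" figure comes from.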

3. Spark Native Acceleration Design and Implementation

The proposed solution replaces the Java‑based task logic with a ClickHouse‑based native library:

In the driver, the physical plan is transformed into a ClickHouse‑compatible execution plan.

The plan is serialized with Protobuf, shipped to the executors, and handed to the native library through JNI.

Executors load the ClickHouse dynamic library, reconstruct the corresponding C++ operators, and perform data processing.
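The three steps above can be sketched as a round trip: build a plan tree, serialize it, then rebuild and run operators from the bytes. This is a toy model under stated assumptions, not the framework's actual code. JSON stands in for Protobuf, plain Python functions stand in for the C++ operators, and the `PlanNode` shape, operator names, and arguments are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PlanNode:
    """One node of a hypothetical engine-neutral plan."""
    op: str                                  # e.g. "Scan", "Filter", "Project"
    args: dict = field(default_factory=dict)
    child: "PlanNode | None" = None


def serialize(plan):
    # Driver side. JSON is a stand-in for the Protobuf encoding.
    return json.dumps(asdict(plan)).encode("utf-8")


def deserialize(payload):
    # Executor side: rebuild the tree from the received bytes.
    def build(d):
        if d is None:
            return None
        return PlanNode(d["op"], d["args"], build(d["child"]))
    return build(json.loads(payload.decode("utf-8")))


# Toy operators working on columnar batches (dict of column name -> list).
def run_scan(args, _batch):
    return {name: list(values) for name, values in args["data"].items()}


def run_filter(args, batch):
    column, bound = batch[args["column"]], args["greater_than"]
    keep = [i for i, value in enumerate(column) if value > bound]
    return {name: [values[i] for i in keep] for name, values in batch.items()}


def run_project(args, batch):
    return {name: batch[name] for name in args["columns"]}


OPERATORS = {"Scan": run_scan, "Filter": run_filter, "Project": run_project}


def execute(node):
    # Run the child first, then apply this node's operator to its output.
    batch = execute(node.child) if node.child else None
    return OPERATORS[node.op](node.args, batch)


plan = PlanNode(
    "Project", {"columns": ["f1"]},
    PlanNode(
        "Filter", {"column": "f2", "greater_than": 15},
        PlanNode("Scan", {"data": {"f1": [1, 2, 3], "f2": [10, 20, 30]}}),
    ),
)

payload = serialize(plan)               # what would cross the JNI boundary
result = execute(deserialize(payload))  # operators rebuilt from the bytes
print(result)
```

The point of the design is that only a compact, language-neutral description of the plan crosses the JVM boundary; the operators themselves are instantiated on the native side.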

Key design principles:

Maintain full SQL‑semantic compatibility with Spark.

Leverage ClickHouse’s columnar execution, SIMD, and JIT compilation to maximize performance.

Additional components include custom SparkShuffleSink and SparkShuffleSource operators to bridge Spark’s shuffle framework with ClickHouse.
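The article does not describe how these two operators work internally. As a rough illustration of the role such a pair plays, the sketch below splits a columnar batch into per-partition blocks on the write side and concatenates blocks on the read side. The function names, the batch representation, and the modulo partitioning are assumptions; Spark's own hash partitioning applies a hash function to the key first.

```python
def shuffle_sink(batch, key, num_partitions):
    """Split one columnar batch into per-partition columnar blocks."""
    blocks = [{name: [] for name in batch} for _ in range(num_partitions)]
    for row, key_value in enumerate(batch[key]):
        # Simplification: integer key modulo partition count, no hashing.
        target = key_value % num_partitions
        for name, values in batch.items():
            blocks[target][name].append(values[row])
    return blocks


def shuffle_source(blocks_from_maps):
    """Concatenate the blocks several map tasks wrote for one partition."""
    merged = {}
    for block in blocks_from_maps:
        for name, values in block.items():
            merged.setdefault(name, []).extend(values)
    return merged


# Two map-side tasks write their output through the sink...
map_a = shuffle_sink({"id": [1, 2, 3, 4], "v": [10, 20, 30, 40]}, "id", 2)
map_b = shuffle_sink({"id": [5, 6], "v": [50, 60]}, "id", 2)

# ...and each reduce-side task reads its own partition through the source.
partition_0 = shuffle_source([map_a[0], map_b[0]])
partition_1 = shuffle_source([map_a[1], map_b[1]])
print(partition_0, partition_1)
```

Keeping the blocks columnar on both sides means the data does not have to be converted back to rows just to pass through the shuffle.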

4. Fallback Mechanism

Because not all Spark operators are yet supported in ClickHouse, a fallback strategy is employed:

Operator‑level fallback (e.g., Scan in Spark, Filter in ClickHouse).

Stage‑level fallback (switching between Spark and ClickHouse between stages).

Driver‑level fallback (entire query runs on native Spark when ClickHouse cannot execute it).

The system prefers ClickHouse for supported operators while guaranteeing that performance never degrades compared to pure Spark.
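A minimal sketch of how the three fallback levels might be decided is shown below. The supported-operator set, the function name, and the decision rules are assumptions for illustration. In particular, the sketch ignores the cost of converting data between engines, which a real implementation would need to weigh before mixing engines inside a stage.

```python
# Hypothetical list of operators the native library can run.
SUPPORTED_NATIVE = {"Scan", "Filter", "Project", "Agg"}


def assign_engines(stages, allow_operator_fallback=True):
    """Pick an engine per operator for a query given as a list of stages.

    Each stage is a list of operator names. Returns the same structure
    with (operator, engine) pairs.
    """
    all_ops = [op for stage in stages for op in stage]
    if not any(op in SUPPORTED_NATIVE for op in all_ops):
        # Driver-level fallback: nothing can run natively, keep pure Spark.
        return [[(op, "spark") for op in stage] for stage in stages]

    plan = []
    for stage in stages:
        if all(op in SUPPORTED_NATIVE for op in stage):
            plan.append([(op, "clickhouse") for op in stage])
        elif allow_operator_fallback:
            # Operator-level fallback: mix engines inside one stage.
            plan.append([
                (op, "clickhouse" if op in SUPPORTED_NATIVE else "spark")
                for op in stage
            ])
        else:
            # Stage-level fallback: the whole stage runs on Spark.
            plan.append([(op, "spark") for op in stage])
    return plan


query = [["Scan", "Filter"], ["Window", "Agg"]]
mixed = assign_engines(query)
per_stage = assign_engines(query, allow_operator_fallback=False)
pure_spark = assign_engines([["Window"], ["Sort"]])
print(mixed, per_stage, pure_spark, sep="\n")
```

In this toy query the first stage runs entirely on ClickHouse, while the second either splits per operator or drops back to Spark as a whole, depending on the flag.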

5. Performance Evaluation

Using a 10 TB TPC‑DS benchmark, the Spark‑Native solution achieved an average 2.3× speed‑up over vanilla Spark. Certain queries (e.g., Q23) saw >3× improvement, while others (e.g., Q72) fell back to Spark due to complex join patterns that are not yet optimized in ClickHouse.

Memory consumption was also reduced, further lowering operational costs.

6. Practical Usage

Users can enable the acceleration in Baidu Intelligent Cloud BMR by setting the configuration spark.sql.execution.clickhouse=true. Currently, Parquet and ORC file formats are supported.
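One common way to pass a Spark configuration is the `--conf key=value` option of `spark-submit`. The helper below builds such a command line with the key quoted above. The helper name and the job file `etl_job.py` are hypothetical, and whether BMR expects the setting at submit time or at cluster level is an assumption to check against its documentation.

```python
import shlex

NATIVE_FLAG = "spark.sql.execution.clickhouse"  # key quoted in the article


def spark_submit_command(app_path, enable_native=True, extra_conf=None):
    """Build a spark-submit argument list with the native flag set."""
    conf = dict(extra_conf or {})
    conf[NATIVE_FLAG] = "true" if enable_native else "false"
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


command = spark_submit_command("etl_job.py")  # hypothetical job file
print(shlex.join(command))
```

Setting the flag to false in the same way gives a quick A/B comparison against vanilla Spark for a given job.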
