
Boost Spark Performance with ClickHouse: Native Acceleration Techniques

This article presents a detailed technical overview of accelerating Spark's compute engine using ClickHouse as a native backend, covering Spark performance background, ClickHouse's advantages, the design and implementation of a Spark‑Native acceleration solution, and extensive performance evaluation results.

Baidu Intelligent Cloud Tech Hub

1 Spark Performance Optimization Background

Spark originated as an experimental project at UC Berkeley's AMPLab, initially targeting iterative machine-learning workloads with a distributed execution model. Contributed to Apache in 2013, it has become one of the most popular open-source big-data platforms, widely used for offline processing, data science, machine learning, and stream processing.

1.1 Spark Overview

Spark follows a driver-executor model. The driver generates a logical plan, optimizes it through several optimization phases, and produces a physical plan that is split into stages and tasks. Each task processes one data partition on an executor.

1.2 Spark SQL Overview

A typical SQL query is parsed into a logical plan, optimized, and then turned into a physical plan that executes as stages and tasks. The driver creates stages such as ShuffleMapStage and ResultStage, each containing up to thousands of tasks that execute a fixed pipeline of SQL operators (e.g., Scan, Agg, Shuffle).

The driver performs the complex planning work but consumes minimal resources (milliseconds per query).

Tasks consume the vast majority of CPU and memory, which makes task execution the natural target for optimization.

1.3 Spark Task Overview

Tasks run inside the JVM and use a row‑based execution model. This design limits performance because JVM code cannot directly use SIMD instructions, suffers from heap‑based memory management overhead, and generates less efficient machine code.

For example, a Java implementation of summing one million integers runs about 11× slower than an equivalent C++ version, which benefits from loop unrolling and SIMD.
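The example code referenced at the end of this article is the Java side of this comparison. A C++ counterpart of the hot loop might look like the following sketch (illustrative, not the linked source); compiled with -O3, the compiler unrolls and vectorizes the plain loop on its own:

// sum_ints.cpp -- a minimal sketch of the C++ side of the comparison.
// Build: g++ -O3 sum_ints.cpp -o sum_ints
#include <cstdint>
#include <cstdio>
#include <vector>

int64_t sum(const std::vector<int32_t>& v) {
    int64_t total = 0;
    for (int32_t x : v) total += x;  // unrolled and vectorized by the compiler at -O3
    return total;
}

int main() {
    std::vector<int32_t> v(1'000'000, 1);  // one million integers
    std::printf("%lld\n", static_cast<long long>(sum(v)));
    return 0;
}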

1.4 Spark Task Performance Optimization Idea

The key idea is to replace Spark's row‑oriented execution with a column‑oriented approach implemented in C++. By rewriting the Scan, Agg, and Shuffle operators in C++ and using columnar data layouts, significant speedups can be achieved.
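As a sketch of why the columnar layout matters, compare a row-oriented and a column-oriented representation of the same batch (the type names are illustrative, not Spark or ClickHouse classes):

#include <cstdint>
#include <vector>

// Row-oriented: fields of one record sit together, so summing a single
// field strides through memory and wastes cache bandwidth.
struct Row { int64_t id; double price; int32_t qty; };
using RowBatch = std::vector<Row>;

// Column-oriented: each field is contiguous, so summing one field is a
// tight, cache-friendly loop the compiler can vectorize.
struct ColumnBatch {
    std::vector<int64_t> id;
    std::vector<double>  price;
    std::vector<int32_t> qty;
};

double sumPriceRows(const RowBatch& b) {
    double s = 0;
    for (const Row& r : b) s += r.price;  // strided access
    return s;
}

double sumPriceColumns(const ColumnBatch& b) {
    double s = 0;
    for (double p : b.price) s += p;      // contiguous, vectorizable access
    return s;
}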

1.5 ClickHouse as Spark SQL Backend

ClickHouse is a mature OLAP engine with comprehensive SQL support. Most Spark SQL operators have equivalents in ClickHouse, allowing us to replace the task‑level data processing with a ClickHouse dynamic library invoked via JNI. Integration also requires handling RDD deserialization, shuffle coordination, broadcast data, and metrics.

2 ClickHouse Performance Advantages

2.1 CPU Pipeline – Foundations of Vectorization

Modern CPUs execute instructions through multiple pipeline stages (Fetch, Decode, Execute, Write-back), and throughput is highest when the pipeline stays full. Techniques such as prefetching data into caches before it is needed are therefore essential for high performance.
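As a small illustration of keeping the pipeline fed, software prefetching can hide cache-miss latency on irregular access patterns. The sketch below uses the GCC/Clang builtin __builtin_prefetch; the access pattern and prefetch distance are illustrative:

#include <cstddef>
#include <cstdint>
#include <vector>

int64_t gatherSum(const std::vector<int64_t>& table,
                  const std::vector<uint32_t>& idx) {
    constexpr std::size_t kAhead = 8;  // prefetch distance, tuned empirically
    int64_t s = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + kAhead < idx.size())
            __builtin_prefetch(&table[idx[i + kAhead]]);  // warm the cache early
        s += table[idx[i]];  // by now the line is (hopefully) already cached
    }
    return s;
}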

2.2 SIMD Introduction

SIMD (Single Instruction, Multiple Data) allows one instruction to operate on multiple data elements simultaneously: a 128-bit register holds four 32-bit integers, so a single instruction processes all four at once, giving up to a 4× speedup for vectorizable workloads. Columnar execution exposes exactly this kind of work. In the query below, the f1 + f2 and f6 * f7 expressions can each be evaluated over whole columns of values at a time:

select f1 + f2, f3, f4, f5, f6 * f7 from tbl
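As a concrete illustration of the 128-bit case, here is a hand-written SSE2 kernel for the f1 + f2 column (a sketch using compiler intrinsics; for loops this simple, engines typically rely on the compiler's auto-vectorizer instead):

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

void addColumns(const int32_t* f1, const int32_t* f2, int32_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(f1 + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(f2 + i));
        // One instruction adds four 32-bit integers at once.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(a, b));
    }
    for (; i < n; ++i) out[i] = f1[i] + f2[i];  // scalar tail
}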

2.3 Native Techniques in ClickHouse

ClickHouse leverages loop unrolling and SIMD in its C++ kernels. For a simple aggregation query, the generated assembly shows use of xmm registers and the paddq SIMD instruction, achieving roughly four times the performance of scalar code.
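The effect of unrolling can be reproduced by hand with independent accumulators, which break the loop's dependency chain so the pipeline stays full; at -O3 a loop like this compiles down to paddq over xmm registers (a sketch of the general technique, not ClickHouse's actual kernel):

#include <cstddef>
#include <cstdint>

uint64_t sumUnrolled(const uint64_t* data, std::size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // independent accumulators
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // 4-way unroll
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    uint64_t s = s0 + s1 + s2 + s3;
    for (; i < n; ++i) s += data[i];  // scalar tail
    return s;
}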

2.4 C++ Templates in ClickHouse

ClickHouse extensively uses C++ templates and CRTP (the curiously recurring template pattern) to eliminate virtual-function overhead while providing type-safe abstractions for hash tables and aggregation.
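A minimal sketch of the pattern (the class names are illustrative, not ClickHouse's actual aggregate-function hierarchy): the base class dispatches to the derived class with a compile-time cast, so the per-value hot loop has no virtual calls and can be fully inlined:

#include <cstddef>
#include <cstdint>

template <typename Derived>
struct AggregateBase {
    void addBatch(const int64_t* values, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            static_cast<Derived*>(this)->add(values[i]);  // resolved at compile time
    }
};

struct SumAggregate : AggregateBase<SumAggregate> {
    int64_t state = 0;
    void add(int64_t v) { state += v; }
};

struct MaxAggregate : AggregateBase<MaxAggregate> {
    int64_t state = INT64_MIN;
    void add(int64_t v) { if (v > state) state = v; }
};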

3 Spark Native Acceleration Design and Implementation

3.1 Design Principles

Maintain full compatibility with vanilla Spark (identical SQL semantics, minimal user configuration).

Maximize performance gains by minimizing data format conversions and keeping most processing in ClickHouse.

3.2 Execution Plan Integration

The driver generates a physical plan, which is serialized (via Protobuf) and sent to ClickHouse through JNI. Each Spark stage is represented by a single ClickHouseRDD that encapsulates all SQL operators for that stage.
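A sketch of what the native entry point might look like on the ClickHouse side. The Java class, method, and helper names here are hypothetical, since the article does not give the real symbols:

#include <jni.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical native-side function: parses the Protobuf-serialized physical
// plan and builds a ClickHouse query pipeline for the stage, returning a handle.
static int64_t executeSerializedPlan(const uint8_t* data, std::size_t len) {
    // ... parse plan, build pipeline, register it, return its handle ...
    (void)data; (void)len;
    return 0;
}

extern "C" JNIEXPORT jlong JNICALL
Java_org_example_ClickHouseRDD_nativeExecute(JNIEnv* env, jobject /*self*/,
                                             jbyteArray plan) {
    jsize len = env->GetArrayLength(plan);
    std::vector<uint8_t> buf(static_cast<std::size_t>(len));
    env->GetByteArrayRegion(plan, 0, len, reinterpret_cast<jbyte*>(buf.data()));
    // Hand the serialized stage plan to the native engine; the task later
    // pulls result blocks through the returned handle.
    return static_cast<jlong>(executeSerializedPlan(buf.data(), buf.size()));
}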

3.3 SQL Semantic Compatibility – Decimal Types

Spark and ClickHouse differ in precision/scale rules for Decimal arithmetic. The native solution adjusts ClickHouse's type‑derivation logic to match Spark's semantics without sacrificing performance.
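As one example of the kind of rule involved, Spark derives the result type of a Decimal addition roughly as follows (a sketch of Spark's documented precision/scale rule, capped at 38 digits; the article does not spell out exactly which rules the native side adjusts):

#include <algorithm>

struct DecimalType { int precision; int scale; };

// Result type of a + b in Spark semantics:
//   scale     = max(s1, s2)
//   precision = max(p1 - s1, p2 - s2) + scale + 1   (the +1 absorbs a carry)
DecimalType addResultType(DecimalType a, DecimalType b) {
    int scale = std::max(a.scale, b.scale);
    int intDigits = std::max(a.precision - a.scale, b.precision - b.scale);
    int precision = std::min(intDigits + scale + 1, 38);  // cap at Decimal(38)
    return {precision, scale};
}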

3.4 Shuffle Framework Compatibility

ClickHouse implements SparkShuffleSink and SparkShuffleSource operators to handle map‑side partitioning, index/file generation, and shuffle reads, reusing Spark's existing shuffle infrastructure.
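A sketch of the map-side half of this design: route each row to a per-partition buffer, then derive the per-partition offsets that go into the index file Spark's shuffle readers expect (names, layout, and the hash partitioner are illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

struct PartitionedOutput {
    std::vector<std::vector<uint8_t>> buffers;  // one buffer per reduce partition

    explicit PartitionedOutput(std::size_t numPartitions) : buffers(numPartitions) {}

    // Map-side partitioning: hash the key and append the serialized row.
    void append(int64_t key, const uint8_t* row, std::size_t rowSize) {
        std::size_t p = static_cast<uint64_t>(key) % buffers.size();
        buffers[p].insert(buffers[p].end(), row, row + rowSize);
    }

    // Offsets of each partition within the concatenated data file; this is
    // the information the shuffle index file records.
    std::vector<uint64_t> indexOffsets() const {
        std::vector<uint64_t> offsets{0};
        for (const auto& b : buffers) offsets.push_back(offsets.back() + b.size());
        return offsets;
    }
};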

3.5 Performance Optimization – Conditional Join

For semi‑joins with inequality predicates, the solution extends ClickHouse's codegen to evaluate expressions directly on column indices, avoiding costly Cartesian product materialization.
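A sketch of the idea under assumed structure (not ClickHouse's actual join code): probe the hash table on the equality key, then evaluate the inequality predicate directly against build-side column values, stopping at the first match instead of materializing the cross product:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Build side: equality key -> indices into the build-side value column.
using HashTable = std::unordered_map<int64_t, std::vector<uint32_t>>;

// Left semi-join with the extra predicate probeVal > buildVal.
std::vector<uint32_t> semiJoin(const std::vector<int64_t>& probeKey,
                               const std::vector<int64_t>& probeVal,
                               const HashTable& ht,
                               const std::vector<int64_t>& buildVal) {
    std::vector<uint32_t> matched;  // probe rows that survive the semi-join
    for (uint32_t i = 0; i < probeKey.size(); ++i) {
        auto it = ht.find(probeKey[i]);
        if (it == ht.end()) continue;
        for (uint32_t j : it->second) {
            if (probeVal[i] > buildVal[j]) {  // inequality evaluated on columns
                matched.push_back(i);
                break;  // semi-join: one qualifying match suffices
            }
        }
    }
    return matched;
}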

3.6 Fallback Mechanism

When an operator or stage cannot be accelerated, three fallback strategies are employed: Task‑level fallback, Stage‑level fallback, and Driver‑level fallback, ensuring that Spark's original performance is never degraded.

4 Acceleration Effect Analysis

4.1 TPC‑DS Performance

On a 10 TB TPC-DS benchmark, Spark Native achieved a 2.3× overall speedup over vanilla Spark, cutting runtime, and therefore cluster cost, by more than half.

4.2 Detailed Q23 Analysis

Q23 showed a 3.5× improvement due to faster scans, vectorized joins, and columnar shuffle writes.

4.3 Q72 Analysis

Q72 did not benefit: it contains many consecutive joins that are better served by Spark's Whole-Stage Codegen, so the query falls back at the driver level to vanilla Spark.

4.4 Try It Yourself

Users can enable the feature on Baidu Cloud BMR by setting spark.sql.execution.clickhouse=true. Currently only Parquet and ORC data sources are supported.

References

Example code (Java integer-sum benchmark): https://github.com/copperybean/study-codes/blob/main/java/sum-ints.java

Computer Systems: A Programmer's Perspective, Figure 6.23

Loop unrolling: https://en.wikipedia.org/wiki/Loop_unrolling

ClickHouse JIT blog: https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit

Q23 SQL source: https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q23b.sql
