Boost Spark Performance with ClickHouse: Native Acceleration Techniques
This article presents a detailed technical overview of accelerating Spark's compute engine using ClickHouse as a native backend, covering Spark performance background, ClickHouse's advantages, the design and implementation of a Spark‑Native acceleration solution, and extensive performance evaluation results.
1 Spark Performance Optimization Background
Spark originated as an experimental project at UC Berkeley's AMPLab, initially targeting iterative machine-learning workloads in a distributed setting. Contributed to Apache in 2013, it has become one of the most popular open-source big-data platforms, widely used for offline processing, data science, machine learning, and streaming.
1.1 Spark Overview
Spark follows a driver‑executor model. The driver generates a logical plan, optimizes it through multiple stages, and produces a physical plan that is split into stages and tasks. Each task processes a data partition on an executor.
1.2 Spark SQL Overview
A typical SQL query is parsed into a logical plan, optimized, and then turned into a physical plan consisting of stages and tasks. The driver creates stages such as ShuffleMapStage and ResultStage, each containing up to thousands of tasks that execute a fixed pipeline of SQL operators (e.g., Scan, Agg, Shuffle).
The driver performs the complex planning work but consumes minimal resources (milliseconds per query).
Tasks consume the vast majority of CPU and memory, making task execution the natural optimization target.
1.3 Spark Task Overview
Tasks run inside the JVM and use a row‑based execution model. This design limits performance because JVM code cannot directly use SIMD instructions, suffers from heap‑based memory management overhead, and generates less efficient machine code.
For example, a Java implementation of a one‑million‑integer sum is about 11× slower than an equivalent C++ version that benefits from loop unrolling and SIMD.
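The gap comes from how well a tight loop over a flat array compiles. A minimal C++ sketch of the summation (the function name `sum_ints` is illustrative, not taken from the linked repository); with `-O2`/`-O3` a modern compiler typically unrolls this loop and emits SIMD additions, which is the source of the cited ~11× advantage over a row-at-a-time JVM version:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sum one million 32-bit integers into a 64-bit accumulator.
// The loop body is branch-free and data-parallel, so the optimizer
// can unroll it and vectorize the additions.
int64_t sum_ints(const std::vector<int32_t>& v) {
    int64_t total = 0;
    for (int32_t x : v) total += x;  // tight accumulation loop
    return total;
}
```

For example, summing a vector of one million ones returns 1000000.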
1.4 Spark Task Performance Optimization Idea
The key idea is to replace Spark's row‑oriented execution with a column‑oriented approach implemented in C++. By rewriting the Scan, Agg, and Shuffle operators in C++ and using columnar data layouts, significant speedups can be achieved.
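The layout difference can be made concrete. In the hypothetical sketch below (the `Row` struct and function names are illustrative), the row version touches 16 bytes per row to use 4, while the columnar version reads a dense, contiguous array that auto-vectorizes cleanly:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row layout: summing one field strides over unrelated fields,
// wasting cache bandwidth (4 useful bytes per 16 loaded).
struct Row { int32_t a; int32_t b; int32_t c; int32_t d; };

int64_t sum_rows(const std::vector<Row>& rows) {
    int64_t total = 0;
    for (const Row& r : rows) total += r.a;
    return total;
}

// Columnar layout: field `a` is contiguous, so every cache line
// is fully used and the loop vectorizes without gather steps.
int64_t sum_column(const std::vector<int32_t>& col_a) {
    int64_t total = 0;
    for (int32_t x : col_a) total += x;
    return total;
}
```

Both compute the same result; the columnar form is simply a far better fit for the hardware.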
1.5 ClickHouse as Spark SQL Backend
ClickHouse is a mature OLAP engine with comprehensive SQL support. Most Spark SQL operators have equivalents in ClickHouse, allowing us to replace the task‑level data processing with a ClickHouse dynamic library invoked via JNI. Integration also requires handling RDD deserialization, shuffle coordination, broadcast data, and metrics.
2 ClickHouse Performance Advantages
2.1 CPU Pipeline – Foundations of Vectorization
Modern CPUs execute instructions through multiple pipeline stages (Fetch, Decode, Execute, Write‑back). Fully utilizing the pipeline maximizes throughput. Techniques such as prefetching and preloading data into caches are essential for high performance.
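One concrete prefetching idiom, sketched here under the assumption of a GCC/Clang toolchain (`__builtin_prefetch` is a compiler builtin, not standard C++), is to request the next element of a pointer-chased structure while processing the current one, so the memory load overlaps useful work instead of stalling the pipeline:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Walk an index chain (next[i] gives the following element, -1 ends it),
// prefetching the next value so its cache miss overlaps with the
// processing of the current element.
int64_t sum_chain(const std::vector<int32_t>& values,
                  const std::vector<int32_t>& next) {
    int64_t total = 0;
    int32_t i = 0;
    while (i >= 0) {
        int32_t n = next[i];
        if (n >= 0) __builtin_prefetch(&values[n]);  // hint: warm the cache
        total += values[i];
        i = n;
    }
    return total;
}
```

The result is identical with or without the hint; only the pipeline utilization changes.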
2.2 SIMD Introduction
SIMD (Single Instruction Multiple Data) allows a single instruction to operate on multiple data elements simultaneously (e.g., 128‑bit registers can process four 32‑bit integers at once), providing up to 4× speedup for vectorizable workloads.
For example, the per-column arithmetic in a query like the following vectorizes naturally:

select f1 + f2, f3, f4, f5, f6 * f7 from tbl

2.3 Native Techniques in ClickHouse
ClickHouse leverages loop unrolling and SIMD in its C++ kernels. For a simple aggregation query, the generated assembly shows use of xmm registers and the paddq SIMD instruction, achieving roughly four times the performance of scalar code.
2.4 C++ Templates in ClickHouse
ClickHouse extensively uses C++ templates and CRTP to eliminate virtual function overhead while providing type‑safe abstractions for hash tables and aggregation.
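The CRTP pattern itself is easy to show in miniature (the `AggregateBase`/`SumAgg` names below are illustrative, not ClickHouse's actual class hierarchy): the base class calls into the derived class through a `static_cast`, so dispatch is resolved at compile time and the per-row call can be inlined, with no vtable lookup in the hot loop:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// CRTP base: `Derived` supplies init() and merge(); the driver loop in
// run() is shared, yet each instantiation compiles to a devirtualized,
// inlinable loop specialized for that aggregate.
template <typename Derived>
struct AggregateBase {
    int64_t run(const int32_t* data, size_t n) {
        int64_t state = static_cast<Derived*>(this)->init();
        for (size_t i = 0; i < n; ++i)
            state = static_cast<Derived*>(this)->merge(state, data[i]);
        return state;
    }
};

struct SumAgg : AggregateBase<SumAgg> {
    int64_t init() { return 0; }
    int64_t merge(int64_t s, int32_t x) { return s + x; }
};

struct MaxAgg : AggregateBase<MaxAgg> {
    int64_t init() { return std::numeric_limits<int64_t>::min(); }
    int64_t merge(int64_t s, int32_t x) { return x > s ? x : s; }
};
```

Compare this with a virtual `merge` per row: the CRTP version keeps type safety while letting the compiler unroll and vectorize the specialized loop.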
3 Spark Native Acceleration Design and Implementation
3.1 Design Principles
Maintain full compatibility with vanilla Spark (identical SQL semantics, minimal user configuration).
Maximize performance gains by minimizing data format conversions and keeping most processing in ClickHouse.
3.2 Execution Plan Integration
The driver generates a physical plan, which is serialized (via Protobuf) and sent to ClickHouse through JNI. Each Spark stage is represented by a single ClickHouseRDD that encapsulates all SQL operators for that stage.
3.3 SQL Semantic Compatibility – Decimal Types
Spark and ClickHouse differ in precision/scale rules for Decimal arithmetic. The native solution adjusts ClickHouse's type‑derivation logic to match Spark's semantics without sacrificing performance.
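As a concrete reference point, the sketch below encodes Spark's standard result-type rules for decimal addition and multiplication (before Spark's optional precision-loss adjustment; the struct and function names are illustrative). It is these derivations that the native solution makes ClickHouse reproduce:

```cpp
#include <algorithm>
#include <cassert>

struct DecimalType { int precision; int scale; };

// Spark-style addition: scale = max(s1, s2),
// precision = max(p1 - s1, p2 - s2) + max(s1, s2) + 1, capped at 38.
DecimalType add_result(DecimalType a, DecimalType b) {
    constexpr int kMax = 38;
    int scale = std::max(a.scale, b.scale);
    int precision =
        std::max(a.precision - a.scale, b.precision - b.scale) + scale + 1;
    return { std::min(precision, kMax), std::min(scale, kMax) };
}

// Spark-style multiplication: precision = p1 + p2 + 1, scale = s1 + s2,
// both capped at the 38-digit maximum.
DecimalType multiply_result(DecimalType a, DecimalType b) {
    constexpr int kMax = 38;
    return { std::min(a.precision + b.precision + 1, kMax),
             std::min(a.scale + b.scale, kMax) };
}
```

For example, multiplying two `Decimal(10, 2)` values yields `Decimal(21, 4)`, while adding them yields `Decimal(11, 2)`.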
3.4 Shuffle Framework Compatibility
ClickHouse implements SparkShuffleSink and SparkShuffleSource operators to handle map‑side partitioning, index/file generation, and shuffle reads, reusing Spark's existing shuffle infrastructure.
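The map-side half of that work reduces to assigning each row of the key column to a reduce partition. A minimal sketch (the function `assign_partitions` is hypothetical, and a simple multiplicative hash stands in for Spark's Murmur3-based partitioner):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For each key, compute the reduce partition it belongs to. A shuffle
// sink operator uses this assignment to group rows into per-partition
// blocks before writing the data and index files Spark's readers expect.
std::vector<uint32_t> assign_partitions(const std::vector<int64_t>& keys,
                                        uint32_t num_partitions) {
    std::vector<uint32_t> out;
    out.reserve(keys.size());
    for (int64_t k : keys) {
        // Fibonacci-style multiplicative hash; stand-in for Murmur3.
        uint64_t h = static_cast<uint64_t>(k) * 0x9E3779B97F4A7C15ull;
        out.push_back(static_cast<uint32_t>(h % num_partitions));
    }
    return out;
}
```

The essential property is that equal keys always land in the same partition, so each reducer sees every row for its keys.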
3.5 Performance Optimization – Conditional Join
For semi‑joins with inequality predicates, the solution extends ClickHouse's codegen to evaluate expressions directly on column indices, avoiding costly Cartesian product materialization.
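The shape of that evaluation can be sketched as follows (a simplified stand-in, not the actual codegen: `semi_join_lt` is hypothetical and hard-codes one `<` predicate). Each left row probes the build side directly and short-circuits on the first match, so no cross-product rows are ever materialized:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Left-semi join with the non-equi predicate left[i] < right[j]:
// emit each left row index at most once, stopping at the first match
// instead of building the Cartesian product.
std::vector<size_t> semi_join_lt(const std::vector<int32_t>& left,
                                 const std::vector<int32_t>& right) {
    std::vector<size_t> matched;
    for (size_t i = 0; i < left.size(); ++i)
        for (int32_t r : right)
            if (left[i] < r) { matched.push_back(i); break; }  // short-circuit
    return matched;
}
```

The described codegen extension generalizes this idea: the compiled predicate reads column values by index, keeping the probe loop tight and allocation-free.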
3.6 Fallback Mechanism
When an operator or stage cannot be accelerated, three fallback strategies are employed: Task‑level fallback, Stage‑level fallback, and Driver‑level fallback, ensuring that Spark's original performance is never degraded.
4 Acceleration Effect Analysis
4.1 TPC‑DS Performance
On a 10 TB TPC‑DS benchmark, Spark Native achieved a 2.3× overall speedup compared to vanilla Spark, cutting runtime to less than half on the same cluster (or, equivalently, allowing a correspondingly smaller cluster for the same runtime).
4.2 Detailed Q23 Analysis
Q23 showed a 3.5× improvement due to faster scans, vectorized joins, and columnar shuffle writes.
4.3 Q72 Analysis
Q72 did not benefit because it contains many consecutive joins that are better suited to Spark's Whole‑Stage Codegen; this query falls back at the driver level to vanilla Spark.
4.4 Try It Yourself
Users can enable the feature on Baidu Cloud BMR by setting spark.sql.execution.clickhouse=true. Currently only Parquet and ORC sources are supported.
References
Example code: https://github.com/copperybean/study-codes/blob/main/java/sum-ints.java
Computer Systems: A Programmer's Perspective, Figure 6.23
Loop unrolling: https://en.wikipedia.org/wiki/Loop_unrolling
ClickHouse JIT compiler blog: https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit
Q23 SQL source: https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q23b.sql
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Baidu Intelligent Cloud Tech Hub
