Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation
This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.
1 Spark Performance Optimization Background
Spark originated at UC Berkeley's AMPLab and has become a leading open‑source engine for batch, streaming, and machine‑learning workloads. In Spark SQL, the driver generates a logical plan, optimizes it, and produces a physical plan that is split into stages and tasks; the driver's work is lightweight, while the tasks consume most of the compute resources.
1.1 Spark Overview
Typical SQL execution involves generating a logical plan, applying multiple optimization passes, and finally creating a physical plan that is divided into stages. Each stage consists of many tasks that run on executors, processing data row‑wise using JVM‑based operators.
1.2 Spark Task Characteristics
Tasks run on the JVM, which limits low‑level optimizations such as SIMD usage and efficient memory access. Row‑oriented data structures increase GC pressure and hinder vectorized execution.
1.3 Native Acceleration Idea
The proposed solution rewrites the task‑level execution path in C++ and stores data column‑wise, leveraging ClickHouse's high‑performance OLAP engine. By replacing Scan, Agg, and Shuffle operators with ClickHouse equivalents, the system can exploit columnar storage, SIMD, and loop‑unrolling techniques.
2 ClickHouse Performance Advantages
ClickHouse achieves superior CPU utilization by keeping CPU pipelines full, applying SIMD vectorization, and aggressively unrolling loops. Columnar storage enables efficient prefetching and reduces memory bandwidth pressure. The engine also uses C++ templates and just‑in‑time (JIT) compilation to generate highly optimized machine code.
2.1 CPU Pipeline and SIMD
Modern CPUs execute instructions in multiple pipeline stages; keeping all stages busy maximizes throughput. SIMD allows a single instruction to process multiple data elements simultaneously; for example, a 128‑bit register holds four 32‑bit values, yielding up to a 4× speedup for simple arithmetic.
2.2 ClickHouse Code Example
For a simple aggregation query, ClickHouse generates C++ loops with explicit loop unrolling and SIMD intrinsics, as illustrated by the following SQL snippet:
select f1 + f2, f3, f4, f5, f6 * f7 from tbl
2.3 Template‑Based Optimizations
ClickHouse heavily uses C++ templates to eliminate virtual function overhead and to produce type‑specific code paths, especially in hash‑based aggregation and join operators.
3 Spark Native Acceleration Design and Implementation
The integration follows two design principles: full compatibility with native Spark semantics and maximal performance gain. Execution plans are generated in the driver, serialized via Protobuf, and dispatched to a ClickHouse dynamic library through JNI. Custom operators such as SparkShuffleSink and SparkShuffleSource bridge Spark's shuffle framework with ClickHouse.
3.1 Execution Plan Serialization
Physical operators (e.g., FileScanExec) are mapped to ClickHouse equivalents (e.g., DFSSource) and their parameters are encoded in Protobuf messages, preserving stage boundaries and shuffle identifiers.
3.2 Decimal Type Compatibility
To ensure identical results, the Spark‑Native layer adjusts ClickHouse's decimal precision/scale inference to match Spark's Hive‑compatible rules.
3.3 Conditional Join Optimization
For semi‑joins with inequality predicates, the solution implements a custom code‑generated expression evaluator that avoids Cartesian product expansion, extending ClickHouse's JIT framework to support column‑index based inputs.
3.4 Fallback Mechanism
When an operator cannot be executed in ClickHouse, the system falls back to native Spark at the operator, stage, or driver level, guaranteeing that performance never degrades below the vanilla Spark baseline.
4 Acceleration Effect Analysis
Benchmarking on a 10 TB TPC‑DS workload shows an average 2.3× speedup, with some queries (e.g., Q23) achieving up to 3.5× improvement. Queries dominated by simple scans, hash joins, and columnar shuffle writes benefit most, while queries with many consecutive joins (e.g., Q72) fall back to Spark and see no gain.
4.1 TPC‑DS Performance Summary
The overall reduction in execution time translates to roughly halving cluster resource requirements for the same workload.
4.2 Practical Usage
Users can enable the feature in Baidu Cloud BMR by setting spark.sql.execution.clickhouse=true. Currently supported file formats are Parquet and ORC.
For more details, refer to the original presentation and the linked source code repositories.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.