Blaze: A Native Vectorized Execution Engine for Spark – Architecture, Production Optimizations, and Future Plans
Blaze is Kuaishou's self‑developed native execution engine that leverages Rust, DataFusion, and SIMD vectorization to accelerate Spark workloads, delivering a 30%+ compute improvement in production. This write‑up covers its architecture, production‑grade optimizations, and the roadmap for broader adoption.
Blaze is Kuaishou's self‑developed native execution engine built on Rust and the DataFusion framework, designed to exploit SIMD vectorization and native code for Spark workloads, achieving about a 30% increase in compute performance.
The presentation covers three main topics: the principles and architecture of Blaze, production‑grade deep optimizations, and current progress with future plans.
1. Blaze Principles and Architecture Design
To understand Blaze, we first review Spark's evolution. Spark 1.0 used an interpreted execution model with high overhead. Spark 2.0 introduced WholeStageCodegen, compiling multiple operators into a single function, improving runtime efficiency. Spark 3.0 added Adaptive Query Execution (AQE), enabling dynamic plan optimization at runtime.
Vectorized execution, a widely adopted direction, processes data in batches of rows laid out column‑wise, using SIMD instructions to improve cache friendliness and pairing naturally with columnar storage formats like Parquet.
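A minimal sketch of why columnar batches help (the struct and field names are illustrative, not DataFusion's API): a tight loop over contiguous column vectors is exactly the pattern compilers auto‑vectorize into SIMD instructions, and every cache line fetched is full of useful values.

```rust
// Illustrative columnar batch: each field is a contiguous column vector.
struct ColumnBatch {
    price: Vec<f64>,
    qty: Vec<f64>,
}

// Tight loop over contiguous columns: the compiler can turn this into
// SIMD multiply-adds, unlike a row-at-a-time interpreter that chases
// pointers and branches per value.
fn revenue(batch: &ColumnBatch) -> f64 {
    batch.price.iter().zip(&batch.qty).map(|(p, q)| p * q).sum()
}

fn main() {
    let batch = ColumnBatch {
        price: vec![2.0, 3.0, 4.0],
        qty: vec![10.0, 10.0, 10.0],
    };
    println!("revenue = {}", revenue(&batch)); // 90
}
```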
Blaze combines Spark's distributed processing with DataFusion's native vectorized operators. Its architecture includes four core modules:
Native Engine: DataFusion‑based native operators, memory management, and FFI support.
ProtoBuf: Defines the operator description protocol between JVM and native side.
JNI Bridge: Enables mutual calls between Spark extensions and the native engine.
Spark Extension: Translates Spark operators to native operators.
The execution flow translates Spark's physical plan into a native plan, submits it, and runs it on the native engine.
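The translation step can be sketched as a recursive walk over the Spark physical plan, emitting a native plan description for the engine to execute. These enum and variant names are hypothetical stand‑ins, not Blaze's actual ProtoBuf protocol.

```rust
// Hypothetical Spark-side physical plan nodes (illustrative only).
#[derive(Debug, PartialEq)]
enum SparkPlan {
    FileScan { path: String },
    Filter { child: Box<SparkPlan>, predicate: String },
}

// Hypothetical native plan nodes the engine would deserialize and run.
#[derive(Debug, PartialEq)]
enum NativePlan {
    ParquetExec { path: String },
    FilterExec { child: Box<NativePlan>, predicate: String },
}

// Recursively translate each supported operator into its native
// counterpart; returning None for an unsupported node is what would
// trigger a fallback to JVM execution.
fn translate(plan: &SparkPlan) -> Option<NativePlan> {
    match plan {
        SparkPlan::FileScan { path } => {
            Some(NativePlan::ParquetExec { path: path.clone() })
        }
        SparkPlan::Filter { child, predicate } => Some(NativePlan::FilterExec {
            child: Box::new(translate(child)?),
            predicate: predicate.clone(),
        }),
    }
}

fn main() {
    let plan = SparkPlan::Filter {
        child: Box::new(SparkPlan::FileScan { path: "events.parquet".into() }),
        predicate: "x > 1".into(),
    };
    println!("{:?}", translate(&plan));
}
```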
2. Production‑Grade Deep Optimizations
Fallback Mechanism: Implements fine‑grained fallback for operators or UDFs that lack native implementations, bridging JVM and native execution via Arrow FFI.
CBO‑Based Conversion Strategy: Uses cost‑based rules to skip conversions whose row‑column transformation overhead would outweigh the native speedup, preserving or improving end‑to‑end performance.
Efficient Vectorized Data Transfer Format: Replaces row‑oriented shuffle serialization with a custom byte‑transpose columnar format, reducing redundancy and improving compression ratios, which lowers shuffle data volume.
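The idea behind byte transposition can be shown in a few lines (a simplified sketch, not Blaze's actual wire format): instead of serializing a column value by value, emit byte plane 0 of every value, then byte plane 1, and so on. Numerically close values then produce long runs of identical bytes, typically zeros in the high planes, which general‑purpose compressors exploit.

```rust
// Byte-transpose a column of i32 values: output all low bytes first,
// then the next byte plane, and so on (little-endian planes).
fn byte_transpose(col: &[i32]) -> Vec<u8> {
    let n = col.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in col.iter().enumerate() {
        let bytes = v.to_le_bytes();
        for plane in 0..4 {
            // Plane-major layout groups similar bytes together.
            out[plane * n + i] = bytes[plane];
        }
    }
    out
}

fn main() {
    // Small values: the three high planes become all-zero runs,
    // which compress far better than interleaved [1,0,0,0, 2,0,0,0, ...].
    let out = byte_transpose(&[1, 2, 3]);
    println!("{:?}", out); // [1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```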
Multi‑Level Memory Management: Coordinates on‑heap and off‑heap memory, providing adaptive spill strategies and multi‑stage memory handling to ensure stability in production.
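The spill contract can be sketched as a reservation protocol (type and method names here are assumptions, not Blaze's API): an operator asks a shared pool to grow its reservation, and when the budget is exhausted the pool tells it to spill to disk before retrying.

```rust
// Simplified memory pool with a fixed budget (illustrative only).
struct MemoryPool {
    limit: usize,
    used: usize,
}

// Outcome of a reservation request.
enum Grant {
    Granted,
    SpillRequired,
}

impl MemoryPool {
    fn new(limit: usize) -> Self {
        MemoryPool { limit, used: 0 }
    }

    // Try to reserve `bytes`; over budget, the caller must spill
    // (freeing memory via `release`) and then retry.
    fn acquire(&mut self, bytes: usize) -> Grant {
        if self.used + bytes <= self.limit {
            self.used += bytes;
            Grant::Granted
        } else {
            Grant::SpillRequired
        }
    }

    fn release(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

fn main() {
    let mut pool = MemoryPool::new(1024);
    match pool.acquire(2048) {
        Grant::Granted => println!("reserved"),
        Grant::SpillRequired => println!("operator must spill first"),
    }
}
```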
Improved Aggregation Algorithm: Implements bucket‑based merge aggregation with O(n) complexity, outperforming Spark's default sort‑based spill handling.
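The complexity difference comes down to grouping by hash bucket in one pass instead of sorting first. A minimal in‑memory sketch (Blaze's real implementation also merges spilled partitions, which this omits):

```rust
use std::collections::HashMap;

// Hash-based aggregation in O(n): one pass over the input, each key
// updated in place in its bucket, versus O(n log n) for sorting the
// whole input before aggregating.
fn sum_by_key(rows: &[(&str, i64)]) -> HashMap<String, i64> {
    let mut buckets: HashMap<String, i64> = HashMap::new();
    for (key, value) in rows {
        *buckets.entry(key.to_string()).or_insert(0) += value;
    }
    buckets
}

fn main() {
    let totals = sum_by_key(&[("a", 1), ("b", 2), ("a", 3)]);
    println!("{:?}", totals.get("a")); // Some(4)
}
```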
Expression Reuse Optimization: Merges operators with duplicate expressions, caches intermediate results, and reduces repeated computation, often doubling execution speed for complex expressions.
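The caching half of this can be sketched as memoization keyed by a canonicalized expression (the cache type and string key are illustrative assumptions): a duplicated subexpression is computed once per batch and every later reference reuses the stored result.

```rust
use std::collections::HashMap;

// Illustrative per-batch expression cache; the string key stands in
// for a canonicalized expression tree.
struct ExprCache {
    results: HashMap<String, Vec<f64>>,
}

impl ExprCache {
    fn new() -> Self {
        ExprCache { results: HashMap::new() }
    }

    // Evaluate `expr_key` at most once; later calls reuse the cached
    // column instead of recomputing it.
    fn eval<F>(&mut self, expr_key: &str, compute: F) -> &Vec<f64>
    where
        F: FnOnce() -> Vec<f64>,
    {
        self.results.entry(expr_key.to_string()).or_insert_with(compute)
    }
}

fn main() {
    let mut cache = ExprCache::new();
    let mut evals = 0;
    // Three operators referencing the same expression trigger only
    // one actual evaluation.
    for _ in 0..3 {
        cache.eval("upper(name)", || {
            evals += 1;
            vec![1.0]
        });
    }
    println!("expensive expression evaluated {} time(s)", evals); // 1
}
```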
3. Current Progress and Future Plans
Blaze now supports vectorized Parquet I/O, a full set of common operators and expressions, and an internal Remote Shuffle Service. In TPC‑H benchmarks, Blaze achieves a 2.8× speedup across all 22 test cases, and in production it delivers over 30% compute improvement for ad‑hoc workloads and covers 40% of ETL tasks.
Future directions include continuous optimization and full production rollout, extending support to more engines and data‑lake scenarios, and building an open‑source community. The project is hosted at https://github.com/kwai/blaze with over 934 stars and active contributions.
DataFunSummit