Gluten Vectorized Engine: Boosting Spark Performance with Native Execution
This article introduces the Gluten vectorized engine: why Spark's CPU bottleneck motivates integrating native vectorized back-ends via Substrait, how its architecture and components are designed, the performance gains measured so far (up to 3.6× on individual queries), and ongoing development and future work.
Gluten is an open‑source vectorized engine designed to accelerate Apache Spark by offloading compute‑intensive operators to native execution engines, addressing the CPU bottleneck that limits Spark’s performance despite years of JVM‑based optimizations.
The project integrates native engines such as Velox, ClickHouse, and Apache Arrow through a Substrait plan translation layer, converting Spark’s physical plan into a language‑agnostic representation that native back‑ends can execute via JNI, while preserving Spark’s master/worker architecture.
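To make the plan-translation idea concrete, here is a minimal sketch in Python. It models a toy Spark-like physical plan as a tree and converts it into a plain, language-agnostic structure of the kind a native back-end could consume. All names (`to_substrait_like`, the dict layout, the operator names) are illustrative assumptions; the real Gluten pipeline emits Substrait protobuf plans and passes them across JNI.

```python
# Hypothetical sketch: convert a toy Spark-style physical plan tree into a
# language-agnostic, serializable representation. The dict schema here is
# invented for illustration; real Substrait plans are protobuf messages
# with a much richer relational algebra.

def to_substrait_like(node):
    """Recursively convert a toy physical-plan node into a plain dict."""
    return {
        "op": node["op"],                   # e.g. "Scan", "Filter", "Project"
        "params": node.get("params", {}),   # operator-specific arguments
        "inputs": [to_substrait_like(c) for c in node.get("children", [])],
    }

# A toy physical plan: Project(Filter(Scan(lineitem)))
spark_plan = {
    "op": "Project",
    "params": {"columns": ["l_orderkey", "l_extendedprice"]},
    "children": [{
        "op": "Filter",
        "params": {"condition": "l_extendedprice > 100"},
        "children": [{"op": "Scan", "params": {"table": "lineitem"}}],
    }],
}

native_plan = to_substrait_like(spark_plan)
# In the real system this representation would be serialized and handed to
# a native engine (Velox, ClickHouse) over JNI for execution.
print(native_plan["inputs"][0]["op"])  # Filter
```

The point of the intermediate representation is exactly this decoupling: Spark's planner produces the tree once, and any back-end that understands the shared format can execute it.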
Gluten’s architecture comprises several key components: a plan conversion module that injects extension rules to produce Substrait plans, memory management that leverages Spark’s unified memory manager to control off‑heap native allocations, a columnar shuffle manager for efficient column‑oriented data exchange, and a shim layer to support multiple Spark versions. Fallback mechanisms ensure unsupported operators revert to Spark’s native JVM execution.
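The fallback mechanism can be sketched as a simple per-operator routing decision: each operator is checked against the set the native back-end supports, and anything outside that set stays on Spark's JVM path. The names below (`NATIVE_SUPPORTED`, `choose_path`) are invented for this sketch; Gluten implements this as Spark planner rules, not a lookup like this.

```python
# Hypothetical sketch of operator-level fallback routing. The supported set
# and function names are illustrative, not Gluten's actual API.

NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}

def choose_path(operators):
    """Return (operator, execution path) pairs for a linear plan."""
    return [
        (op, "native" if op in NATIVE_SUPPORTED else "jvm-fallback")
        for op in operators
    ]

plan = ["Scan", "Filter", "CustomUDF", "HashAggregate"]
for op, path in choose_path(plan):
    print(f"{op}: {path}")
# CustomUDF falls back to Spark's JVM execution; the rest run natively.
```

In practice each fallback boundary also implies a columnar-to-row (or row-to-columnar) conversion, which is why minimizing fallbacks matters for end-to-end performance.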
Performance evaluations on the TPC-H benchmark show that Gluten with the Velox or ClickHouse back-end roughly halves overall query latency (about a 2× speed-up) and reaches up to 3.6× on individual queries, a substantial gain over vanilla Spark.
Current development status includes full support for all 22 TPC-H queries on Velox and 21 on ClickHouse, with ongoing work to add TPC-DS coverage, additional data types (Float, Binary, Decimal, and complex types), caching for cloud object stores, and enhancements to the columnar shuffle manager.
The project is driven by Intel, Kyligence, Bigo, and the broader open‑source community, inviting contributions via its GitHub repository.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.