Gluten Vectorized Engine: Boosting Spark Performance with Native Execution

This article introduces the Gluten vectorized engine, explains why Spark’s CPU bottleneck motivates integrating native vectorized back‑ends via Substrait, describes the architecture and component design, reports current performance gains of up to three‑fold, and outlines ongoing development and future work.

DataFunSummit

Mar 29, 2023

Gluten is an open‑source vectorized engine designed to accelerate Apache Spark by offloading compute‑intensive operators to native execution engines, addressing the CPU bottleneck that limits Spark’s performance despite years of JVM‑based optimizations.

The project integrates native engines such as Velox, ClickHouse, and Apache Arrow through a Substrait plan translation layer, converting Spark’s physical plan into a language‑agnostic representation that native back‑ends can execute via JNI, while preserving Spark’s master/worker architecture.
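To make the translation step more concrete, here is a minimal Python sketch of turning an operator tree into a serialized, language‑neutral plan. It does not use the real Substrait schema or any Gluten API: the `PlanNode` class, the `to_neutral_plan` function, and the JSON layout are all hypothetical, chosen only to illustrate handing a backend‑independent plan across the JVM/native boundary.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PlanNode:
    """Hypothetical stand-in for one operator of a Spark physical plan."""
    op: str
    args: Dict[str, Any] = field(default_factory=dict)
    children: List["PlanNode"] = field(default_factory=list)


def to_neutral_plan(node: PlanNode) -> Dict[str, Any]:
    """Recursively convert the tree into plain dicts (illustrative layout, not Substrait)."""
    return {
        "rel": node.op,
        "args": node.args,
        "inputs": [to_neutral_plan(child) for child in node.children],
    }


# A toy plan: read -> filter -> aggregate (the root is the last operator to run).
plan = PlanNode(
    "aggregate",
    {"group_by": ["l_returnflag"], "measures": ["sum(l_quantity)"]},
    [PlanNode("filter", {"condition": "l_quantity > 10"},
              [PlanNode("read", {"table": "lineitem"})])],
)

# In this sketch, the serialized bytes are what would be passed to the native side.
payload = json.dumps(to_neutral_plan(plan)).encode("utf-8")

# The receiving side only needs the neutral format, not the original classes.
decoded = json.loads(payload)
print(decoded["rel"], "<-", decoded["inputs"][0]["rel"], "<-",
      decoded["inputs"][0]["inputs"][0]["rel"])
```

The point of the design is that the consumer depends only on the serialized format, so several back‑ends can accept the same plan without knowing anything about Spark’s internal classes.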

Gluten’s architecture comprises several key components: a plan conversion module that injects extension rules to produce Substrait plans, memory management that leverages Spark’s unified memory manager to control off‑heap native allocations, a columnar shuffle manager for efficient column‑oriented data exchange, and a shim layer to support multiple Spark versions. Fallback mechanisms ensure that operators the native back‑end does not support are executed by Spark’s regular JVM engine instead.
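The fallback idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Gluten’s implementation: the `Operator` class, the `NATIVE_SUPPORTED` set, and both helper functions are hypothetical. The sketch tags each operator as native or JVM and counts the points where data has to cross between the two engines.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical set of operators the native back-end can run; not Gluten's real list.
NATIVE_SUPPORTED: Set[str] = {"scan", "filter", "project", "hash_aggregate"}


@dataclass
class Operator:
    name: str
    children: List["Operator"] = field(default_factory=list)
    engine: str = "unassigned"


def assign_engines(op: Operator) -> None:
    """Tag each operator 'native' if supported, otherwise fall back to 'jvm'."""
    for child in op.children:
        assign_engines(child)
    op.engine = "native" if op.name in NATIVE_SUPPORTED else "jvm"


def count_transitions(op: Operator) -> int:
    """Count parent/child edges where execution switches between engines."""
    total = 0
    for child in op.children:
        if child.engine != op.engine:
            total += 1
        total += count_transitions(child)
    return total


# 'python_udf' is not in the supported set, so it falls back to the JVM.
plan = Operator("hash_aggregate", [
    Operator("python_udf", [
        Operator("filter", [Operator("scan")]),
    ]),
])
assign_engines(plan)
print(plan.engine, plan.children[0].engine, count_transitions(plan))
```

In this example the unsupported operator sits between two native ones, so the plan switches engines twice. Each switch would typically require converting data between columnar and row formats, which is why a single fallback in the middle of a plan can be more expensive than it first appears.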

Performance evaluations on the TPC‑H benchmark show that Gluten with the Velox or ClickHouse back‑end roughly halves overall query latency (about a 2× overall speed‑up) and achieves up to a 3.6× speed‑up on individual queries, a substantial gain over vanilla Spark.
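The difference between an overall speed‑up and a per‑query speed‑up is simple arithmetic, sketched below in Python. The latencies are made‑up placeholder values, not measurements from the benchmark, and the function names are hypothetical. Overall speed‑up is assumed here to mean the ratio of total runtimes, which is one common convention.

```python
from typing import Dict


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Speed-up factor: how many times faster the accelerated run is."""
    return baseline_s / accelerated_s


def overall_speedup(baseline: Dict[str, float], accelerated: Dict[str, float]) -> float:
    """Ratio of total runtimes summed across all queries."""
    return sum(baseline.values()) / sum(accelerated.values())


# Placeholder latencies in seconds (illustrative only, NOT benchmark results).
vanilla = {"q1": 36.0, "q2": 20.0, "q3": 24.0}
native = {"q1": 10.0, "q2": 16.0, "q3": 14.0}

per_query = {q: speedup(vanilla[q], native[q]) for q in vanilla}
total = overall_speedup(vanilla, native)

for query, factor in per_query.items():
    print(f"{query}: {factor:.2f}x")
print(f"overall: {total:.2f}x")
```

With these placeholder numbers the best single query is far above the overall figure, because the overall ratio is dominated by the queries that take the most time and improve the least.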

In terms of current development status, Gluten supports all 22 TPC‑H queries on Velox and 21 of them on ClickHouse. Ongoing work covers TPC‑DS support, additional data types (Float, Binary, Decimal, and complex types), cloud object‑store caching, and enhancements to the columnar shuffle manager.

The project is driven by Intel, Kyligence, Bigo, and the broader open‑source community, inviting contributions via its GitHub repository.
