
Gluten Vectorized Engine: Boosting Spark Performance with Native Execution

This article introduces the Gluten vectorized engine. It explains why Spark's CPU bottleneck motivates integrating native vectorized back-ends via Substrait, describes Gluten's architecture and component design, reports current performance gains of up to 3.6× on individual TPC-H queries, and outlines ongoing development and future work.

DataFunSummit

Gluten is an open‑source vectorized engine designed to accelerate Apache Spark by offloading compute‑intensive operators to native execution engines, addressing the CPU bottleneck that limits Spark’s performance despite years of JVM‑based optimizations.

The project integrates native engines such as Velox, ClickHouse, and Apache Arrow through a Substrait plan translation layer, converting Spark’s physical plan into a language‑agnostic representation that native back‑ends can execute via JNI, while preserving Spark’s master/worker architecture.
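
To make the translation step concrete, here is a minimal sketch. It is not Gluten's code and does not use the real Substrait schema (Substrait plans are protobuf messages; JSON is used here only for readability). The `PlanNode` class, the `to_neutral` and `serialize` helpers, and the field names `rel`, `args`, and `inputs` are all hypothetical. The point it illustrates is that once a plan tree is flattened into a neutral, serialized form, a back-end written in any language can parse and run it.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PlanNode:
    """Hypothetical stand-in for one operator in a Spark physical plan."""
    op: str
    args: Dict[str, Any] = field(default_factory=dict)
    children: List["PlanNode"] = field(default_factory=list)


def to_neutral(node: PlanNode) -> Dict[str, Any]:
    """Recursively convert the plan tree into plain dicts and lists."""
    return {
        "rel": node.op,
        "args": node.args,
        "inputs": [to_neutral(child) for child in node.children],
    }


def serialize(plan: PlanNode) -> bytes:
    """Encode the neutral plan as bytes, the form that would cross a language boundary."""
    return json.dumps(to_neutral(plan), sort_keys=True).encode("utf-8")


# Toy plan: read a table, filter it, then aggregate.
plan = PlanNode(
    "aggregate",
    {"group_by": ["l_returnflag"], "measures": ["sum(l_quantity)"]},
    [PlanNode("filter", {"condition": "l_shipdate <= '1998-09-02'"},
              [PlanNode("read", {"table": "lineitem"})])],
)

payload = serialize(plan)
decoded = json.loads(payload)  # what a back-end would see after parsing
```

In this sketch `payload` plays the role of the buffer that would be passed across JNI, and `decoded` is the tree the native side would rebuild before executing it.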

Gluten's architecture comprises several key components: a plan conversion module that injects extension rules to produce Substrait plans, memory management that relies on Spark's unified memory manager to control off-heap native allocations, a columnar shuffle manager for efficient column-oriented data exchange, and a shim layer that supports multiple Spark versions. A fallback mechanism ensures that any operator the native back-end does not support is executed by Spark's own JVM-based engine instead.
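
The fallback idea can be sketched in a few lines. This is an illustrative model under stated assumptions, not Gluten's planner: the `Operator` class, the `NATIVE_SUPPORTED` set, and the `tag`, `insert_transitions`, and `describe` helpers are hypothetical, and the choice of which operators count as supported is made up for the example. The sketch tags each operator as native or JVM, then inserts a layout conversion wherever a columnar (native) operator meets a row-based (JVM) one.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Assumption: an illustrative set of operators the native back-end can run.
NATIVE_SUPPORTED: Set[str] = {"scan", "filter", "project", "hash_aggregate"}


@dataclass
class Operator:
    """Hypothetical physical operator; `native` is set by `tag` below."""
    name: str
    children: List["Operator"] = field(default_factory=list)
    native: bool = False


def tag(op: Operator) -> Operator:
    """Mark each operator as native when supported, otherwise leave it on the JVM."""
    op.native = op.name in NATIVE_SUPPORTED
    for child in op.children:
        tag(child)
    return op


def insert_transitions(op: Operator) -> Operator:
    """Add a conversion operator wherever the data layout changes between parent and child."""
    new_children = []
    for child in op.children:
        child = insert_transitions(child)
        if op.native and not child.native:
            child = Operator("row_to_columnar", [child], native=True)
        elif not op.native and child.native:
            child = Operator("columnar_to_row", [child], native=False)
        new_children.append(child)
    op.children = new_children
    return op


def describe(op: Operator, depth: int = 0) -> List[str]:
    """Flatten the plan into indented lines for inspection."""
    engine = "native" if op.native else "jvm"
    lines = [f"{'  ' * depth}{op.name} [{engine}]"]
    for child in op.children:
        lines.extend(describe(child, depth + 1))
    return lines


# "window" is not in NATIVE_SUPPORTED here, so it falls back to the JVM.
plan = Operator("hash_aggregate", [Operator("window", [Operator("filter", [Operator("scan")])])])
planned = insert_transitions(tag(plan))
print("\n".join(describe(planned)))
```

Every fallback in the middle of a plan costs two layout conversions, one on each side of the unsupported operator, which is one reason broader operator coverage matters for end-to-end speed.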

Performance evaluations on the TPC-H benchmark show that Gluten with the Velox or ClickHouse back-end roughly doubles overall performance, that is, halves total query time, and achieves speed-ups of up to 3.6× on individual queries compared with vanilla Spark.
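
Speed-up and latency are reciprocal: a speed-up of s means the query takes 1/s of its original time. The short sketch below works this out for the two figures quoted above. The 100-second baseline is a hypothetical number chosen only to make the arithmetic easy to read, and the helper names are illustrative.

```python
def latency_fraction(speedup: float) -> float:
    """Fraction of the baseline runtime that remains after a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def runtime_after(baseline_seconds: float, speedup: float) -> float:
    """Runtime in seconds once the speed-up is applied to the baseline."""
    return baseline_seconds * latency_fraction(speedup)


# Hypothetical baseline, not a measured value.
baseline = 100.0
overall = runtime_after(baseline, 2.0)    # 2x speed-up leaves half the runtime
best_case = runtime_after(baseline, 3.6)  # 3.6x speed-up leaves about 28% of the runtime
```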

At present, Gluten supports all 22 TPC-H queries on Velox and 21 of them on ClickHouse. Ongoing work covers TPC-DS support, additional data types (Float, Binary, Decimal, and complex types), caching for cloud object stores, and enhancements to the columnar shuffle manager.

The project is driven by Intel, Kyligence, Bigo, and the broader open‑source community, inviting contributions via its GitHub repository.

Written by

DataFunSummit
