How SF Tech Cut 10,000 CPU Cores with Apache Gluten – A Deep Dive
This article details how SF Technology adopted Apache Gluten with Velox to accelerate Spark queries, describing the architecture, task lifecycle, management framework, simulation system, unified SQL, fallback mechanisms, dynamic memory tuning, columnar shuffle, and future plans that together saved over 10,000 CPU cores and reduced operator fallback rates to around 4%.
Since January 2024, SF Technology has been using Apache Gluten + Velox to accelerate Spark queries, deploying more than 16,000 Gluten tasks, saving over 10,000 CPU cores, and achieving performance improvements of more than 50% compared with native Spark while reducing operator fallback rates to about 4%.
Gluten works by injecting a set of plan-transformation rules into Spark, converting Spark’s physical plan into a Substrait plan and passing it via JNI to a native backend (Velox or ClickHouse). The native engine executes the plan and returns columnar batches to Spark. In TPC‑H benchmarks this architecture yields 2–14× speedups.
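As a rough illustration, enabling Gluten on a Spark job is mostly a matter of configuration, following the keys documented by the Gluten project; the sizes below are illustrative, and the plugin class name varies by Gluten version (older releases used `io.glutenproject.GlutenPlugin`):

```properties
# Load the Gluten plugin so its rules can rewrite the physical plan
spark.plugins=org.apache.gluten.GlutenPlugin
# Columnar shuffle manager, so shuffle data stays in columnar form
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
# The native engine (Velox) allocates from Spark's off-heap pool
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=20g
```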
The Gluten task lifecycle at SF includes selecting candidate tasks from over 700,000 offline jobs (Hive/Spark) whose average vcore‑seconds in the last 7 days exceed 10,000, rewriting SQL for compatibility, running dual‑execution simulations (native vs. original engine), and promoting only tasks that pass correctness and consistency checks to production via a gray‑release system.
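The selection step above can be sketched in a few lines; the field names (`task_id`, `daily_vcore_seconds`) and data shape are illustrative assumptions, not SF's actual schema:

```python
# Hypothetical candidate selection: keep offline tasks whose average
# vcore-seconds over the last 7 days exceeds 10,000.
VCORE_SECONDS_THRESHOLD = 10_000

def select_candidates(tasks):
    """Return ids of tasks whose 7-day average vcore-seconds exceeds the threshold."""
    candidates = []
    for task in tasks:
        history = task["daily_vcore_seconds"][-7:]  # last 7 days of usage
        if history and sum(history) / len(history) > VCORE_SECONDS_THRESHOLD:
            candidates.append(task["task_id"])
    return candidates
```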
The management framework consists of four modules: a simulation system, a unified‑SQL layer (built on Calcite) that hides engine‑specific syntax, a fallback manager that automatically switches failed Gluten tasks back to their original engine, and a gray‑release system that decides per‑task engine and parameters via configuration.
The simulation system extracts the original SQL, rewrites it to isolate production and test catalogs, runs the query on both Gluten and the original engine, compares result sets row‑by‑row, and generates a report; only tasks with identical results are promoted.
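The core of that dual-execution check can be sketched as follows; `run_query` is a hypothetical stand-in for submitting SQL to a real engine, and the strict row-by-row equality mirrors the "identical results" promotion rule:

```python
# Minimal sketch of the dual-execution consistency check.
def results_match(rows_original, rows_gluten):
    """Compare two result sets row by row; lengths, order, and values must all match."""
    if len(rows_original) != len(rows_gluten):
        return False
    return all(a == b for a, b in zip(rows_original, rows_gluten))

def simulate(sql, run_query):
    """Run the same SQL on the original engine and on Gluten, then compare."""
    original = run_query(sql, engine="spark")
    candidate = run_query(sql, engine="gluten")
    return {"sql": sql, "promote": results_match(original, candidate)}
```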
The unified‑SQL layer enables a single SQL statement to run transparently on multiple engines (Hive, Spark, Presto, Doris, ClickHouse, Gluten) by applying custom Calcite rules and generating engine‑specific code.
Fallback management differs from Gluten’s built‑in fallback: when a Gluten task fails, it is immediately rerun on its original engine (Hive or Spark), ensuring no user impact.
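The retry logic amounts to a try/except around submission; `submit` is a hypothetical engine-submission hook, not SF's actual API:

```python
# Sketch of the fallback manager: a failed Gluten run is immediately
# retried on the task's original engine, so the job still completes
# from the user's point of view.
def run_with_fallback(task, submit):
    try:
        return submit(task, engine="gluten")
    except Exception:
        # Any Gluten failure falls back to the engine the task originally used.
        return submit(task, engine=task["original_engine"])
```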
The gray‑release system configures which engine a task uses at runtime, allowing seamless transitions between Hive, Spark, and Gluten, as well as blacklisting or de‑commissioning Gluten tasks.
Performance results show that since March 2025, with Whole‑Stage Fallback, dynamic memory adjustment, and columnar shuffle, Gluten has saved more than 10,000 CPU cores, reduced CPU consumption to roughly 50% of native Spark’s, and runs about 16,000 tasks daily.
Whole‑Stage Fallback automatically rolls back an entire stage to Spark when the number of operator fallbacks exceeds a threshold; after tuning the threshold and memory settings, performance doubled and fallback rates dropped to ~20%.
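The decision rule reduces to a threshold check per stage; the threshold value below is illustrative, not SF's tuned setting:

```python
# Hedged sketch of Whole-Stage Fallback: if too many operators in a stage
# would fall back to row-based Spark execution, run the whole stage on
# vanilla Spark instead of paying repeated columnar<->row conversions.
FALLBACK_THRESHOLD = 3  # illustrative value

def choose_stage_engine(fallback_operator_count, threshold=FALLBACK_THRESHOLD):
    """Return 'spark' when too many operators fall back, else 'gluten'."""
    return "spark" if fallback_operator_count > threshold else "gluten"
```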
Dynamic memory adjustment reallocates off‑heap memory to on‑heap based on per‑stage operator fallback ratios, keeping total memory constant while preventing out‑of‑memory failures.
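One way to picture this rebalancing: shift off-heap memory (used by the native engine) back on-heap (used by fallen-back Spark operators) in proportion to the stage's fallback ratio, keeping the executor total fixed. The proportional rule here is an assumption for illustration, not SF's published formula:

```python
# Illustrative sketch of dynamic memory adjustment between pools.
def rebalance_memory(total_mb, base_offheap_mb, fallback_ratio):
    """Move fallback_ratio of the off-heap pool on-heap; the total stays constant."""
    offheap_mb = int(base_offheap_mb * (1.0 - fallback_ratio))
    onheap_mb = total_mb - offheap_mb
    return onheap_mb, offheap_mb
```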
Columnar Shuffle, powered by Gazelle, replaces Spark’s row‑based shuffle with a columnar approach, reducing data movement overhead and improving compression; it integrates with RSS (Uniffle, Celeborn) for further gains.
Future work includes extending Gluten to support Flink tasks (targeting >2× speedup), reducing the remaining ~20% fallback rate by adding missing operators and functions, and adding native support for TextFile tables to avoid costly row‑to‑column conversions.
