Big Data 22 min read

How Zhihu Optimized Spark Jobs with Gluten: A Practical Deep‑Dive

This article details Zhihu's end‑to‑end experience of migrating Spark SQL workloads to the open‑source Gluten framework, covering background performance benchmarks, the architecture of Gluten and Velox, consistency and performance challenges encountered during migration, the concrete fixes applied, and the resulting resource savings and future plans.

DataFunTalk
DataFunTalk
DataFunTalk
How Zhihu Optimized Spark Jobs with Gluten: A Practical Deep‑Dive

Background

Databricks published the Photon paper at ACM SIGMOD 2022, showing a 4× speedup on TPC‑H 3000SF and an open‑source Gluten implementation achieving 2.7× on TPC‑H 2000SF. Zhihu's offline analytics platform runs primarily on Spark SQL with >10,000 daily jobs and constantly seeks performance and resource‑usage improvements. Initial Gluten experiments in 2023 revealed serious shuffle‑stability problems, so the team first focused on Celeborn‑based shuffle fixes and Spark upgrades. In Q1 2025 the team resumed a targeted Gluten migration on the most resource‑intensive jobs, paused after encountering many compatibility issues, and restarted the effort in Q1 2026 after learning from ByteDance and AntGroup case studies.

Gluten Overview

Gluten is a middle‑layer that offloads the JVM‑based compute part of Spark to native engines such as ClickHouse and Velox while preserving Spark’s distributed control flow. Its core capabilities are:

Reuse Spark’s distributed control flow.

Convert Spark physical plans to Substrait, then to native execution plans.

Offload performance‑critical data processing to native engines.

Velox Overview

Velox is an open‑source C++ execution engine that provides reusable, extensible, high‑performance data‑processing components. It supplies:

Vectorized scalar, aggregate and window functions that follow Presto and Spark semantics.

Relational operators (scan, write, projection, filter, join, shuffle, hash, merge, nested‑loop join, unnest, etc.).

I/O connectors for ORC/DWRF, Parquet, Nimble and storage adapters (S3, HDFS, GCS, ABFS, local).

A generic type system supporting primitive, complex and nested types.

Columnar memory layout compatible with Arrow, supporting flat, dictionary, constant, sequence/RLE encodings and lazy materialisation.

A fully vectorized expression evaluator that runs on top of the columnar layout.

Network serializers for PrestoPage and Spark UnsafeRow.

Resource‑management primitives (memory arena, buffer pool, thread pools, spilling, caching).

Why Gluten Is Faster Than Spark

Core compute logic runs in Velox as pure C++ code, avoiding JVM interpretation overhead; JIT optimisation only helps after many repetitions.

All data stays columnar throughout the pipeline, enabling vectorised execution (e.g., AVX‑512 can process 16 integers per instruction).

Column‑wise storage improves cache locality; CPUs can pre‑fetch contiguous column data, raising cache‑hit rates.

Batch processing reduces function‑call overhead compared with row‑wise loops.

Job Migration Process

Early validation used manual dual‑run, logging, and incremental migration. Later the team built an automated migration pipeline (diagram omitted) that performed the following steps for each job:

Run the original Spark job and the Gluten‑enabled version in parallel.

Compare results and resource usage.

If results are consistent and resource savings exceed 10 %, promote the Gluten version to production.

Data Consistency Issues

Unstable Spark functions (RowNumber, CollectSet, Rand, First, FirstValue, etc.) caused nondeterministic results; replaced with stable equivalents (e.g., RowNumber → Rank, CollectSet → SortArray, Limit → GlobalSort before Limit).

Differences between Spark and Gluten+Velox implementations of GetJsonObject produced mismatched JSON strings (extra spaces). The team upgraded the Simd library and added a compatibility layer in Velox.

Older Parquet (≤ 1.8.1) stored strings as signed bytes, while Velox used unsigned comparison, leading to incorrect index filtering; the fix skips index filtering for those versions.

Decimal handling in Velox stored values as BigInt and applied division after conversion, which could overflow; a patch corrected the conversion path.

NativeUnion operator mistakenly treated type names as field names, causing identical column values; the bug was fixed in Velox.

Array‑type columns with mismatched RowGroup batch sizes triggered null‑buffer reuse errors; a dedicated fix ensured proper buffer size updates.

CaseWhen expressions that called get_struct_field violated the contract of reusing the same result vector; the implementation now reuses the passed‑in vector when possible.

After applying these patches, the proportion of double‑run jobs with inconsistent results dropped from ~40 % to <10 %.

Performance Issues

Testing the top 200 resource‑heavy jobs revealed that ~40 % showed no benefit or even regression; ~20 % were negative, and ~20 % had <10 % improvement. Most problematic jobs were IO‑bound, spill‑heavy, or contained many Union operators. The team applied two main optimisations:

Disabled Gluten’s asynchronous prefetch (set spark.gluten.sql.columnar.backend.velox.IOThreads=0), which removed the shared IO thread‑pool bottleneck on slow HDFS nodes and restored IO throughput.

Enabled Velox’s prefix‑sort spill optimisation (set

spark.gluten.sql.columnar.backend.velox.spillPrefixsortEnabled=true

), reducing spill volume for several jobs.

These changes reduced the proportion of jobs with no clear benefit to roughly 15‑20 %.

Stability Issues

GCC vectorisation of the LZO decompression routine caused occasional CoreDump; disabling vectorisation for lzoDecompress eliminated the crash.

CoreDump in libhdfs arose from invoking JNI methods during thread‑destructor callbacks on JVM‑created threads; the issue is tracked by Hadoop JIRA HDFS‑16021.

Decimal storage format differences between Spark (legacy FIXED_LEN_BYTE_ARRAY) and Velox (auto‑selected Int32/Int64) triggered compatibility exceptions in downstream engines.

Driver memory pressure increased because Gluten collects many more task metrics than vanilla Spark; reducing metric collection or increasing driver memory resolved the out‑of‑memory failures.

Benefits and Future Plans

By focusing on the top‑resource jobs, Zhihu migrated 2,446 out of 3,200 Spark SQL jobs (≈ 76 % success). The migration saved roughly 130 k core‑hours (44.6 % reduction) and 280 k GB of memory (57 % reduction). The resource‑coverage now approaches 80 % of total Spark SQL consumption.

Continue fixing the remaining failed jobs (mainly performance‑related: operator fallback, spill‑heavy, Text‑format scans).

Explore data‑sync pipelines and Spark‑Jar job migration.

Investigate further Parquet write‑performance gains, leveraging Photon‑style optimisations.

Evaluate the feasibility of migrating Flink workloads to Gluten, noting that community benchmarks report up to 3× throughput improvement for Flink.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

PerformanceOptimizationbig dataSparkVeloxGlutenResource Savings
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.