Apache Spark Adaptive Query Execution and Kyuubi Optimization Practices for Data Warehousing

This article presents a detailed overview of Apache Spark's Adaptive Query Execution evolution, its optimization techniques, and performance gains, followed by an in‑depth discussion of Apache Kyuubi's architecture, security integrations, cloud‑native capabilities, and practical Rebalance + Z‑Order strategies that enhance data‑warehouse task efficiency and query performance.

DataFunSummit

Sep 27, 2022

Part 1: Apache Spark Adaptive Query Execution (AQE)

The talk begins with the history of AQE, tracing it from Spark 2.x, where it was a simple, buggy prototype, to Spark 3.0, where Intel introduced a richer framework that enables shuffle-reader optimizations such as skew-join handling and the local shuffle reader.

Key optimizations include:

Coalescing small reduce partitions to reduce task count and eliminate empty partitions.

Dynamic join strategy selection (broadcast hash join, sort-merge join, shuffled hash join) based on runtime statistics, with reported join speedups of 2-30%.

Handling skewed reduce partitions by splitting oversized partitions into smaller ones so that work is balanced across tasks.
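
To make the first and third optimizations concrete, here is a minimal pure-Python sketch of the ideas, not Spark's actual implementation. The function names are hypothetical. The default values are assumptions chosen to mirror Spark's documented defaults for `spark.sql.adaptive.advisoryPartitionSizeInBytes` (64 MB), `spark.sql.adaptive.skewJoin.skewedPartitionFactor` (5), and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes` (256 MB).

```python
from statistics import median

MB = 1024 * 1024


def coalesce_partitions(sizes, target_size=64 * MB):
    """Greedily merge adjacent shuffle partitions up to target_size bytes.

    Returns (start, end) index ranges, end exclusive. Empty partitions are
    folded into a neighbouring range; trailing empty ones are dropped.
    """
    ranges = []
    start, current = 0, 0
    for i, size in enumerate(sizes):
        # Close the current range if adding this partition would overshoot.
        if current > 0 and current + size > target_size:
            ranges.append((start, i))
            start, current = i, 0
        current += size
    if current > 0:
        ranges.append((start, len(sizes)))
    return ranges


def skewed_partitions(sizes, factor=5, threshold=256 * MB):
    """Indices of partitions larger than both factor * median and threshold."""
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > threshold]


# Seven reduce partitions collapse into three tasks.
shuffle_sizes = [s * MB for s in (10, 20, 0, 30, 70, 5, 0)]
print(coalesce_partitions(shuffle_sizes))   # [(0, 4), (4, 5), (5, 7)]

# One partition is far larger than the rest and would be split by AQE.
join_sizes = [s * MB for s in (100, 120, 90, 2000, 110)]
print(skewed_partitions(join_sizes))        # [3]
```

Spark only knows these sizes once the map stage has finished, which is why the optimizations have to happen at runtime rather than during static planning.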

NetEase has contributed more than 40 patches to the Spark community across the 3.1 and 3.2 releases, helping AQE become enabled by default in Spark 3.2 and delivering a near-100% performance improvement on TPC-DS benchmarks.

Part 2: Kyuubi + Spark for Data‑Warehouse Tasks

Kyuubi provides a multi-tenant, cloud-native gateway for Spark. It supports Thrift, JDBC, and REST interfaces, Kerberos and LDAP authentication, and extensions for Ranger-based row- and column-level security.

Security features include Kerberos integration (keytab and proxy), long‑lived proxy tokens, and fine‑grained data masking.

The architecture enables routing to both Kubernetes and YARN clusters, allowing flexible deployment.
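
As a rough illustration of how a client reaches the gateway, the sketch below assembles a HiveServer2-style JDBC URL of the kind Kyuubi accepts. The helper `kyuubi_jdbc_url`, the host name, and the user name are hypothetical. The sketch assumes the default Kyuubi frontend port 10009 and the `hive.server2.proxy.user` session variable for proxy-user access.

```python
def kyuubi_jdbc_url(host, port=10009, database="default",
                    session_vars=None, confs=None):
    """Build a jdbc:hive2:// URL: session variables after ';', configs after '?'."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if session_vars:
        # Session variables such as the proxy user follow the database name.
        url += "".join(f";{k}={v}" for k, v in session_vars.items())
    if confs:
        # Kyuubi and Spark configs for the session follow the '?' separator.
        url += "?" + ";".join(f"{k}={v}" for k, v in confs.items())
    return url


url = kyuubi_jdbc_url(
    "kyuubi.example.com",                                   # hypothetical host
    session_vars={"hive.server2.proxy.user": "etl_user"},   # hypothetical user
    confs={
        "kyuubi.engine.share.level": "USER",
        "spark.sql.adaptive.enabled": "true",
    },
)
print(url)
```

Because Spark configs travel in the connection string, each tenant can tune AQE for a session without changing the shared gateway's defaults.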

Optimization Practices

To improve data‑output quality, the presentation advocates replacing traditional Distribute By + Local Sort with a Rebalance + Z‑Order strategy under AQE, achieving:

Reduced small‑file generation and better file size alignment (≈200‑300 MB).

Higher compression ratios and improved data skipping for downstream engines.

Cross‑engine query performance gains (Spark, Impala, Hive).

Internal benchmarks show significant reductions in output data volume and file count, as well as consistent query speedups across engines.
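
The data-skipping benefit of Z-Order can be seen in a small pure-Python sketch. It is a toy bit-interleaving model, not Kyuubi's implementation, and the function names and the 4 x 4 grid of rows are illustrative assumptions. The sketch compares the per-file min/max statistics produced by a plain sort on the first column with those produced by a Z-Order sort.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # bit i of x goes to an even position
        z |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y goes to an odd position
    return z


def file_min_max(rows, rows_per_file):
    """Cut rows into consecutive files; return ((min_x, max_x), (min_y, max_y)) per file."""
    stats = []
    for i in range(0, len(rows), rows_per_file):
        chunk = rows[i:i + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        stats.append(((min(xs), max(xs)), (min(ys), max(ys))))
    return stats


rows = [(x, y) for x in range(4) for y in range(4)]

linear = file_min_max(sorted(rows), 4)
zorder = file_min_max(sorted(rows, key=lambda r: z_value(*r)), 4)

# Linear sort: every file spans y = 0..3, so a filter on y skips nothing.
print(linear)  # [((0, 0), (0, 3)), ((1, 1), (0, 3)), ((2, 2), (0, 3)), ((3, 3), (0, 3))]
# Z-Order: each file covers a 2 x 2 block, so a filter on x or y skips half the files.
print(zorder)  # [((0, 1), (0, 1)), ((2, 3), (0, 1)), ((0, 1), (2, 3)), ((2, 3), (2, 3))]
```

With a linear sort, only predicates on the leading column benefit from min/max pruning. Z-Order trades some of that precision for usable statistics on every clustered column, which is what lets downstream engines skip files regardless of which column they filter on.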

Q&A Highlights

Kyuubi replaces Spark Thrift Server, not Spark itself.

Security plugins (Ranger) and AQE extensions are available out‑of‑the‑box.

Rebalance + Z-Order has been open source since Kyuubi 1.4.

Scala code can be executed in interactive sessions as of Kyuubi 1.5, and batch JAR submission is planned for 1.6.
