Apache Spark Adaptive Query Execution and Kyuubi Optimization Practices for Data Warehousing
This article presents a detailed overview of Apache Spark's Adaptive Query Execution evolution, its optimization techniques, and performance gains, followed by an in‑depth discussion of Apache Kyuubi's architecture, security integrations, cloud‑native capabilities, and practical Rebalance + Z‑Order strategies that enhance data‑warehouse task efficiency and query performance.
Part 1: Apache Spark Adaptive Query Execution (AQE)
The talk begins with the history of AQE, tracing it from a simple, bug‑prone prototype in Spark 2.x to Spark 3.0, where Intel introduced a richer framework that enables shuffle‑reader optimizations such as skew‑join handling and the local shuffle reader.
Key optimizations include:
Coalescing small reduce partitions to reduce task count and eliminate empty partitions.
Dynamic join strategy selection (broadcast hash join, shuffled hash join, or sort‑merge join) based on runtime statistics, improving join performance by 2‑30%.
Optimizing skewed reduce partitions by splitting oversized partitions into smaller pieces so that work is balanced across tasks.
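The first and third optimizations can be illustrated with a small, self‑contained sketch. This is not Spark's implementation: the function names are hypothetical, and the logic is a simplified model of greedy coalescing of adjacent partitions plus a median‑based skew check. The factor of 5 and the 256 MB threshold mirror the documented defaults of spark.sql.adaptive.skewJoin.skewedPartitionFactor and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes.

```python
from statistics import median
from typing import List

MB = 1024 * 1024


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Greedily merge adjacent reduce partitions into groups of about `target` bytes.

    Returns lists of original partition indices; each list becomes one task.
    Empty partitions are absorbed into a neighbouring group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


def skewed_partitions(sizes: List[int],
                      factor: float = 5.0,
                      threshold: int = 256 * MB) -> List[int]:
    """Return indices of partitions that are both far above the median and large."""
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > threshold]


# Eight shuffle partitions (three of them empty) collapse into three tasks.
sizes_mb = [10, 0, 20, 0, 0, 70, 5, 5]
groups = coalesce_partitions([s * MB for s in sizes_mb], target=64 * MB)

# Nine 100 MB partitions and one 2000 MB partition: only the last is skewed.
skewed = skewed_partitions([100 * MB] * 9 + [2000 * MB])
```

In this model a partition that is already larger than the target stays in its own group; splitting it further is the job of the skew handling step.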
NetEase engineers have contributed more than 40 patches to the Spark community across the Spark 3.1 and 3.2 releases, helping AQE become enabled by default in Spark 3.2; the talk reports close to a 100% performance improvement on TPC‑DS benchmarks.
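On clusters where AQE is not already switched on, the relevant settings can be passed explicitly. The sketch below only assembles spark-submit arguments from a dictionary; the configuration keys are standard Spark SQL settings, while the helper name and the chosen values are illustrative assumptions rather than recommendations from the talk.

```python
from typing import Dict, List

# Standard Spark SQL configuration keys related to AQE.
# The values shown are illustrative, not tuned recommendations.
AQE_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.localShuffleReader.enabled": "true",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64m",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(AQE_CONF)
```

The same key/value pairs can be placed in spark-defaults.conf instead of being passed on the command line.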
Part 2: Kyuubi + Spark for Data‑Warehouse Tasks
Kyuubi provides a multi‑tenant, cloud‑native gateway for Spark. It exposes Thrift, JDBC, and REST interfaces, supports Kerberos and LDAP authentication, and offers extensions for Ranger‑based row‑ and column‑level security.
Security features include Kerberos integration (keytab and proxy), long‑lived proxy tokens, and fine‑grained data masking.
The architecture enables routing to both Kubernetes and YARN clusters, allowing flexible deployment.
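Because Kyuubi speaks the HiveServer2 Thrift protocol, clients typically connect with a HiveServer2‑style JDBC URL. The helper below is a minimal sketch of how such a URL might be assembled for a Kerberos‑secured gateway with a proxy user. The function name, host, realm, and user are hypothetical; port 10009 is Kyuubi's default frontend port, and `principal` and `hive.server2.proxy.user` follow the HiveServer2 JDBC URL convention.

```python
from typing import Optional


def kyuubi_jdbc_url(host: str,
                    port: int = 10009,
                    database: str = "default",
                    principal: Optional[str] = None,
                    proxy_user: Optional[str] = None) -> str:
    """Build a HiveServer2-style JDBC URL for a Kyuubi gateway.

    Session variables are appended after the database, separated by ';'.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if principal:
        # Kerberos service principal of the server, not of the client.
        url += f";principal={principal}"
    if proxy_user:
        # Ask the server to run the session on behalf of this user.
        url += f";hive.server2.proxy.user={proxy_user}"
    return url


# Hypothetical host, realm, and user, for illustration only.
url = kyuubi_jdbc_url(
    "kyuubi.example.com",
    principal="kyuubi/_HOST@EXAMPLE.COM",
    proxy_user="alice",
)
```

Whether proxying is permitted depends on the server's impersonation settings, so the proxy user part should be treated as optional.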
Optimization Practices
To improve the quality of written data, the presentation advocates replacing the traditional Distribute By + Local Sort pattern with a Rebalance + Z‑Order strategy under AQE, which achieves:
Reduced small‑file generation and better file size alignment (≈200‑300 MB).
Higher compression ratios and improved data skipping for downstream engines.
Cross‑engine query performance gains (Spark, Impala, Hive).
Internal benchmarks show significant reductions in output data volume and file count, as well as consistent query speedups across engines.
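Why Z‑Order helps data skipping can be seen in a toy model. The sketch below uses the textbook bit‑interleaving definition of a Z‑value for two small non‑negative integer columns; it is not Kyuubi's implementation, which has to handle arbitrary column types, and all names are hypothetical. Sixteen rows are written into four "files", and the code counts how many files a min/max filter would have to read under a plain sort versus a Z‑Order sort.

```python
from itertools import product
from typing import List, Tuple

Row = Tuple[int, int]  # (x, y)


def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x takes the even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def split_files(rows: List[Row], rows_per_file: int) -> List[List[Row]]:
    """Cut an ordered row list into consecutive fixed-size files."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_touched(files: List[List[Row]], column: int, value: int) -> int:
    """Count files whose min/max range on `column` could contain `value`."""
    count = 0
    for f in files:
        vals = [r[column] for r in f]
        if min(vals) <= value <= max(vals):
            count += 1
    return count


rows = list(product(range(4), range(4)))

# Plain sort on (x, y): great for filters on x, useless for filters on y.
linear = split_files(sorted(rows), 4)

# Z-Order sort: each file covers a 2x2 block, so both columns can be skipped.
zorder = split_files(sorted(rows, key=lambda r: z_value(r[0], r[1])), 4)

linear_x, linear_y = files_touched(linear, 0, 2), files_touched(linear, 1, 2)
zorder_x, zorder_y = files_touched(zorder, 0, 2), files_touched(zorder, 1, 2)
```

In this toy case the plain sort reads 1 file for a filter on x but all 4 for a filter on y, while the Z‑Order layout reads 2 files for either column. That balanced behaviour is the property downstream engines exploit when they prune files by column statistics.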
Q&A Highlights
Kyuubi replaces Spark Thrift Server, not Spark itself.
Security plugins (Ranger) and AQE extensions are available out‑of‑the‑box.
Rebalance + Z‑Order has been open source since Kyuubi 1.4.
Scala code execution is supported through interactive sessions (since 1.5), with batch Jar submission planned for 1.6.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.