Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms

This article shares detailed practical experiences from DingXiangYuan's big‑data platform on using Apache Kyuubi and Apache Celeborn, covering architecture, flexible configuration, AuthZ fine‑grained permissions, small‑file and Z‑Order optimizations, Arrow‑based large result transmission, and operational tips such as connection‑level issues and Netty cache handling.

DataFunTalk

The presentation introduces DingXiangYuan's big‑data foundation platform, which is built on Apache Kyuubi and Apache Celeborn. It has two main parts: Kyuubi as the unified entry point for Spark, and Celeborn as an efficient shuffle service.

Kyuubi Overview – Kyuubi provides a single entry point for Spark programs. It supports Hive Beeline, a RESTful API, multi‑tenant isolation, and various plugins (Z‑Order, small‑file merge, lineage, audit). Its engine share levels, which control how Spark engines are shared across connections and users, improve YARN resource utilization.

Flexible Configuration – There are four ways to adjust parameters: global config via kyuubi-defaults.conf or spark-defaults.conf, a JDBC URL suffix, SET syntax at runtime, and the org.apache.kyuubi.plugin.SessionConfAdvisor plugin for session‑level overrides. Configurations fall into three categories: Kyuubi‑controlled Spark launch parameters, static Spark engine parameters, and dynamic Spark runtime parameters.
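
The layering of these sources can be sketched in a few lines of Python. This is a simplified model, not Kyuubi's implementation: the host name is a placeholder, the helper names are hypothetical, and the precedence order (global defaults, then JDBC URL suffix, then SET) is an assumption that only holds for dynamic runtime parameters, since static engine parameters are fixed when the engine launches.

```python
def kyuubi_jdbc_url(host, port, database="default", session_conf=None):
    """Build a JDBC URL whose '#' suffix carries per-session configs."""
    base = f"jdbc:hive2://{host}:{port}/{database}"
    if not session_conf:
        return base
    pairs = ";".join(f"{key}={value}" for key, value in session_conf.items())
    return f"{base};#{pairs}"


def effective_conf(global_conf, url_conf, set_conf):
    """Merge config layers; later layers win (assumed precedence)."""
    merged = dict(global_conf)   # e.g. values from the global defaults files
    merged.update(url_conf)      # values passed in the JDBC URL suffix
    merged.update(set_conf)      # values issued with SET at runtime
    return merged


# Hypothetical host; 10009 is Kyuubi's default frontend port.
url = kyuubi_jdbc_url(
    "kyuubi.example.com", 10009,
    session_conf={"spark.sql.shuffle.partitions": "200"},
)
conf = effective_conf(
    {"spark.sql.shuffle.partitions": "400", "spark.sql.adaptive.enabled": "true"},
    {"spark.sql.shuffle.partitions": "200"},
    {"spark.sql.adaptive.enabled": "false"},
)
```

Here the URL suffix overrides the global partition count for one session, and a later SET flips a runtime flag without touching other sessions.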

AuthZ Plugin – Provides fine‑grained table/column‑level control, row‑level control, and data masking. It integrates with Spark Catalyst: table/column rules are applied in the Optimizer stage, while row‑level filtering and masking are applied in the Analyzer stage. Policies are currently fetched from Ranger.
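
To make the three kinds of control concrete, here is a toy model in plain Python. It is only an illustration of the semantics: the real plugin rewrites Catalyst plans and evaluates Ranger policies, whereas the grant table, filter, and mask function below are hypothetical stand-ins.

```python
def authorize_columns(user, table, columns, grants):
    """Table/column check: fail if any requested column is not granted."""
    allowed = grants.get((user, table), set())
    denied = [col for col in columns if col not in allowed]
    if denied:
        raise PermissionError(f"{user} lacks SELECT on {table}: {denied}")


def mask_keep_last_4(value):
    """Data masking: hide everything except the last four characters."""
    text = str(value)
    return "x" * max(len(text) - 4, 0) + text[-4:]


def apply_row_filter_and_mask(rows, row_filter, masks):
    """Row-level control plus masking, applied to dict-shaped rows."""
    result = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level policy hides this row entirely
        result.append({
            col: masks[col](val) if col in masks else val
            for col, val in row.items()
        })
    return result


# Hypothetical policy data for a user "alice" on a table "users".
grants = {("alice", "users"): {"city", "phone"}}
rows = [
    {"city": "Hangzhou", "phone": "13800001234"},
    {"city": "Beijing", "phone": "13900005678"},
]
authorize_columns("alice", "users", ["city", "phone"], grants)
visible = apply_row_filter_and_mask(
    rows,
    row_filter=lambda row: row["city"] == "Hangzhou",
    masks={"phone": mask_keep_last_4},
)
```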

Small‑File and Z‑Order Optimizations – The small‑file optimization uses Spark AQE to split large reduce tasks and merge small files. Z‑Order sorting reduces the number of files scanned (e.g., from 9 to 7 for a point query) and improves compression. The talk discusses three Z‑Value calculation schemes, and experiments show significant storage reduction and query‑scan improvements.
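
Why Z‑Order helps file skipping can be shown with a small sketch. It uses plain bit interleaving, one common way to compute a Z‑value and not necessarily any of the three schemes from the talk, on a toy 4×4 grid split into four "files" with min/max statistics.

```python
from itertools import product


def z_value(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def chunk(rows, size):
    """Split sorted rows into fixed-size 'files'."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_scanned(files, column, value):
    """Count files whose min/max range on `column` may contain `value`."""
    hits = 0
    for rows in files:
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            hits += 1
    return hits


points = list(product(range(4), range(4)))  # (x, y) pairs
linear_files = chunk(sorted(points), 4)     # sorted by x, then y
zorder_files = chunk(sorted(points, key=lambda p: z_value(p[0], p[1])), 4)

for name, files in (("linear", linear_files), ("z-order", zorder_files)):
    print(name, "x=1:", files_scanned(files, 0, 1),
          "y=1:", files_scanned(files, 1, 1))
```

In this toy setup the linear sort is ideal for filters on x but forces a full scan for filters on y, while the Z‑Order layout prunes half of the files for either column.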

Connection‑Level Issues – At the connection share level, high concurrency can cause CPU and memory spikes on the Kyuubi server. The remedies are to raise the resource limits of the Kyuubi server pod and to control concurrency on the client side. A reference to the related GitHub issue is provided.
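
Client‑side concurrency control can be as simple as a semaphore around the code that opens a connection and runs a query. The sketch below is one possible approach, not the one used in the talk; `fake_query` is a hypothetical stand‑in for real JDBC or REST calls, and the limit of 3 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConnectionLimiter:
    """Caps how many callers may hold a server connection at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, for monitoring

    def run(self, open_and_query, *args):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return open_and_query(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_query(i):
    """Hypothetical placeholder for opening a session and running SQL."""
    time.sleep(0.01)
    return i * 2


limiter = ConnectionLimiter(max_concurrent=3)
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda i: limiter.run(fake_query, i), range(20)))
```

Even with 16 worker threads submitting work, no more than three queries reach the server at the same time.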

Arrow Large Result Transmission – Kyuubi 1.7.0 adds Arrow serialization, enabled with kyuubi.operation.result.format=arrow. Arrow reduces copy overhead and moves serialization to the executors, achieving up to 2× the performance of Thrift for large result sets.

Celeborn vs. External Shuffle Service – Celeborn offers asynchronous push, flush, commit, and fetch, yielding 7%–20% performance gains over ESS. The pipeline feature further improves throughput but may cause illegal memory access.
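
For orientation, switching a Spark job from ESS to Celeborn is mostly a matter of Spark configuration. The sketch below lists settings shown in Celeborn's deployment documentation and turns them into `spark-submit` arguments. The master endpoint is a placeholder, and the exact set of keys should be checked against the documentation for the Celeborn version in use.

```python
# Assumed values: the endpoint host is a placeholder, not from the talk.
CELEBORN_SPARK_CONF = {
    "spark.shuffle.manager":
        "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
    "spark.celeborn.master.endpoints": "celeborn-master.example.com:9097",
    # The external shuffle service is turned off when Celeborn handles shuffle.
    "spark.shuffle.service.enabled": "false",
}


def to_spark_submit_args(conf):
    """Render a config dict as repeated '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_spark_submit_args(CELEBORN_SPARK_CONF)
```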

Additional Features – Other points covered include disabling the Netty cache to reduce memory usage, stage recomputation when workers fail, local shuffle reads when Celeborn workers and Spark share a cluster, Hadoop MapReduce support, a memory storage proposal, authentication, and support for Scala 2.13 and JDK 17.

The article concludes with a thank‑you note and references to related talks and resources.
