Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms

This article shares detailed practical experiences from DingXiangYuan's big‑data platform on using Apache Kyuubi and Apache Celeborn, covering architecture, flexible configuration, AuthZ fine‑grained permissions, small‑file and Z‑Order optimizations, Arrow‑based large result transmission, and operational tips such as connection‑level issues and Netty cache handling.

DataFunTalk

Jul 23, 2024

The presentation introduces DingXiangYuan's big‑data foundation platform built on Apache Kyuubi and Apache Celeborn, outlining two main parts: Kyuubi for unified Spark entry and Celeborn for efficient shuffle services.

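Kyuubi Overview – Kyuubi provides a unified entry point for Spark programs. It supports Hive Beeline and a RESTful API, offers multi‑tenant isolation, and ships various plugins (Z‑Order, small‑file merge, lineage, audit). It improves YARN resource utilization through engine share levels (CONNECTION, USER, GROUP, SERVER), which determine how widely a Spark engine is reused across sessions.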

Flexible Configuration – There are four ways to adjust parameters: global config via kyuubi-defaults.conf or spark-defaults.conf, a JDBC URL suffix, SET syntax at runtime, and the org.apache.kyuubi.plugin.SessionConfAdvisor plugin for session‑level overrides. Configurations fall into three categories: Spark launch parameters controlled by Kyuubi, static Spark engine parameters, and dynamic Spark runtime parameters.
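As a rough illustration of the JDBC URL suffix and runtime SET options, the sketch below assembles a connection URL and a SET statement. It assumes the Hive‑style URL layout in which configs follow `?` and are separated by `;`; check the driver documentation for the exact form your client expects. The host name, the helper names build_kyuubi_jdbc_url and set_statement, and the chosen values are hypothetical.

```python
from typing import Mapping, Optional


def build_kyuubi_jdbc_url(
    host: str,
    port: int,
    database: str = "default",
    session_confs: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a JDBC URL with session-level configs appended as a suffix.

    Assumption: configs go after '?' as key=value pairs joined by ';'.
    """
    base = f"jdbc:hive2://{host}:{port}/{database}"
    if not session_confs:
        return base
    suffix = ";".join(f"{key}={value}" for key, value in session_confs.items())
    return f"{base}?{suffix}"


def set_statement(key: str, value: str) -> str:
    """Render a runtime SET statement for a dynamic Spark parameter."""
    return f"SET {key}={value}"


if __name__ == "__main__":
    # Hypothetical host and values, for illustration only.
    print(build_kyuubi_jdbc_url(
        "kyuubi.example.com",
        10009,
        session_confs={
            "spark.executor.memory": "4g",
            "kyuubi.engine.share.level": "CONNECTION",
        },
    ))
    print(set_statement("spark.sql.shuffle.partitions", "400"))
```

Launch‑time and static engine parameters such as executor memory only make sense in the URL or the defaults files, since the engine is already running by the time a SET statement executes; SET is suited to dynamic runtime parameters such as spark.sql.shuffle.partitions.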

AuthZ Plugin – Provides table/column level fine‑grained control, row‑level control, and data masking. It integrates with Spark Catalyst: table/column rules in the Optimizer stage, row‑level and masking in the Analyzer stage, and currently fetches policy from Ranger.

Small‑File and Z‑Order Optimizations – Uses Spark AQE to split large reduce tasks and merges small files. Z‑Order sorting reduces scanned files (e.g., from 9 to 7 for a point query) and improves compression. Three Z‑Value calculation schemes are discussed, and experiments show significant storage reduction and query‑scan improvements.
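To make the idea of a Z‑Value concrete, here is a minimal sketch of plain bit interleaving for two non‑negative integer columns. This is a common textbook way to compute a Z‑Value and is not necessarily one of the three schemes covered in the talk; the function names and the 16‑bit width are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-value of (x, y).

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1,
    so rows that are close in both columns get nearby Z-values.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z_order_sort(rows: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort (x, y) rows by their interleaved Z-value."""
    return sorted(rows, key=lambda row: interleave_bits(row[0], row[1]))


if __name__ == "__main__":
    for row in z_order_sort([(1, 1), (0, 0), (1, 0), (0, 1)]):
        print(row, interleave_bits(*row))
```

Because sorted rows are written out in Z‑Value order, each output file covers a narrow range in both columns, which is what lets min/max file statistics skip files for a filter on either column.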

Connection‑Level Issues – High concurrency can cause CPU and memory spikes on the Kyuubi server. The fix has two parts: raising the resource limits of the Kyuubi server pod and capping concurrency on the client side. The talk also points to a related GitHub issue.
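Client‑side concurrency control can be as simple as a semaphore around whatever opens a connection and runs a query. The sketch below is one possible shape, not the platform's actual implementation: ConnectionLimiter and fake_query are hypothetical names, and fake_query stands in for a real JDBC or REST call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConnectionLimiter:
    """Cap the number of tasks that may hold a connection at the same time."""

    def __init__(self, max_concurrent: int) -> None:
        self._semaphore = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed, useful for monitoring

    def run(self, task, *args):
        # Block here until one of the max_concurrent slots is free.
        with self._semaphore:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return task(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_query(sql: str) -> str:
    """Hypothetical stand-in for opening a session and running a query."""
    time.sleep(0.01)
    return f"done: {sql}"


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_concurrent=3)
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(
            pool.map(lambda sql: limiter.run(fake_query, sql),
                     [f"q{i}" for i in range(20)])
        )
    print(len(results), "queries finished, peak concurrency:", limiter.peak)
```

Callers can submit as many tasks as they like; only the configured number reach the server at once, which smooths the connection bursts that drive the CPU and memory spikes.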

Arrow Large Result Transmission – Kyuubi 1.7.0 adds Arrow serialization via kyuubi.operation.result.format=arrow. Arrow reduces copy overhead and moves serialization to executors, achieving up to 2× performance over Thrift for large result sets.

Celeborn vs. External Shuffle Service – Celeborn performs push, flush, commit, and fetch asynchronously, yielding 7%–20% performance gains over ESS. The pipeline feature improves throughput further but may cause illegal memory access.

Additional Features – Disabling the Netty cache to reduce memory usage, stage recompute to recover from worker failures, local shuffle read when Celeborn workers and Spark share a cluster, Hadoop MapReduce support, a memory storage proposal, authentication, and support for Scala 2.13 and JDK 17.

The article concludes with a thank‑you note and references to related talks and resources.
