
Apache Celeborn: Overview, Architecture, Community, and Future Roadmap

This article introduces Apache Celeborn, explains the challenges of intermediate shuffle data in large-scale compute engines, details its core architecture and design (Master, Worker, Lifecycle Manager, and Shuffle Client), reviews the community history and version releases, compares its performance with Spark ESS, surveys real-world deployment scenarios, and outlines future development plans.

DataFunTalk

Apache Celeborn is an open‑source project focused on handling intermediate shuffle data generated by large‑scale data processing engines such as Spark and Flink, aiming to improve the efficiency and stability of data flow.

The article first describes the challenges of intermediate data, including the high IOPS caused by many small random reads during the shuffle read phase and the storage pressure on local disks, which hinder elastic scaling and resource release in cloud‑native environments.

Celeborn’s architecture follows a server-client model: a Master node, replicated via Raft, manages cluster state and load balancing; Worker nodes store and serve shuffle data on local disks or DFS; a Lifecycle Manager handles shuffle metadata for each application; and a Shuffle Client embedded in each executor or task manager pushes and pulls data.

Key design points include fault tolerance (push retries with commit-time verification) and exactly-once guarantees: every pushed batch carries a three-field header (MapId, AttemptId, BatchId), which lets the read path detect and discard duplicate data caused by retries.
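The deduplication idea can be sketched as follows. This is a hypothetical, simplified illustration (the class and method names are mine, not Celeborn's actual code): the reader remembers every (MapId, AttemptId, BatchId) triple it has already delivered and silently drops any re-pushed copy, so a retried push cannot produce duplicate records downstream.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of header-based duplicate filtering on the read path.
// Each pushed batch is uniquely identified by (mapId, attemptId, batchId);
// dropping already-seen triples gives exactly-once delivery even when the
// client retries a push after a timeout or worker failure.
public class BatchDeduper {
    // Java record used as a value-type key for the three header fields.
    record BatchKey(int mapId, int attemptId, int batchId) {}

    private final Set<BatchKey> seen = new HashSet<>();

    /** Returns true if this batch is new and should be handed to the engine. */
    public boolean accept(int mapId, int attemptId, int batchId) {
        // Set.add returns false when the key was already present,
        // i.e. when this batch is a duplicate from a retried push.
        return seen.add(new BatchKey(mapId, attemptId, batchId));
    }
}
```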

The Shuffle Client provides four APIs: pushData (which implicitly performs registerShuffle on first use), mapperEnd, readPartition, and unregisterShuffle. These let compute engines integrate Celeborn without engine-specific dependencies.
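The shape of these four calls can be illustrated with a hypothetical in-memory mock; the method names follow the article, but the signatures and buffering logic are illustrative assumptions, not Celeborn's actual Java API (real workers persist pushed data to local disks or DFS):

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a shuffle client, showing the four
// operations a compute engine would call during a shuffle.
public class InMemoryShuffleClient {
    // partitionId -> concatenated pushed bytes (real workers write to disk/DFS).
    private final Map<Integer, ByteArrayOutputStream> partitions = new HashMap<>();

    // Pushes one serialized batch for (mapId, attemptId) to partitionId;
    // in Celeborn the first push also implicitly registers the shuffle.
    public void pushData(int shuffleId, int mapId, int attemptId,
                         int partitionId, byte[] data) {
        partitions.computeIfAbsent(partitionId, p -> new ByteArrayOutputStream())
                  .writeBytes(data);
    }

    // Marks the map task as finished so its pushed data can be committed.
    public void mapperEnd(int shuffleId, int mapId, int attemptId) {
        // No-op in this mock; a real client triggers commit verification here.
    }

    // Returns all bytes committed to one reduce partition.
    public byte[] readPartition(int shuffleId, int partitionId) {
        ByteArrayOutputStream buf = partitions.get(partitionId);
        return buf == null ? new byte[0] : buf.toByteArray();
    }

    // Releases all shuffle state once the stage no longer needs it.
    public void unregisterShuffle(int shuffleId) {
        partitions.clear();
    }
}
```

Note how the reducer only needs the partition id to read: because workers aggregate each partition's data, the many small random reads of the traditional shuffle become a few large sequential ones.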

Performance evaluations show Celeborn outperforms Spark’s External Shuffle Service (ESS) in shuffle read throughput, especially at larger scales, while maintaining comparable write performance.

The community timeline covers its origin in Alibaba Cloud (2020), open‑source release (Dec 2021), donation to the Apache Software Foundation (Oct 2022), and subsequent releases (0.1, 0.2, 0.3) adding K8s deployment, support for Flink batch, native Spark, HDFS storage, and graceful upgrades.

Deployment models include mixed deployment with existing HDFS/YARN clusters, independent Celeborn clusters, and fully separated compute, Celeborn, and storage clusters, each offering different trade‑offs in isolation and scalability.

Real‑world production usage spans multiple large enterprises handling petabytes of shuffle data and tens of thousands of jobs daily.

Future plans focus on expanding the community, attracting more users (especially overseas), and adding features such as Spill/Cache support, integration with more engines (Tez, Trino, Ray, etc.), multi‑layer storage enhancements, security isolation, stage rerun, better AQE support, and a C++ SDK for native integrations.

Tags: Data Engineering, Big Data, Flink, Performance Evaluation, Spark, Apache Celeborn, Shuffle Service
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
