Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili
Bilibili replaced Spark’s unstable External Shuffle Service with a push‑based approach, then deployed Apache Celeborn’s remote shuffle service on Kubernetes with HA masters, tiered worker pools, extensive monitoring, history‑based routing, chaos testing, and seamless Spark, Flink, and MapReduce integration. Planned work covers self‑healing, elastic scaling, and priority‑aware I/O enhancements.
Background: With the rapid growth of Bilibili’s business, data volume has increased exponentially and the compute cluster has expanded from a single data center to a multi‑data‑center deployment. The big‑data platform relies on Spark, Flink, Presto and Hive, running over 300,000 Spark jobs daily and handling more than 30 PB of shuffle data. Shuffle stability is critical for the reliability and performance of offline jobs.
Early Local Shuffle: Bilibili initially used Spark’s External Shuffle Service (ESS). In production, ESS showed poor stability due to shared network bandwidth, high connection counts, random‑read I/O bottlenecks, lack of elastic scaling, and loss of shuffle data when the shuffle service crashed.
Push‑Based Shuffle Evolution: To reduce fetch‑failed exceptions, a push‑based shuffle was adopted. Its advantages include better support for Adaptive Query Execution (AQE) and reduced memory pressure. Drawbacks are write amplification and cold‑start problems.
Shuffle Service Master: A master node was introduced to register shuffle nodes, perform heartbeat monitoring, and dynamically allocate slots, alleviating cold‑start issues and improving performance. Dynamic task profiling enables RSS only for large tasks, achieving a 25 % reduction in average execution time for big jobs.
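The profiling gate described above can be sketched as follows. This is a minimal illustration, not Bilibili’s actual code: the threshold value and function names are assumptions, and in practice the estimate would come from historical run metadata.

```python
# Hypothetical sketch of the dynamic-profiling gate: remote shuffle (RSS)
# is enabled only when a job's estimated shuffle volume, taken from
# historical runs, exceeds a configurable "large job" threshold.
DEFAULT_THRESHOLD_BYTES = 500 * 1024**3  # illustrative 500 GiB cut-off

def should_use_remote_shuffle(estimated_shuffle_bytes: int,
                              threshold_bytes: int = DEFAULT_THRESHOLD_BYTES) -> bool:
    """Route a job to the remote shuffle service only if its historical
    shuffle footprint marks it as a large job."""
    return estimated_shuffle_bytes >= threshold_bytes
```

Small jobs keep the cheaper local shuffle path, which is how selective enablement yields the reported speedup without paying RSS overhead everywhere.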
Celeborn Overview: Apache Celeborn, donated by Alibaba Cloud, is an open‑source Remote Shuffle Service that decouples shuffle data from compute nodes. It follows a client‑server architecture with Master, Worker, and Plugin components. The Master manages cluster state via Raft, Workers store and serve shuffle data with multi‑layer storage, and the Shuffle Client integrates with Spark, Flink and MapReduce.
Production Deployment: Celeborn clusters are deployed on Kubernetes with separate Master and Worker farms. Three Master clusters (each with three nodes) provide HA. Four Worker clusters (total 270 workers) use 6 TB NVMe SSDs, offering 50 Gbps network per machine and high I/O throughput. Workers are split into low‑priority and high‑priority pools to isolate workloads.
Monitoring & Meta‑Warehouse: Celeborn exposes metrics via Dropwizard, exporting to HTTP, JMX, CSV or Prometheus. Metrics are collected, sent to Kafka, and stored in StarRocks to build a meta‑warehouse that provides cluster‑level and job‑level resource consumption views.
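As a sketch of the first hop in that pipeline, the snippet below parses lines in the Prometheus text exposition format (as scraped from a Celeborn metrics endpoint) into records that could then be produced to Kafka and loaded into StarRocks. The record shape and metric name are illustrative assumptions.

```python
# Parse one line of Prometheus text exposition format into
# (metric_name, labels_dict, value); comments and blanks yield None.
import re

LINE_RE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+([0-9.eE+-]+)$')

def parse_prometheus_line(line: str):
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    m = LINE_RE.match(line)
    if not m:
        return None
    name, label_blob, value = m.groups()
    labels = {}
    if label_blob:
        for pair in label_blob.split(','):
            k, v = pair.split('=', 1)
            labels[k] = v.strip('"')
    return name, labels, float(value)
```

A collector would apply this per scrape, attach cluster and timestamp fields, and emit the records to Kafka for the meta‑warehouse.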
Intelligent Routing: History‑Based Optimization (HBO) routes jobs to high‑ or low‑priority clusters based on historical shuffle traffic, job priority and current cluster load, ensuring baseline tasks receive sufficient resources.
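The routing decision can be sketched as a small policy function. The thresholds, load model, and field names here are illustrative assumptions, not Bilibili’s actual HBO rules.

```python
# Hedged sketch of History-Based Optimization routing: choose the
# high-priority or low-priority worker pool from historical shuffle
# volume, declared job priority, and current load of the high pool.
def route_cluster(historical_shuffle_gb: float,
                  is_baseline_job: bool,
                  high_pool_load: float) -> str:
    """Return 'high' or 'low' for the target worker pool."""
    if is_baseline_job and high_pool_load < 0.85:
        return 'high'  # baseline (SLA) jobs get the isolated pool while it has headroom
    if historical_shuffle_gb > 1024 and high_pool_load < 0.60:
        return 'high'  # very large shuffles also benefit from the faster pool
    return 'low'
```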
Diagnosis & Governance: A diagnostic dashboard visualizes shuffle blockage rates, blocked time, and affected jobs. A shuffle stall is flagged when read/write blocked time exceeds 3 minutes and the blocked rate exceeds 20 %.
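The stall rule above is simple enough to state directly in code; the constant names are illustrative, the thresholds are the article’s.

```python
# A shuffle is flagged as stalled when read/write blocked time exceeds
# 3 minutes AND the blocked rate exceeds 20%.
STALL_BLOCKED_SECONDS = 3 * 60
STALL_BLOCKED_RATE = 0.20

def is_shuffle_stalled(blocked_seconds: float, blocked_rate: float) -> bool:
    return (blocked_seconds > STALL_BLOCKED_SECONDS
            and blocked_rate > STALL_BLOCKED_RATE)
```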
Chaos Testing: A chaos framework (Scheduler, Runner, CLI) injects failures such as killing Master/Worker processes, disk I/O hangs, CPU overload, and metadata corruption to validate Celeborn’s resilience.
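A minimal sketch of the Scheduler/Runner split, under heavy assumptions: the fault names mirror the article, but the shell commands are illustrative only, and a real runner would need guards, timeouts, and rollback.

```python
# Hypothetical fault registry for the chaos framework: the Scheduler
# picks a fault, the Runner executes its injection command on a target.
FAULTS = {
    "kill_master":  "pkill -9 -f celeborn.master",
    "kill_worker":  "pkill -9 -f celeborn.worker",
    "cpu_overload": "stress-ng --cpu 0 --timeout 300s",
    # disk I/O hangs and metadata corruption would need a dedicated
    # fault-injection layer rather than a one-line command
}

def plan_injection(fault: str, target_host: str) -> dict:
    """Build an injection plan without executing it (dry run)."""
    if fault not in FAULTS:
        raise ValueError(f"unknown fault: {fault}")
    return {"host": target_host, "fault": fault, "command": FAULTS[fault]}
```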
Engine Integration – Spark: Celeborn Spark client can be shipped independently via --jars or spark.celeborn.client.jar, avoiding tight coupling with Spark images. Issues like classloader mismatches are addressed. Data integrity checks compare Shuffle Records Written vs. Records Read.
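A configuration sketch of that decoupled delivery, expressed as a PySpark‑style config map: the shuffle‑manager class and `spark.celeborn.master.endpoints` key follow Celeborn’s documented Spark 3 integration, but the jar path and endpoints are placeholders; verify the keys against the Celeborn release you deploy.

```python
# Ship the Celeborn Spark client at submit time instead of baking it
# into the Spark image; keys below are Celeborn's documented Spark 3
# integration, values are illustrative placeholders.
celeborn_conf = {
    "spark.jars": "/path/to/celeborn-client-spark-3-shaded.jar",
    "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
    "spark.celeborn.master.endpoints": "celeborn-master-0:9097,celeborn-master-1:9097",
    # Celeborn replaces ESS, so the external shuffle service is disabled
    "spark.shuffle.service.enabled": "false",
}
```

Because the jar arrives via `spark.jars`, the Spark image stays Celeborn‑agnostic, which is what avoids the tight coupling mentioned above.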
Engine Integration – Flink: Celeborn supports Flink’s hybrid shuffle, job manager failover, and plans to add ReducePartition support. Decoupled Flink client jars are also targeted.
Engine Integration – MapReduce: Celeborn integrates with MapReduce through Yarn RMProxy & Federation, automatically routing MapReduce shuffle to Celeborn.
Future Outlook: Plans include fault‑self‑healing, elastic scaling of worker clusters, priority‑aware I/O scheduling, remote spilled data support, and continued community collaboration for performance optimizations and feature extensions.
Bilibili Tech