How Celeborn Transformed Spark Shuffle Performance at ZTO Express
Facing massive daily Spark shuffle volumes and unstable ETL performance, ZTO Express migrated from the community External Shuffle Service to Celeborn's Remote Shuffle Service, achieving higher disk I/O efficiency, better reliability, reduced network connections, and significant reductions in task failures and job latency.
As ZTO Express’s business grew, its data platform relied heavily on Hadoop and Spark‑SQL, running over 120,000 Spark tasks daily and generating more than 6 PB of shuffle data per day. The existing External Shuffle Service (ESS) embedded in NodeManager caused heavy disk I/O, random reads/writes, and network connections proportional to the product of mappers and reducers, leading to frequent shuffle failures and long ETL latency.
Limitations of ESS
ESS writes map output directly to local disks; if a disk fails or a node crashes, that output is lost and the whole stage must be recomputed. The shuffle read phase also creates up to M × N network connections (M mappers × N reducers), saturating disk and network I/O and causing fetch failures.
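The quadratic blow-up is easy to see with a back-of-the-envelope sketch (illustrative only; `ess_connections` and `rss_connections` are hypothetical helpers, not Celeborn APIs):

```python
def ess_connections(mappers: int, reducers: int) -> int:
    # ESS pull model: every reducer opens a fetch stream to every
    # mapper's local shuffle output, so connections grow with M * N.
    return mappers * reducers

def rss_connections(mappers: int, reducers: int) -> int:
    # Push-based model: mappers push to shuffle workers and reducers
    # read merged partition files back, so connections grow with M + N.
    return mappers + reducers

# A modest 1,000 x 1,000 stage: one million fetch connections under ESS
# versus two thousand under the push-based model.
print(ess_connections(1_000, 1_000))  # 1000000
print(rss_connections(1_000, 1_000))  # 2000
```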
Adopting Spark Remote Shuffle Service (RSS) – Celeborn
RSS uses a push‑based model where map outputs are pushed to remote shuffle servers and merged by partition. Reducers then read the merged files from these servers.
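On the Spark side, adoption is mostly a configuration change; a minimal sketch (the shuffle-manager class name varies by Celeborn release and Spark version, and `clb-master:9097` is a placeholder, not ZTO's actual endpoint):

```properties
# Route Spark shuffle through Celeborn instead of ESS.
# Class name differs across Celeborn releases -- verify against your docs.
spark.shuffle.manager              org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints    clb-master:9097
# ESS must be off; Celeborn replaces it.
spark.shuffle.service.enabled      false
```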
Key Benefits of Celeborn
Improved Disk I/O Efficiency: Replaces many small random reads and writes with large sequential writes and reads, boosting throughput.
Enhanced Reliability: Keeps multiple replicas of shuffle data, preventing data loss when a node fails.
Reduced Network Connections: Connections scale with the sum of mapper and reducer counts (M + N) rather than their product (M × N); in practice this roughly halved per-machine connections.
Native Spark 2.x AQE Support: Works with the Intel-based AQE features of Spark 2.3.2, simplifying integration.
Broad Engine Compatibility: Supports Spark, MapReduce, Flink Batch, and other offline engines.
Disk-Load Awareness: Dynamically allocates disk slots across fast and slow disks, balancing I/O across heterogeneous hardware.
Intelligent Flow Control: Implements TCP-like congestion control and credit-based traffic shaping to protect worker memory.
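The disk-load-aware idea above can be sketched in a few lines: weight each disk inversely to its observed service time, so slow disks receive proportionally fewer slots. This is a hypothetical illustration of inverse-latency weighting, not Celeborn's actual slot allocator; `allocate_slots` and the disk names are made up.

```python
def allocate_slots(total_slots: int, avg_disk_time_ms: dict[str, float]) -> dict[str, int]:
    # Weight each disk inversely to its recent average service time, so
    # slower disks receive proportionally fewer shuffle-partition slots.
    weights = {disk: 1.0 / t for disk, t in avg_disk_time_ms.items()}
    total = sum(weights.values())
    return {disk: round(total_slots * w / total) for disk, w in weights.items()}

# A fast SSD takes most of the slots; the slower HDDs take the remainder.
print(allocate_slots(100, {"ssd0": 2.0, "hdd0": 8.0, "hdd1": 10.0}))
```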
Operational Issues and Fixes
During deep testing of Celeborn 0.3.0, several problems were identified and resolved:
Netty Thread‑Local Cache caused memory pressure; patches CELEBORN‑796 and CELEBORN‑897 disabled the cache.
Inconsistent fileInfos and on‑disk shuffle files led to premature deletions; a bug fix was contributed (PR CELEBORN‑1005).
Expired‑app cleanup overlapping with shuffle cleanup deleted active data; adding a 12‑hour age check prevented this (PR CELEBORN‑1046).
Bad disks triggered NPEs during cleanup; filtering out unhealthy disks eliminated the error (PR CELEBORN‑1103).
Driver OOM from massive partition splits was mitigated by raising the split‑size threshold from 1 GB to 10 GB.
Load‑aware slot allocation sometimes excluded machines due to high fetch‑time thresholds; lowering celeborn.worker.diskTime.slidingWindow.minFetchCount to 0 restored balanced slot distribution.
Added JSON metric export to complement native Prometheus support (PR CELEBORN‑1122).
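Two of the fixes above were pure configuration changes; expressed as Celeborn settings (the split-threshold key name shown here follows 0.3.x naming and may differ in other releases):

```properties
# Raise the partition-split threshold from the 1 GB default to 10 GB to
# mitigate driver OOM from massive partition splits (verify key name for
# your Celeborn version).
celeborn.client.shuffle.partitionSplit.threshold       10g
# Disable the minimum fetch-count gate in load-aware slot allocation so
# lightly fetched machines are not excluded from slot distribution.
celeborn.worker.diskTime.slidingWindow.minFetchCount   0
```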
Performance Impact
After deploying Celeborn across the cluster, shuffle-related task failures dropped from ~46,000 to about 20, a reduction of over 2,300×. Network connections per machine fell from ~40,000 to ~20,000. Disk utilization became balanced, with both fast-disk and large-disk machines operating at around 55% usage. Core wide-table jobs finished more than 90 minutes earlier, moving average completion time from 4:30 am to around 3:00 am. Overall, Celeborn markedly improved Spark shuffle efficiency, reliability, and stability, and plans are underway to extend the service to MR and Flink workloads.
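The headline ratios follow directly from the reported figures (a trivial arithmetic check, nothing more):

```python
# Task-failure reduction: ~46,000 before vs ~20 after.
failures_before, failures_after = 46_000, 20
print(failures_before / failures_after)  # 2300.0 -> the "over 2,300x" figure

# Per-machine connections: ~40,000 before vs ~20,000 after.
connections_before, connections_after = 40_000, 20_000
print(connections_before / connections_after)  # 2.0 -> roughly halved
```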
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Zhongtong Tech
Integrating industry and information for digital efficiency, advancing Zhongtong Express's high-quality development through digitalization. This is the public channel of Zhongtong's tech team, delivering internal tech insights, product news, job openings, and event updates. Stay tuned!
