Why Spark on Kubernetes Needs a Remote Shuffle Service—and How It Boosts Performance
This article examines the challenges of running Spark on Kubernetes, introduces the Remote Shuffle Service architecture to overcome shuffle bottlenecks, details EMR on ACK integration, showcases performance gains with Terasort benchmarks, and outlines future cloud‑native big‑data strategies such as mixed‑cluster and serverless deployments.
Cloud‑Native Background and Motivation
In the era of big data, traditional database middleware such as Oracle can no longer meet the demands of digital transformation. Spark has become a preferred batch‑processing engine, and many enterprises are moving online services to Kubernetes, seeking a unified, complete big‑data infrastructure.
EMR Architecture Overview
The classic EMR solution built on ECS consists of multiple layers:
ECS physical resource layer (IaaS).
Data ingestion layer (e.g., Kafka for streaming, Sqoop for batch).
Storage layer (HDFS, OSS, and EMR’s JindoFS cache).
Compute engine layer (Spark, Presto, Flink, etc.).
Data application layer (DataWorks, PAI, Zeppelin, Jupyter).
These layers form a comprehensive open‑source big‑data stack.
Why Move to Kubernetes
Customers increasingly deploy online services on Kubernetes (ACK in Alibaba Cloud) and want a unified, cloud‑native big‑data platform. Desired characteristics include elastic scaling, storage‑compute separation, and mixed workloads (online, batch, AI) on a single platform.
EMR Compute Engine on ACK
EMR on ACK (Alibaba’s Kubernetes service) packages Spark, Presto, and other engines as CRD + Operator components. Service components (e.g., Hive Metastore) and batch components (e.g., Spark) run as native Kubernetes workloads, integrated with Alibaba Cloud services for logging, monitoring, and governance.
Challenges of Spark on Kubernetes
Shuffle bottleneck: traditional shuffle relies on local disks, preventing dynamic resource scaling and causing heavy I/O.
Scheduling and queue management: ensuring high‑throughput job launches without performance bottlenecks.
Object‑store I/O: reading/writing OSS/HDFS incurs high rename/list overhead and bandwidth limits.
Remote Shuffle Service Solution
The Remote Shuffle Service (RSS) decouples shuffle data from executor disks. Executors write shuffle output over the network to RSS, which merges partitions and stores them in a distributed file system. Reducers then read sequential files, eliminating random I/O and improving stability.
Key benefits:
Network‑based shuffle with storage‑compute separation.
DFS replication (2×) removes fetch‑failed retries, stabilizing heavy‑shuffle jobs.
Sequential reads in the reduce phase avoid random I/O, dramatically boosting performance.
Performance Evaluation
Terasort benchmarks (2 TB, 4 TB, 10 TB) show RSS reduces total runtime, especially in the reduce stage. For 10 TB, reduce time drops from 1.6 h to 1 h, because RSS replaces M×N random I/O with N sequential reads.
Additional Optimizations
Scheduler improvements: moving Spark‑Operator logic into the Spark core to avoid webhook overhead.
OSS performance: JindoFS cache and Jindo Job Committer reduce rename/list costs.
Integration of Delta Lake and TPC‑DS optimizations for better Spark performance.
Future Outlook
Upcoming directions include hybrid clusters (fixed + serverless), spot‑instance scheduling for cost reduction, deeper storage‑compute separation via RSS, and stronger queue‑management capabilities to eliminate perceived performance limits.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
