
Design, Implementation, and Performance Evaluation of JD's Remote Shuffle Service for Spark

This article describes JD's research and production deployment of a self‑developed Remote Shuffle Service for Spark, covering its motivations, architectural design, cloud‑native features, monitoring, performance benchmarks against external shuffle solutions, and a real‑world promotion‑period case study that demonstrates improved stability and resource efficiency.

JD Retail Technology

Introduction

The JD Spark computing engine team developed a Remote Shuffle Service (RSS) to address limitations of the community External Shuffle Service (ESS) in large-scale promotional scenarios, aiming to improve performance, stability, and resource utilization.

Problems with ESS

The existing ESS suffers from a tightly coupled architecture (shuffle data is served by the NodeManager co-located with executors), low resource utilization, fragmented shuffle data that forces many small random reads, and poor stability, leading to frequent FetchFailedException errors.

RSS Goals and Challenges

JD's RSS targets:
• Compute-storage separation with elastic high availability.
• Spark-specific optimizations: compatibility with Spark 2.x/3.x, adaptive execution, data-skew handling, pre-merge of map segments, global map-side sort, and fallback to ESS.
• Comprehensive monitoring.
• Cloud-native support via Kubernetes.
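To make the fallback-to-ESS goal concrete, a job might opt into RSS through Spark's pluggable ShuffleManager. This is only a sketch: the ShuffleManager class name and every spark.rss.* property below are illustrative assumptions, since JD's internal names are not public.

```shell
# Hypothetical launch flags for an RSS-backed job. Only spark.shuffle.manager
# is a real Spark property; the rss class and spark.rss.* keys are assumed.
spark-submit \
  --class com.example.ShuffleHeavyJob \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rss.RssShuffleManager \
  --conf spark.rss.storage.dir=hdfs://namenode:8020/rss/data \
  --conf spark.rss.fallback.enabled=true \
  shuffle-heavy-job.jar
```

With a fallback flag like the assumed spark.rss.fallback.enabled, the job could revert to the stock ESS path if the RSS cluster is unreachable, matching the fallback goal stated above.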

Architecture and Implementation

RSS is deployed as an independent cluster service. Shuffle write sends map output directly to RSS, which aggregates data per reduce partition and persists it to remote distributed file systems such as HDFS or ChubaoFS. Shuffle read retrieves the aggregated files directly, eliminating short-lived connections and random-read overhead. The service uses ETCD for service discovery, health checks, and load balancing, and integrates with Spark's Metrics System and JD's internal monitoring/visualization tools.
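The per-reduce-partition aggregation described above can be sketched in a few lines. This is a minimal simulation, not JD's implementation: blocks from every map task are regrouped by reduce partition, so each reducer later makes one sequential read instead of one random read per map task.

```python
from collections import defaultdict

def aggregate_map_outputs(map_outputs):
    """Regroup blocks from all map tasks by reduce partition, so each
    reduce partition becomes a single contiguous byte stream (which the
    real service would persist to HDFS/ChubaoFS as one file)."""
    partitions = defaultdict(list)
    for map_id, blocks in map_outputs.items():
        for reduce_id, payload in blocks:
            partitions[reduce_id].append(payload)
    return {rid: b"".join(chunks) for rid, chunks in partitions.items()}

# Three map tasks each emit a block for two reduce partitions.
outs = {
    "map-0": [(0, b"a0"), (1, b"b0")],
    "map-1": [(0, b"a1"), (1, b"b1")],
    "map-2": [(0, b"a2"), (1, b"b2")],
}
files = aggregate_map_outputs(outs)
# Reducer 0 now reads one aggregated stream instead of three fragments.
print(files[0])  # b'a0a1a2'
```

With ESS, reducer 0 would open three connections and issue three small reads; here it performs one sequential read of the pre-aggregated partition file.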

Key Features

• Data replication via HDFS/CFS block backup.
• Deduplication using block-ID headers and end-to-end checksum verification.
• Node health monitoring and load balancing through ETCD.
• Real-time monitoring, alerting, and reporting dashboards built on Prometheus, Grafana, and an internal TSDB.
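The deduplication and checksum features can be illustrated with a small sketch, again an assumption about the mechanism rather than JD's code: speculative or retried map tasks may deliver the same block twice, so the reader keeps only the first copy of each block ID and verifies a per-block checksum end to end (CRC32 here, chosen only for illustration).

```python
import zlib

def make_block(block_id, payload):
    # Each block carries an ID header and a checksum of its payload.
    return {"id": block_id, "crc": zlib.crc32(payload), "data": payload}

def read_blocks(blocks):
    """Drop duplicate block IDs (e.g. from task retries) and verify
    every surviving payload against its checksum."""
    seen, out = set(), []
    for b in blocks:
        if b["id"] in seen:
            continue  # duplicate delivery; first copy already accepted
        if zlib.crc32(b["data"]) != b["crc"]:
            raise IOError(f"corrupt block {b['id']}")
        seen.add(b["id"])
        out.append(b["data"])
    return b"".join(out)

stream = [
    make_block(("map-0", 0), b"x"),
    make_block(("map-0", 0), b"x"),  # same block resent by a retry
    make_block(("map-1", 0), b"y"),
]
print(read_blocks(stream))  # b'xy'
```

A corrupted payload fails the checksum comparison and raises, which is what lets the service detect damage introduced anywhere between map output and remote storage.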

Performance Evaluation

Benchmarks comparing ESS, JD RSS, and Uber RSS on a 1 TB TPC-DS workload (13 shuffle nodes) show JD RSS is roughly 4% slower than ESS on average, with performance comparable to Uber RSS. The tests confirm the expected trade-off between remote-storage latency and local-disk approaches.

Production Optimization Case

During a major JD promotion, a shuffle-heavy job processing tens of terabytes suffered frequent FetchFailedException failures with ESS, causing unstable runtimes exceeding five hours. After switching to JD RSS, the same job ran stably in about two hours with no fetch failures, reducing resource waste and meeting SLA requirements.

Summary and Outlook

Since its first deployment in June 2019, JD RSS has scaled to 240 nodes, handling over 1000 TB of daily shuffle data and more than 220 Spark applications during major sales events, significantly lowering failure rates and improving performance. Future work includes integrating emerging shuffle innovations, expanding multi-engine support, advancing Spark 3.0 and cloud-native deployments, and exploring open-source and commercial opportunities.

Tags: performance, cloud-native, Big Data, Spark, Remote Shuffle Service, Shuffle Optimization
Written by JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
