
How EMR Serverless Storage Cuts Costs by up to 55% for Shuffle-Heavy Spark Jobs

A comparison of Amazon EMR Serverless Storage on the 3 TB TPC-DS benchmark shows up to 55% lower cost and 25% shorter runtime for shuffle-intensive Spark jobs. The article also covers the feature's usage limits and Python tools for measuring shuffle volume from Spark event logs.

Amazon Cloud Developers

Spark jobs need temporary storage for shuffle data. In environments where the external shuffle service is unavailable, such as Kubernetes or EMR Serverless, shuffle files stay on the executors that wrote them, so Dynamic Resource Allocation (DRA) cannot release an idle executor while its shuffle data may still be needed. Executors are held longer than necessary, which raises cost.

Amazon EMR Serverless Storage, available in EMR 7.12 and later, decouples shuffle data from executors by storing it remotely. DRA can then release idle executors sooner.

Benchmark Setup

We evaluated Amazon EMR Serverless Storage using the TPC‑DS 3 TB benchmark (105 SQL queries). Two EMR Serverless applications were run: one with Serverless Storage enabled and one without. Environment details:

EMR Serverless version: 7.12.0 (arm64, us-east-1)

Dataset: TPCDS‑3TB

Driver: 4 cores, 4 GiB memory

Executor: 4 cores, 8 GiB memory, spark.dynamicAllocation.initialExecutors = 3

Storage: the non‑Serverless application used the default 20 GB per executor (free); the Serverless Storage application required no explicit storage configuration.
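As a rough illustration, the environment above could be expressed as Spark properties and rendered into a spark-submit style parameter string. This is a minimal sketch, not the configuration used in the benchmark: the helper name build_spark_submit_parameters is hypothetical, and setting spark.dynamicAllocation.enabled explicitly is an assumption (the article only lists initialExecutors).

```python
def build_spark_submit_parameters(conf: dict) -> str:
    """Render a dict of Spark properties as a '--conf key=value ...' string."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


# Values taken from the benchmark environment listed above.
# Spark reads "4g"/"8g" as GiB, matching the article's units.
BENCHMARK_CONF = {
    "spark.driver.cores": "4",
    "spark.driver.memory": "4g",
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    # Assumption: DRA is switched on explicitly; the article does not show this line.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.initialExecutors": "3",
}

if __name__ == "__main__":
    print(build_spark_submit_parameters(BENCHMARK_CONF))
```

The resulting string is the kind of value typically passed as sparkSubmitParameters when submitting a job; no Serverless Storage setting is included because the article states none was required.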

Results

Across the full benchmark, cost fell by 15.5% with comparable runtime.

For the 20 queries whose shuffle data ranged from 10 GB to 100 GB, the average cost saving was 13.32% and runtime decreased by 6.5%.

For the 3 queries with 100 GB to 200 GB of shuffle data, the average cost saving was 55.16% and runtime decreased by 25.35%.

Queries with less than 10 GB of shuffle data showed no clear cost or performance advantage.

These results indicate that EMR Serverless Storage delivers cost and performance benefits when shuffle data exceeds 10 GB, with larger gains for shuffle‑intensive workloads.
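The shuffle-size bands used in these results can be sketched as a small lookup that maps a job's measured shuffle volume to the outcome reported for that band. This is only an illustration: the function name shuffle_bucket is hypothetical, treating "GB" as 2**30 bytes is an assumption, and so is the handling of the exact boundaries (the article does not say which side of 10 GB or 100 GB is inclusive).

```python
GB = 1024 ** 3  # assumption: the article's "GB" is treated as 2**30 bytes


def shuffle_bucket(shuffle_bytes: int) -> str:
    """Map a job's total shuffle volume to one of the benchmark's size bands."""
    size_gb = shuffle_bytes / GB
    if size_gb < 10:
        return "<10 GB"
    if size_gb < 100:
        return "10-100 GB"
    if size_gb <= 200:
        return "100-200 GB"
    return ">200 GB"


# Outcomes as reported in this article for each band.
REPORTED = {
    "<10 GB": "no clear cost or performance advantage",
    "10-100 GB": "average cost -13.32%, runtime -6.5%",
    "100-200 GB": "average cost -55.16%, runtime -25.35%",
    ">200 GB": "above the 200 GB per-job limit; the job fails",
}

if __name__ == "__main__":
    bucket = shuffle_bucket(150 * GB)
    print(bucket, "->", REPORTED[bucket])
```

The reported figures are averages from one benchmark run, so a lookup like this is a starting point for deciding which jobs to test, not a prediction for a specific workload.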

Limitations

As of 2025‑12‑12, Serverless Storage is supported only on EMR Serverless (EMR 7.12+); it is not available on EMR on EC2 or EMR on EKS.

Each job can store a maximum of 200 GB of intermediate results; jobs exceeding this limit fail.

Worker configurations of 1 or 2 vCPU are not supported.
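These limits lend themselves to a quick pre-flight check before moving a job over. The sketch below is hypothetical (the function names and the release-label parsing are assumptions, not part of any AWS tooling); it only encodes the three constraints listed above.

```python
import re

MAX_SHUFFLE_GB = 200        # per-job limit on intermediate data
UNSUPPORTED_VCPUS = {1, 2}  # worker sizes that are not supported
MIN_RELEASE = (7, 12)       # first EMR release with Serverless Storage


def parse_release(label: str) -> tuple:
    """Extract (major, minor) from a label such as 'emr-7.12.0'."""
    match = re.search(r"(\d+)\.(\d+)", label)
    if not match:
        raise ValueError(f"unrecognized release label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def serverless_storage_blockers(release_label: str, worker_vcpus: int,
                                est_shuffle_gb: float) -> list:
    """Return the reasons a job cannot use Serverless Storage (empty if none)."""
    blockers = []
    if parse_release(release_label) < MIN_RELEASE:
        blockers.append("release is older than EMR 7.12")
    if worker_vcpus in UNSUPPORTED_VCPUS:
        blockers.append(f"{worker_vcpus} vCPU workers are not supported")
    if est_shuffle_gb > MAX_SHUFFLE_GB:
        blockers.append(f"estimated shuffle exceeds {MAX_SHUFFLE_GB} GB; the job would fail")
    return blockers


if __name__ == "__main__":
    print(serverless_storage_blockers("emr-7.12.0", 4, 150))
```

The shuffle estimate would come from a previous run of the same job, for example from its Spark event log.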

Shuffle Data Extraction Tool

Shuffle size can be obtained by parsing Spark event logs stored in S3. Example configuration for EMR Serverless:

"s3MonitoringConfiguration": {"logUri": "s3://xxxx/logs/spark-event-log"}

A Python script analyze_spark_shuffle.py parses the event logs and aggregates shuffle read/write bytes and records. A simplified excerpt of the script:

#!/usr/bin/env python3
"""Spark Event Log Shuffle Analyzer"""
import argparse, json, boto3, csv, logging
# ... (argument parsing, S3 listing, log parsing) ...
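The elided log-parsing step is not shown in the article. As a minimal sketch of how such aggregation can work, and not the author's actual implementation, the function below sums shuffle metrics from SparkListenerTaskEnd events in a JSON-lines event log. The function name summarize_shuffle and the output keys are hypothetical; the event and metric field names follow the Spark event log JSON format.

```python
import json


def summarize_shuffle(lines):
    """Sum shuffle read/write bytes and records over all task-end events."""
    totals = {
        "shuffle_read_bytes": 0,
        "shuffle_read_records": 0,
        "shuffle_write_bytes": 0,
        "shuffle_write_records": 0,
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics") or {}
        read = metrics.get("Shuffle Read Metrics") or {}
        write = metrics.get("Shuffle Write Metrics") or {}
        # Shuffle read volume is the sum of remote and local bytes.
        totals["shuffle_read_bytes"] += (
            read.get("Remote Bytes Read", 0) + read.get("Local Bytes Read", 0)
        )
        totals["shuffle_read_records"] += read.get("Total Records Read", 0)
        totals["shuffle_write_bytes"] += write.get("Shuffle Bytes Written", 0)
        totals["shuffle_write_records"] += write.get("Shuffle Records Written", 0)
    return totals
```

The function accepts any iterable of lines, so it can be fed an open local file or the decoded lines of an object downloaded from S3.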

The script can be executed in parallel using the --threads option, e.g.:

uv run python analyze_spark_shuffle.py \
    --event-log-base-path s3://xxxxx/spark-event-log/ \
    --application-id xxxxx \
    --job-ids xxxx,xxxx \
    --threads 5
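How the script uses --threads internally is not shown. One plausible approach, sketched here under that assumption, is to fan the per-job analysis out over a thread pool, which suits work dominated by S3 downloads. The names analyze_jobs and analyze_one are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_jobs(job_ids, analyze_one, threads=5):
    """Run analyze_one(job_id) for each job in a thread pool; return {job_id: result}."""
    job_ids = list(job_ids)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with job_ids.
        results = list(pool.map(analyze_one, job_ids))
    return dict(zip(job_ids, results))
```

Here analyze_one stands in for whatever function downloads and parses a single job's event log.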

MCP Integration

An MCP definition allows the analyzer to be invoked from tools such as Kiro CLI, producing an HTML report:

{
  "mcpServers": {
    "spark-eventlog": {
      "type": "stdio",
      "command": "uvx",
      "args": ["--from","git+https://github.com/yhyyz/spark-eventlog-mcp","spark-eventlog-mcp"],
      "env": {"MCP_TRANSPORT": "stdio"}
    }
  }
}

With this MCP server configured, users can request an analysis of a Spark event log directly from the client.

Summary

EMR Serverless Storage is best suited to shuffle-heavy Spark jobs (more than 10 GB of shuffle data), with the largest savings for workloads that shuffle more than 100 GB while staying under the 200 GB per-job limit. For smaller shuffle volumes the benchmark showed no clear advantage, and the default per-executor storage may be more economical. The Python script and MCP server described above make it quick to measure shuffle size and judge whether a job is worth migrating.

[Figure: Performance comparison chart]
[Figure: Cost vs. performance chart]