Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases
This article is a comprehensive guide to the Spark UI. It walks through each primary and secondary tab, the key metrics they expose, and how to interpret them to detect performance bottlenecks, followed by two detailed case studies and practical tuning recommendations for Spark workloads.
Overview
Spark UI is the built‑in web interface of Apache Spark. It offers real‑time visual insight into job execution, stages, tasks, SQL plans, executor resources, storage status, and the runtime environment, helping developers and operators locate bottlenecks such as data skew, shuffle overhead, or scheduler delay.
Primary (first‑level) tabs
Executors
The Executors tab contains a Summary section (aggregated metrics of all executors) and an Executors table that shows per‑executor details such as CPU cores, memory, disk, task count and resource usage, allowing quick detection of overloaded or idle executors.
Environment
The Environment tab lists Spark configuration properties, JVM settings and classpath entries, giving a snapshot of the runtime environment so users can verify that the job runs with the intended parameters.
Storage
The Storage tab displays cached RDD/DataFrame information, including number of cached partitions, fraction cached, memory size and disk size, which is essential for diagnosing memory pressure and OOM risks.
SQL
The SQL tab visualises logical and physical execution plans for Spark SQL queries, exposing stages, shuffle operations and operator costs, useful for diagnosing slow queries and validating adaptive query execution settings.
Stages
The Stages tab shows each stage’s DAG, event timeline and task‑level metrics (summary and detailed view). It is the main entry for pinpointing slow tasks, data skew and shuffle bottlenecks.
Jobs
The Jobs tab provides a top‑level view of all Spark jobs, their status, duration and failure information, serving as the entry point for deeper analysis of individual jobs.
Secondary (detail) pages
SQL, Jobs and Stages each have a second‑level detail page that expands the high‑level view into full execution DAGs, task timelines and fine‑grained metrics such as shuffle read/write, spill, GC time and locality level.
Key metrics and their interpretation
Executor metrics – CPU, memory, disk, task count.
Storage metrics – Cached partitions, fraction cached, memory vs disk size.
Shuffle metrics – Write/Read time, data size, spill (memory & disk) and the derived “Explosion Ratio”.
Task metrics – Duration, shuffle read size, GC time, locality level.
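As an illustration of how task‑level metrics can be read, the sketch below (plain Python, with hypothetical task durations standing in for numbers copied from a Stages detail page) applies a common rule of thumb for spotting data skew: flag a stage when its slowest task runs far longer than the median task.

```python
from statistics import median

def detect_skew(task_durations_ms, threshold=3.0):
    """Flag a stage as potentially skewed when the slowest task
    runs `threshold` times longer than the median task.
    The 3x threshold is a rule of thumb, not a Spark constant."""
    med = median(task_durations_ms)
    return max(task_durations_ms) / med > threshold

# Hypothetical durations read off a Stages detail page (ms):
durations = [1200, 1350, 1280, 1310, 9800]  # one straggler
print(detect_skew(durations))  # True -> investigate data skew
```

The same pattern works for shuffle read size or GC time: compare the max against the median rather than the mean, since a single straggler distorts the mean.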
Optimization case studies
Case 1: Slow scan and memory pressure
A long‑running stage showed tasks processing ~25 MB each (far below the ideal 128‑256 MB) and high scheduler delay due to many small tasks. Solution: increase table split size (e.g., spark.sql.odps.split.size=512MB) and raise executor memory, which reduced stage time by ~20 minutes.
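To see why raising the split size helps, a quick back‑of‑the‑envelope calculation is useful. The sketch below (plain Python; the 25 MB and 512 MB figures come from the case above, while the 100 GB table size is a hypothetical example) shows how the split size drives the number of scan tasks, and therefore the per‑task scheduling overhead.

```python
def num_tasks(table_bytes, split_bytes):
    """Approximate number of scan tasks for a given split size
    (ceiling division)."""
    return -(-table_bytes // split_bytes)

MB = 1024 * 1024
table = 100 * 1024 * MB            # hypothetical 100 GB table

print(num_tasks(table, 25 * MB))   # 4096 small tasks -> high scheduler delay
print(num_tasks(table, 512 * MB))  # 200 larger tasks -> less overhead
```

Fewer, larger tasks amortize scheduling cost, but each task now needs more executor memory, which is why the fix also raised executor memory.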
Case 2: Insufficient parallelism after shuffle
Shuffle stages suffered from low parallelism. Adjusted adaptive query execution parameters such as spark.sql.adaptive.enabled, spark.sql.shuffle.partitions, and advisory partition size to increase the number of shuffle partitions, thereby improving throughput.
General tuning guidelines
Balance parallelism and memory: more parallelism reduces per‑task memory but may increase spill; more memory reduces parallelism and can cause longer GC.
Use the “Explosion Ratio” (Spill Disk / Spill Memory) to estimate memory pressure.
Configure executor memory and cores so that each core gets 4‑8 GB memory and total cluster parallelism matches the workload.
Prefer broadcast joins for small tables to avoid shuffle.
Conclusion
Effective Spark performance tuning relies on the layered design of Spark UI – high‑level overviews for quick health checks and deep‑dive detail pages for precise diagnosis. By monitoring executor, storage, shuffle and task metrics, practitioners can systematically identify whether bottlenecks stem from memory, parallelism, or scheduling, and apply targeted configuration changes.
DeWu Technology
A platform for sharing and discussing tech knowledge, guiding you toward the cloud of technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
