Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases
This article is a comprehensive guide to the Spark UI. It walks through each primary and secondary tab, the key metrics they expose, and how to interpret them to detect performance bottlenecks, followed by two detailed case studies and practical tuning recommendations for Spark workloads.
Overview
Spark UI is the built‑in web interface of Apache Spark. It offers real‑time visual insight into job execution, stages, tasks, SQL plans, executor resources, storage status, and the runtime environment, helping developers and operators locate bottlenecks such as data skew, shuffle overhead, or scheduler delay.
Primary (first‑level) tabs
Executors
The Executors tab contains a Summary section (aggregated metrics of all executors) and an Executors table that shows per‑executor details such as CPU cores, memory, disk, task count and resource usage, allowing quick detection of overloaded or idle executors.
Environment
The Environment tab lists Spark configuration properties, JVM settings and classpath entries, giving a snapshot of the runtime environment so users can verify that the job runs with the intended parameters.
Storage
The Storage tab displays cached RDD/DataFrame information, including number of cached partitions, fraction cached, memory size and disk size, which is essential for diagnosing memory pressure and OOM risks.
SQL
The SQL tab visualises logical and physical execution plans for Spark SQL queries, exposing stages, shuffle operations and operator costs, useful for diagnosing slow queries and validating adaptive query execution settings.
Stages
The Stages tab shows each stage’s DAG, event timeline and task‑level metrics (summary and detailed view). It is the main entry for pinpointing slow tasks, data skew and shuffle bottlenecks.
Jobs
The Jobs tab provides a top‑level view of all Spark jobs, their status, duration and failure information, serving as the entry point for deeper analysis of individual jobs.
Secondary (detail) pages
SQL, Jobs and Stages each have a second‑level detail page that expands the high‑level view into full execution DAGs, task timelines and fine‑grained metrics such as shuffle read/write, spill, GC time and locality level.
Key metrics and their interpretation
Executor metrics – CPU, memory, disk, task count.
Storage metrics – Cached partitions, fraction cached, memory vs disk size.
Shuffle metrics – Write/Read time, data size, spill (memory & disk) and the derived “Explosion Ratio”.
Task metrics – Duration, shuffle read size, GC time, locality level.
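As an illustration of how task‑level metrics can be read, the sketch below (plain Python, with hypothetical task durations standing in for numbers copied from a Stages detail page) applies a common rule of thumb for spotting data skew: flag a stage when its slowest task runs far longer than the median task.

```python
from statistics import median

def detect_skew(task_durations_ms, threshold=3.0):
    """Flag a stage as potentially skewed when the slowest task
    runs `threshold` times longer than the median task.
    The 3x threshold is a rule of thumb, not a Spark constant."""
    med = median(task_durations_ms)
    return max(task_durations_ms) / med > threshold

# Hypothetical durations read off a Stages detail page (ms):
durations = [1200, 1350, 1280, 1310, 9800]  # one straggler
print(detect_skew(durations))  # True -> investigate data skew
```

The same pattern works for shuffle read size or GC time: compare the max against the median rather than the mean, since a single straggler distorts the mean.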
Optimization case studies
Case 1: Slow scan and memory pressure
A long‑running stage showed tasks processing ~25 MB each (far below the ideal 128‑256 MB) and high scheduler delay due to many small tasks. Solution: increase table split size (e.g., spark.sql.odps.split.size=512MB) and raise executor memory, which reduced stage time by ~20 minutes.
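To see why raising the split size helps, a quick back‑of‑the‑envelope calculation is useful. The sketch below (plain Python; the 25 MB and 512 MB figures come from the case above, while the 100 GB table size is a hypothetical example) shows how the split size drives the number of scan tasks, and therefore the per‑task scheduling overhead.

```python
def num_tasks(table_bytes, split_bytes):
    """Approximate number of scan tasks for a given split size
    (ceiling division)."""
    return -(-table_bytes // split_bytes)

MB = 1024 * 1024
table = 100 * 1024 * MB            # hypothetical 100 GB table

print(num_tasks(table, 25 * MB))   # 4096 small tasks -> high scheduler delay
print(num_tasks(table, 512 * MB))  # 200 larger tasks -> less overhead
```

Fewer, larger tasks amortize scheduling cost, but each task now needs more executor memory, which is why the fix also raised executor memory.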
Case 2: Insufficient parallelism after shuffle
Shuffle stages suffered from low parallelism. Adjusted adaptive query execution parameters such as spark.sql.adaptive.enabled, spark.sql.shuffle.partitions, and advisory partition size to increase the number of shuffle partitions, thereby improving throughput.
General tuning guidelines
Balance parallelism and memory: more parallelism reduces per‑task memory but may increase spill; more memory reduces parallelism and can cause longer GC.
Use the “Explosion Ratio” (Spill Disk / Spill Memory) to estimate memory pressure.
Configure executor memory and cores so that each core gets 4‑8 GB memory and total cluster parallelism matches the workload.
Prefer broadcast joins for small tables to avoid shuffle.
Conclusion
Effective Spark performance tuning relies on the layered design of Spark UI – high‑level overviews for quick health checks and deep‑dive detail pages for precise diagnosis. By monitoring executor, storage, shuffle and task metrics, practitioners can systematically identify whether bottlenecks stem from memory, parallelism, or scheduling, and apply targeted configuration changes.
DeWu Technology
A platform for sharing and discussing tech knowledge, guiding you toward the cloud of technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
