Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains

This article details a real-world investigation of Spark SQL job latency on a YARN cluster and explains how switching the scheduler to FAIR mode, creating resource pools, and consolidating small Parquet files dramatically reduced scheduler delay and cut execution time from over 100 seconds to under 20 seconds.

Big Data Technology & Architecture

Jan 5, 2021

The author observed that running Spark SQL jobs concurrently on a Spark 1.6.1 cluster managed by YARN produced little speedup: many simple aggregation queries each occupied far more cluster resources than they needed, so other jobs spent a long time waiting.

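The first mitigation was to set spark.scheduler.mode to FAIR instead of the default FIFO. Under FIFO, the earliest job gets priority on all available resources; under FAIR, tasks from concurrent jobs are assigned in round-robin fashion, so later jobs can start without waiting for earlier ones to finish.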
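As a rough illustration of that setting, the Python sketch below turns it into spark-submit arguments and, optionally, a SparkConf. Only the property spark.scheduler.mode and its value come from the article; the names FAIR_SETTINGS, spark_submit_conf_args, and build_spark_conf are hypothetical helpers invented for this example, and build_spark_conf assumes pyspark is installed.

```python
# Hypothetical helpers; only spark.scheduler.mode=FAIR comes from the article.
FAIR_SETTINGS = {"spark.scheduler.mode": "FAIR"}


def spark_submit_conf_args(settings):
    """Render a settings dict as '--conf key=value' arguments for spark-submit."""
    args = []
    for key in sorted(settings):
        args.extend(["--conf", "{}={}".format(key, settings[key])])
    return args


def build_spark_conf(settings):
    """Build a SparkConf from the same settings (assumes pyspark is installed)."""
    from pyspark import SparkConf  # imported lazily so the module loads without Spark

    conf = SparkConf()
    for key, value in settings.items():
        conf.set(key, value)
    return conf


if __name__ == "__main__":
    print(" ".join(spark_submit_conf_args(FAIR_SETTINGS)))
```

Note that this setting governs how jobs inside one Spark application share that application's executors; it is separate from the YARN scheduler that allocates resources between applications.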

Next, ten scheduler pools were created to partition the available resources. Each job could be directed to a specific pool, which effectively divided the cluster into ten separate shares.
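The article does not show its pool definitions. As a sketch under stated assumptions, the Python below generates a fair scheduler allocation file with ten pools in the XML layout Spark reads (an allocations root with pool elements carrying schedulingMode, weight, and minShare). The pool names pool_1 to pool_10 and the chosen values are illustrative, not the author's configuration.

```python
import xml.etree.ElementTree as ET


def build_allocations(pool_names, scheduling_mode="FAIR", weight=1, min_share=0):
    """Build an <allocations> tree with one <pool> per name.

    weight=1 and min_share=0 mirror Spark's documented pool defaults;
    the article does not state which values were actually used.
    """
    root = ET.Element("allocations")
    for name in pool_names:
        pool = ET.SubElement(root, "pool", {"name": name})
        ET.SubElement(pool, "schedulingMode").text = scheduling_mode
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return root


def write_allocations(root, path):
    """Write the tree to disk, e.g. as fairscheduler.xml."""
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


# Hypothetical pool names: pool_1 ... pool_10.
pool_names = ["pool_{}".format(i) for i in range(1, 11)]
allocations = build_allocations(pool_names)
xml_text = ET.tostring(allocations, encoding="unicode")
```

Spark locates such a file through the spark.scheduler.allocation.file property. Pools are shares of a single application's resources that are balanced by weight and minShare, so they should not be read as hard isolation between jobs.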

When submitting a job, the target pool was specified as a local property on the SparkContext:

sc.setLocalProperty("spark.scheduler.pool", "your_pool_id")

Despite these changes, running the jobs in parallel reduced the runtime by only about 50%. A closer look at the Spark Web UI showed that most of the remaining time was scheduler delay in one stage with 336 tasks, where each task waited about 0.5 s before it started executing.
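To get a feel for how much a 0.5 s wait per task adds up across 336 tasks, the following back-of-the-envelope Python model may help. The task count and per-task delay are from the article; the number of concurrent task slots (16) is a made-up value, since the article does not give the cluster size, and the "waves" model is a deliberate simplification.

```python
import math


def scheduler_delay_estimate(num_tasks, delay_per_task_s, concurrent_slots):
    """Rough model of scheduler delay for one stage.

    Returns (summed delay over all tasks, number of task waves,
    approximate wall-clock delay if each wave pays the delay once).
    """
    total_task_delay = num_tasks * delay_per_task_s
    waves = int(math.ceil(num_tasks / float(concurrent_slots)))
    wall_clock_delay = waves * delay_per_task_s
    return total_task_delay, waves, wall_clock_delay


# 336 tasks and 0.5 s are from the article; 16 slots is an assumption.
total, waves, wall = scheduler_delay_estimate(336, 0.5, concurrent_slots=16)
print(total, waves, wall)  # 168.0 21 10.5
```

Under these assumptions the stage accumulates 168 task-seconds of pure waiting, and with 100 jobs competing for the same slots that overhead is paid many times over. The number of tasks is the lever that matters, because the delay scales with it directly.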

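The root cause was identified as an excessive number of tiny Parquet files (over 300 files, many as small as 1 KB). Each file is read as at least one input partition and therefore produces at least one task, so hundreds of tiny files meant hundreds of tasks that did almost no work but still had to be scheduled. Consolidating these files into larger ones reduced the number of tasks and the associated scheduler delay.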
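The article does not say how the files were consolidated. One common approach, sketched here as an assumption rather than the author's method, is to read the small files, coalesce to a handful of partitions, and write the result to a new location. The function names, the 128 MB target size, and the paths a caller would pass are all hypothetical; sql_context is expected to be a Spark SQLContext (or anything exposing the same read/write interface).

```python
def target_file_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output files needed so each is at most target_file_bytes.

    The 128 MB default is an assumed target, not a figure from the article.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division, with a floor of one output file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


def consolidate_parquet(sql_context, src_path, dst_path, num_files):
    """Rewrite the Parquet data under src_path into num_files files at dst_path.

    coalesce() merges partitions without a full shuffle; each resulting
    partition is written as one Parquet part file.
    """
    df = sql_context.read.parquet(src_path)
    df.coalesce(num_files).write.parquet(dst_path)
```

Writing to a separate destination path, and switching readers over only after the new files are verified, avoids reading and overwriting the same directory in a single job.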

Command‑line listings of HDFS showed the file distribution before and after consolidation, and performance tests demonstrated a reduction from 99 s (2016‑11‑17 data) to 16 s (2016‑12‑12 data) for 100 concurrent jobs.
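The listings themselves are not reproduced here. As a hedged sketch of how such a listing could be summarized, the Python below parses the text output of a plain hdfs dfs -ls call (without the -h flag, so the fifth column is a byte count) and counts how many files fall under a size threshold. The function name and the 1 MB threshold are illustrative choices.

```python
def small_file_stats(ls_output, threshold_bytes=1024 * 1024):
    """Summarize file sizes from 'hdfs dfs -ls <dir>' text output.

    Expected columns per entry: permissions, replication, owner, group,
    size in bytes, date, time, path. Directories and the leading
    'Found N items' line are skipped.
    """
    sizes = []
    for line in ls_output.splitlines():
        parts = line.split(None, 7)
        if len(parts) < 8 or parts[0].startswith("d"):
            continue  # header line, blank line, or directory entry
        if not parts[4].isdigit():
            continue  # unexpected format, e.g. human-readable sizes
        sizes.append(int(parts[4]))
    small = [s for s in sizes if s < threshold_bytes]
    return {
        "files": len(sizes),
        "small_files": len(small),
        "total_bytes": sum(sizes),
    }
```

Running this against the listing for a partition directory before and after consolidation gives a quick check that the file count, and therefore the task count, actually dropped.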

Final observations note that while the optimization achieved a ten‑fold speedup, remaining scheduler delay and task deserialization time suggest further tuning is possible, and that future improvements must balance resource constraints with user requirements.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
