Spark Configuration Parameters and Performance Tuning Guidelines
This article explains the purpose, default values, and practical tuning recommendations for common spark-submit options, covering executor counts, memory settings, shuffle behavior, speculation, and several Spark SQL configurations, to help users optimize big-data workloads.
The most frequently used spark-submit parameters are listed below, each with its purpose, its default value, and a tuning recommendation.
--num-executors: sets the total number of executors requested for the application; choose it based on the resources available in your queue to avoid under- or over-allocation.
--executor-memory: memory per executor (typically 4-8 GB); num-executors × executor-memory must not exceed the queue's memory limit.
--executor-cores: CPU cores per executor (usually 2-4); num-executors × executor-cores must stay within the queue's CPU capacity.
--driver-memory: memory for the driver process (1 GB is usually sufficient unless the job collects large results to the driver).
--total-executor-cores: total CPU cores used by all executors (in standalone mode the default is all available cores).
--conf spark.default.parallelism: default number of partitions, and therefore tasks per stage, for shuffle operations that do not set one explicitly; a good rule of thumb is 2-3 × (num-executors × executor-cores) to keep tasks well balanced.
--conf spark.storage.memoryFraction: fraction of executor memory reserved for cached RDDs (default 0.6); increase it for heavy persistence, decrease it if shuffle dominates or GC is frequent. This is a legacy setting: from Spark 1.6 onward it takes effect only when spark.memory.useLegacyMode is enabled.
--conf spark.shuffle.memoryFraction: fraction of executor memory used for shuffle aggregation (default 0.2); raise it when shuffle is heavy and persistence is light, lower it if GC pressure is observed. The same legacy-mode caveat applies.
--conf spark.sql.codegen: when true, Spark SQL compiles queries to Java bytecode at runtime, which speeds up large or repeated queries but can slow down small ones because of compilation overhead.
--conf spark.sql.inMemoryColumnarStorage.compressed: enables compression of in-memory columnar storage (default false in early Spark 1.x releases; later releases default to true).
--conf spark.sql.inMemoryColumnarStorage.batchSize: batch size for columnar caching (default 1000 in early Spark 1.x releases; later releases default to 10000); larger values may cause OOM, smaller values reduce compression efficiency.
--conf spark.sql.parquet.compression.codec: compression codec for Parquet output (default snappy since Spark 2.0; alternatives include uncompressed, gzip, and lzo).
--conf spark.speculation: enables speculative execution, which launches backup copies of slow tasks; related settings are spark.speculation.interval (default 100 ms, how often to check for slow tasks), spark.speculation.multiplier (default 1.5, how many times slower than the median a task must be), and spark.speculation.quantile (default 0.75, the fraction of tasks in a stage that must finish before speculation starts).
--conf spark.shuffle.consolidateFiles: when HashShuffleManager is in use, merges shuffle output files to reduce I/O (default false); enabling it can improve performance when there are many shuffle read tasks.
--conf spark.shuffle.file.buffer: buffer size for shuffle writes (default 32 KB); increase it (e.g., to 64 KB) if memory permits, to reduce the number of disk writes.
--conf spark.reducer.maxSizeInFlight: buffer size for shuffle reads (default 48 MB); raising it can lower the number of network round trips when memory is ample.
--conf spark.shuffle.io.maxRetries: maximum number of retries for a failed shuffle read (default 3); increase it for large-scale shuffles to improve stability.
--conf spark.shuffle.io.retryWait: wait time between shuffle read retries (default 5 s); longer intervals can help on unstable networks.
--conf spark.shuffle.manager: selects the shuffle manager (sort, hash, or tungsten-sort); choose based on whether sorting is required and on performance considerations.
--conf spark.shuffle.sort.bypassMergeThreshold: when there is no map-side aggregation and the number of shuffle read tasks is at most this value (default 200), SortShuffleManager bypasses sorting; raising it can avoid unnecessary sorting overhead.
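The sizing rules above (executors × memory and executors × cores within the queue limits, parallelism at 2-3 × total cores) can be sketched as a small helper. This is only an illustration under stated assumptions: plan_resources and build_submit_command are hypothetical names, the queue capacities and my_job.py are made-up inputs, and the calculation ignores per-executor memory overhead and the driver's own resources, which a real cluster manager also counts against the queue.

```python
import shlex


def plan_resources(queue_cores, queue_memory_gb,
                   executor_cores=4, executor_memory_gb=8,
                   parallelism_factor=2):
    """Pick an executor count that fits the queue and derive parallelism.

    Hypothetical helper: the queue limits are whatever your cluster
    administrator allots; overhead memory is deliberately ignored here.
    """
    # The executor count is bounded by both the CPU and the memory budget.
    by_cores = queue_cores // executor_cores
    by_memory = queue_memory_gb // executor_memory_gb
    num_executors = min(by_cores, by_memory)
    if num_executors < 1:
        raise ValueError("queue is too small for even one executor")

    total_cores = num_executors * executor_cores
    return {
        "num_executors": num_executors,
        "executor_cores": executor_cores,
        "executor_memory_gb": executor_memory_gb,
        # Rule of thumb from the text: 2-3 x (num-executors x executor-cores).
        "parallelism": parallelism_factor * total_cores,
    }


def build_submit_command(plan, app, extra_conf=None):
    """Assemble a spark-submit argument list from a resource plan."""
    conf = {"spark.default.parallelism": str(plan["parallelism"])}
    conf.update(extra_conf or {})

    cmd = [
        "spark-submit",
        "--num-executors", str(plan["num_executors"]),
        "--executor-cores", str(plan["executor_cores"]),
        "--executor-memory", f'{plan["executor_memory_gb"]}g',
        "--driver-memory", "1g",
    ]
    # Sort the keys so the generated command is stable across runs.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Assumed queue: 200 cores and 400 GB of memory.
    plan = plan_resources(queue_cores=200, queue_memory_gb=400)
    command = build_submit_command(
        plan, "my_job.py", {"spark.speculation": "true"})
    print(shlex.join(command))
```

With the assumed queue of 200 cores and 400 GB, the helper settles on 50 executors of 4 cores and 8 GB each and sets spark.default.parallelism to 400. Returning the command as a list rather than a string keeps it safe to pass directly to subprocess.run without shell quoting concerns.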
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
