Comprehensive Spark Optimization Guide: Development, Resource, Skew, Shuffle, and Additional Tips
This article presents a detailed summary of Meituan's Spark optimization techniques, covering development‑level RDD tuning, resource parameter configuration, data‑skew mitigation, shuffle improvements, and the advantages of using DataFrame/Dataset APIs for better performance.
Overview : Spark performance bottlenecks usually stem from limited cluster resources (CPU, network bandwidth, memory) and excessive serialization and shuffle overhead.
Development tuning :
Avoid creating duplicate RDDs for the same data; reuse a single RDD when possible.
Persist frequently used RDDs to avoid recomputation.
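Minimize use of shuffle operators; where a shuffle is unavoidable, prefer operators that combine on the map side (e.g., reduceByKey or aggregateByKey instead of groupByKey) and other high-performance operators such as mapPartitions and foreachPartition.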
Broadcast large read-only variables so that each executor holds one copy instead of one copy per task, which cuts network transfer and OOM risk.
Use Kryo for faster serialization.
Optimize data structures: prefer primitive types (Int, Long), strings, and arrays over Java objects and heavy collections such as HashMap or LinkedList.
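The benefit of map-side combine can be sketched without a cluster. The following is a plain-Python model, not Spark API code: the function names and the sample partitions are hypothetical, and the only point is to count how many records would cross the network with and without a per-partition pre-aggregation.

```python
from collections import defaultdict

def shuffle_without_combine(partitions):
    """Model groupByKey-style shuffling: every (key, value) record is shuffled."""
    return [record for part in partitions for record in part]

def shuffle_with_combine(partitions):
    """Model reduceByKey-style shuffling: each partition pre-sums its own keys."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine inside one partition
        shuffled.extend(local.items())   # at most one record per key per partition
    return shuffled

def reduce_side(records):
    """Final aggregation on the reduce side; the same for both strategies."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

# Hypothetical input: two map partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]
plain = shuffle_without_combine(partitions)
combined = shuffle_with_combine(partitions)
print(len(plain), len(combined), reduce_side(combined))
```

Both strategies produce the same totals; only the number of shuffled records differs, and the gap widens as keys repeat more often within a partition.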
Resource parameter tuning :
Executor configuration: spark.executor.memory, spark.executor.instances, spark.executor.cores.
Driver configuration: spark.driver.memory (1-4 GB is usually enough unless the job collects large results to the driver), spark.driver.cores.
Parallelism: spark.default.parallelism (RDD API) and spark.sql.shuffle.partitions (DataFrame/Dataset API).
Network timeout: spark.network.timeout.
Data locality: spark.locality.wait.
JVM/GC options: spark.executor.extraJavaOptions, spark.driver.extraJavaOptions.
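As a rough sketch, the parameters above can be kept in one mapping and rendered as spark-submit arguments. The property names are the ones listed in this section; every value is an illustrative assumption rather than a recommendation, and the helper to_submit_args is hypothetical. The parallelism of 600 assumes the common rule of thumb of about 2-3 tasks per executor core (50 executors x 4 cores x 3).

```python
def to_submit_args(conf):
    """Render a {property: value} mapping as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# Illustrative values only; size them to your own cluster and queue limits.
example_conf = {
    "spark.executor.instances": 50,
    "spark.executor.cores": 4,
    "spark.executor.memory": "8g",
    "spark.driver.memory": "2g",
    "spark.default.parallelism": 600,      # RDD API
    "spark.sql.shuffle.partitions": 600,   # DataFrame/Dataset API
    "spark.network.timeout": "300s",
    "spark.locality.wait": "3s",
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
}
submit_args = to_submit_args(example_conf)
print(" ".join(["spark-submit"] + submit_args))
```

Keeping the settings in one place makes it easier to change a single parameter per run and compare the effect in the Spark UI.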
Data skew tuning :
Pre‑process data with Hive ETL to reduce skew.
Filter out the few keys that cause the skew when they are not needed for the result.
Increase shuffle parallelism to spread load across more tasks.
Two‑stage aggregation: add random prefix → local aggregate → remove prefix → global aggregate (effective for sum/count).
Convert reduce‑join to map‑join by broadcasting the small table.
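For joins where both sides are large, salt the join key: add a random prefix in [0, N) to each record of the skewed RDD and replicate each record of the other RDD N times, once per prefix (note the N-fold growth of the replicated side).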
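The two salting techniques in this list can be sketched in plain Python. This is a model of the idea, not Spark code: the function names, the sample data, and the choice of N = 4 are assumptions made for illustration. It shows that two-stage aggregation returns the same sums as a direct aggregation, and that a salted join returns the same rows as an ordinary inner join.

```python
import random
from collections import defaultdict

def sum_by_key(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def two_stage_sum(pairs, n_salts, rng):
    # Stage 1: add a random prefix so one hot key spreads over up to n_salts groups.
    salted = [((rng.randrange(n_salts), key), value) for key, value in pairs]
    partial = sum_by_key(salted)                        # local aggregate
    # Stage 2: drop the prefix and aggregate the partial sums globally.
    unsalted = [(key, value) for (_, key), value in partial.items()]
    return sum_by_key(unsalted)

def salted_join(skewed, other, n_salts, rng):
    # Skewed side: one random prefix per record.
    left = [((rng.randrange(n_salts), k), v) for k, v in skewed]
    # Other side: every record replicated once per prefix (N-fold growth).
    right = defaultdict(list)
    for k, v in other:
        for salt in range(n_salts):
            right[(salt, k)].append(v)
    # Join on the salted key, then drop the prefix from the output.
    return [(k, lv, rv) for (salt, k), lv in left for rv in right.get((salt, k), [])]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
events = [("hot", 1)] * 1000 + [("cold", 1)] * 3
aggregated = two_stage_sum(events, 4, rng)

skewed_side = [("hot", i) for i in range(5)] + [("cold", 99)]
other_side = [("hot", "H"), ("cold", "C"), ("unused", "U")]
joined = salted_join(skewed_side, other_side, 4, rng)
```

Two-stage aggregation only works for operations whose partial results can be merged, such as sum and count. The salted join trades skew for volume, so N should be no larger than needed to break up the hot keys.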
Shuffle tuning :
Understand shuffle write/read: map tasks write partitioned data to disk; reduce tasks pull required partitions over the network.
Prefer sort-based shuffle (introduced in Spark 1.1 and the default since 1.2), and avoid shuffle operators when possible.
Reduce shuffle data size: deduplicate each input before the union (A.distinct().union(B.distinct()).distinct() rather than A.union(B).distinct()), and replace a join with a broadcast-and-filter when feasible.
Tune buffers: spark.shuffle.file.buffer (write buffer) and spark.reducer.maxSizeInFlight (read buffer).
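Adjust spark.shuffle.sort.bypassMergeThreshold (default 200): the sort step is skipped when the shuffle has no map-side aggregation and the number of reduce partitions is at or below this threshold.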
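The write/read split and the bypass rule can be modelled in a few lines of plain Python. This is a simplified sketch, not Spark's implementation: the function names and sample data are hypothetical, crc32 stands in for a hash partitioner, and buckets live in memory rather than on disk.

```python
import zlib

def partition_of(key, num_reducers):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_write(map_input, num_reducers):
    """One map task: bucket its records by target reduce partition."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_input:
        buckets[partition_of(key, num_reducers)].append((key, value))
    return buckets

def shuffle_read(all_map_outputs, reducer_id):
    """One reduce task: pull its own bucket from every map task's output."""
    fetched = []
    for buckets in all_map_outputs:
        fetched.extend(buckets[reducer_id])
    return fetched

def bypasses_merge_sort(num_reduce_partitions, map_side_combine, threshold=200):
    """Model of the rule behind spark.shuffle.sort.bypassMergeThreshold."""
    return (not map_side_combine) and num_reduce_partitions <= threshold

# Hypothetical input: two map tasks, three reduce partitions.
map_inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
map_outputs = [shuffle_write(records, 3) for records in map_inputs]
reduce_inputs = [shuffle_read(map_outputs, r) for r in range(3)]
```

Every record for a given key lands in the same reduce partition, which is why a hot key overloads a single task and why the buffer sizes on both the write and read side matter.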
Other optimization items :
Use the DataFrame/Dataset APIs to benefit from the Catalyst optimizer, off-heap memory, and better execution plans.
Comparison of RDD, DataFrame, and Dataset:
RDD: distributed Java objects, type‑safe at compile time, but heavy serialization.
DataFrame: distributed rows with schema, supports column pruning, off‑heap storage.
Dataset: typed DataFrame, encoder‑based serialization, supports both structured and unstructured data.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
