Performance Tuning of Hive on Spark in YARN Mode
This article explains how to optimize Hive on Spark running on YARN, covering YARN node resource configuration, Spark executor and driver memory settings, dynamic allocation, parallelism, and key Hive parameters to achieve superior performance compared to Hive on MapReduce.
Hive on Spark delivers much better performance than Hive on MapReduce while providing the same functionality, and HiveQL can run unchanged.
This guide focuses on tuning Hive on Spark when it runs in YARN mode, assuming a node with 32 CPU cores and 120 GB memory.
YARN resource configuration
Set the number of vcores and memory available to YARN based on the node’s capacity:
yarn.nodemanager.resource.cpu-vcores=28
yarn.nodemanager.resource.memory-mb=102400

Reserve some cores and memory for the OS, the HDFS DataNode, and the NodeManager, leaving the rest for YARN.
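The split above can be sketched as simple arithmetic. This is an illustrative helper, not a YARN API; the reservation amounts (4 cores, 20 GB) are assumptions that match the example node, and you should size them for your own daemons:

```python
def yarn_resources(total_cores, total_mem_gb, reserved_cores=4, reserved_mem_gb=20):
    """Subtract the OS/DataNode/NodeManager reservation and give the rest to YARN."""
    vcores = total_cores - reserved_cores
    mem_mb = (total_mem_gb - reserved_mem_gb) * 1024  # YARN takes this value in MB
    return vcores, mem_mb

# 32-core / 120 GB node from the example
vcores, mem_mb = yarn_resources(32, 120)
print(f"yarn.nodemanager.resource.cpu-vcores={vcores}")   # 28
print(f"yarn.nodemanager.resource.memory-mb={mem_mb}")    # 102400
```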
Spark executor and driver settings
Allocate executor cores (e.g., 4) so that the total number of executors fits the available cores (28 cores → 7 executors). Calculate executor memory (≈14 GB) and set overhead to 15‑20%:
spark.executor.cores=4
spark.executor.memory=12g
spark.executor.memoryOverhead=2g

Choose driver memory based on the total YARN memory per node (X): 12 GB if X > 50 GB, 4 GB if 12 GB < X < 50 GB, 1 GB if 1 GB < X < 12 GB, otherwise 256 MB.
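The executor and driver sizing rules above can be sketched as follows. The function names and the 15% overhead fraction are illustrative assumptions; the driver rule mirrors the thresholds given in the text:

```python
def executor_sizing(yarn_mem_mb, yarn_vcores, cores_per_executor=4, overhead_frac=0.15):
    """Split per-node YARN memory across executors, carving out the overhead slice."""
    executors = yarn_vcores // cores_per_executor       # 28 cores // 4 = 7 executors
    total_per_exec_mb = yarn_mem_mb // executors        # ~14 GB per executor
    overhead_mb = int(total_per_exec_mb * overhead_frac)
    heap_mb = total_per_exec_mb - overhead_mb           # goes to spark.executor.memory
    return executors, heap_mb, overhead_mb

def driver_memory_mb(yarn_mem_gb):
    """Driver memory rule of thumb from the text, keyed on total YARN memory X."""
    if yarn_mem_gb > 50:
        return 12 * 1024
    if yarn_mem_gb > 12:
        return 4 * 1024
    if yarn_mem_gb > 1:
        return 1024
    return 256

execs, heap, overhead = executor_sizing(102400, 28)
print(execs, heap, overhead)        # 7 executors, ~12 GB heap, ~2 GB overhead
print(driver_memory_mb(100))        # 12288 (12 GB, since X > 50 GB)
```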
Number of executors and dynamic allocation
Maximum executors per node = 7; total executors = nodes × 7 (e.g., 40 nodes → 280 executors). Use static allocation for benchmarks, but enable dynamic allocation in multi‑user production environments.
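The cluster-wide executor count is just nodes times executors per node; the dynamic-allocation property names in the comments are standard Spark settings, but the min/max values you pick depend on your workload:

```python
def total_executors(nodes, executors_per_node=7):
    """Static allocation: one figure for the whole cluster."""
    return nodes * executors_per_node

print(total_executors(40))  # 280, matching the 40-node example

# For multi-user production clusters, prefer dynamic allocation instead of a
# fixed spark.executor.instances, e.g. (values illustrative):
#   spark.dynamicAllocation.enabled=true
#   spark.shuffle.service.enabled=true      # required for dynamic allocation on YARN
#   spark.dynamicAllocation.minExecutors=1
#   spark.dynamicAllocation.maxExecutors=280
```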
Parallelism and reducer settings
Ensure enough tasks are generated to keep all executors busy. Adjust hive.exec.reducers.bytes.per.reducer to control reducer count; Spark is less sensitive to this value than MapReduce.
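Hive derives the reducer count by dividing the shuffle input size by hive.exec.reducers.bytes.per.reducer, so the effect of the setting is easy to estimate. A rough sketch (ignoring Hive's min/max reducer caps):

```python
BYTES_PER_REDUCER = 67108864  # hive.exec.reducers.bytes.per.reducer (64 MB)

def estimated_reducers(input_bytes, bytes_per_reducer=BYTES_PER_REDUCER):
    """Approximate reducer count: ceiling of input size over bytes-per-reducer."""
    return max(1, -(-input_bytes // bytes_per_reducer))  # ceiling division

print(estimated_reducers(10 * 1024**3))  # 10 GB of shuffle input -> 160 reducers
```

Lowering the setting raises parallelism; since Spark tolerates many small reduce tasks better than MapReduce does, a smaller value is usually safe here.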
Hive configuration parameters
Key settings that affect performance include:
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.merge.mapfiles=true
hive.merge.mapredfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.sparkfiles=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=20M // increase for Spark, e.g., 200M
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.optimize.ppd=true

Set hive.auto.convert.join.noconditionaltask.size to a larger value for Spark than for MapReduce, because Spark interprets it against the rawDataSize statistic rather than totalSize.
Pre‑warming YARN containers
Enable hive.prewarm.enabled=true and set hive.prewarm.numcontainers (default 10) to reduce first‑query latency by pre‑starting executors.