
Hive on Spark Tuning Parameters and Best Practices

This article explains how to tune Hive on Spark by adjusting driver, executor, and Hive configuration parameters—including CPU cores, memory allocations, dynamic allocation, and join thresholds—to achieve optimal performance when running on YARN.

Big Data Technology & Architecture

Mar 8, 2020


Hive on Spark replaces the traditional MapReduce engine for Hive queries. It is more efficient, but it needs careful parameter tuning to perform well. This article assumes that Spark runs on YARN.

Driver Parameters

spark.driver.cores sets the number of CPU cores allocated to the driver; 1 is usually sufficient. spark.driver.memory sets the driver's heap size, and spark.driver.memoryOverhead sets its off-heap allowance. The combined total is typically between 512 MB and 4 GB, split roughly 80% heap to 20% overhead, the same ratio used for executors. For example, a 1 GB total gives about 819 MB of driver memory and 205 MB of overhead.
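The 80/20 split can be checked with a few lines of arithmetic. This is only a sketch: the helper name `split_driver_memory` and the choice to round to whole megabytes are assumptions made for illustration, not part of Spark or Hive.

```python
def split_driver_memory(total_mb, heap_percent=80):
    """Split a total driver budget (MB) into heap and overhead, in whole MB.

    heap_percent=80 mirrors the 80/20 guideline in the text; the helper
    itself is hypothetical and only does the arithmetic.
    """
    heap_mb = round(total_mb * heap_percent / 100)
    overhead_mb = total_mb - heap_mb  # remainder, so the parts sum to total_mb
    return heap_mb, overhead_mb


heap_mb, overhead_mb = split_driver_memory(1024)
print(f"spark.driver.memory={heap_mb}m")
print(f"spark.driver.memoryOverhead={overhead_mb}m")
```

For a 1024 MB budget this yields 819 MB of heap and 205 MB of overhead, matching the figures quoted above.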

Executor Parameters

spark.executor.cores sets the number of CPU cores per executor. Values between 3 and 6 are recommended, because larger values can lead to HDFS race conditions. Within that range, pick the value that leaves the fewest vcores unused under yarn.nodemanager.resource.cpu-vcores. On a 32-core node with yarn.nodemanager.resource.cpu-vcores set to 28, 4 cores per executor gives 7 executors with no idle vcores. If the limit is 26, 5 cores per executor gives 5 executors with only 1 vcore idle.
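One way to choose the executor core count is to try each value in the recommended 3 to 6 range and keep the one that leaves the fewest vcores idle. The sketch below does that; the function name `best_executor_cores` and the tie-breaking rule (prefer the smaller core count) are assumptions for illustration.

```python
def best_executor_cores(node_vcores, candidates=range(3, 7)):
    """Return (cores_per_executor, executors_per_node, idle_vcores).

    node_vcores stands for yarn.nodemanager.resource.cpu-vcores.
    Ties are broken in favour of the smaller core count (an assumption).
    """
    options = []
    for cores in candidates:
        executors, idle = divmod(node_vcores, cores)
        options.append((idle, cores, executors))
    idle, cores, executors = min(options)  # fewest idle vcores first
    return cores, executors, idle


for vcores in (28, 26):
    cores, executors, idle = best_executor_cores(vcores)
    print(f"{vcores} vcores: {cores} cores x {executors} executors, {idle} idle")
```

With 28 vcores this selects 4 cores (7 executors, none idle). With 26 vcores it selects 5 cores (5 executors, 1 idle).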

Executor memory is controlled by spark.executor.memory (heap) and spark.executor.memoryOverhead (off‑heap). The total executor memory can be estimated with the formula:

yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)

From the calculated total, allocate 80% to 85% to heap memory and the remainder to off-heap memory. For a node with 120 GB of usable memory (122,880 MB), 28 vcores, and 4 cores per executor, the total is about 17,554 MB. The example figures of roughly 13,166 MB heap and 4,389 MB overhead correspond to a 75%/25% split, which gives slightly more overhead than the 80% to 85% guideline; an 80%/20% split would give about 14,043 MB and 3,511 MB. The sum of the two settings must not exceed yarn.scheduler.maximum-allocation-mb.

spark.executor.instances sets the total number of executors for a query. With ten 32-core, 128 GB nodes each running seven executors, the theoretical maximum is 70, but a practical setting is about half of that, which leaves resources for the driver and other services.

spark.dynamicAllocation.enabled should be set to true so that Spark adjusts the number of executors dynamically, which is useful in multi-tenant Hive clusters.
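The executor sizing arithmetic can be sketched as follows. The helper names `size_executor` and `executor_instances`, the integer rounding, and the `usable_fraction` parameter are assumptions for illustration; only the formula and the input numbers come from the text.

```python
def size_executor(node_memory_mb, node_vcores, executor_cores, heap_percent=80):
    """Return (total_mb, heap_mb, overhead_mb) for one executor, in whole MB.

    total = yarn.nodemanager.resource.memory-mb
            * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)
    """
    total_mb = node_memory_mb * executor_cores // node_vcores
    heap_mb = total_mb * heap_percent // 100
    overhead_mb = total_mb - heap_mb  # remainder, so heap + overhead == total
    return total_mb, heap_mb, overhead_mb


def executor_instances(nodes, executors_per_node, usable_fraction=0.5):
    """Return (theoretical_max, practical) executor counts for the cluster."""
    theoretical_max = nodes * executors_per_node
    return theoretical_max, int(theoretical_max * usable_fraction)


total_mb, heap_mb, overhead_mb = size_executor(120 * 1024, 28, 4)
print(f"total per executor: {total_mb} MB")
print(f"spark.executor.memory={heap_mb}m")
print(f"spark.executor.memoryOverhead={overhead_mb}m")

theoretical_max, practical = executor_instances(nodes=10, executors_per_node=7)
print(f"spark.executor.instances={practical} (theoretical max {theoretical_max})")
```

With the default 80% heap share this gives 14,043 MB heap and 3,511 MB overhead out of 17,554 MB. Passing `heap_percent=75` gives 13,165 MB and 4,389 MB, within a megabyte of the figures quoted in the article. The total still has to be compared against yarn.scheduler.maximum-allocation-mb, which this sketch does not check.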

Hive Parameters

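hive.auto.convert.join.noconditionaltask.size is the size threshold below which Hive converts a join to a map join. It defaults to 10 MB in Hive on MR. When moving to Spark, increase it to 100 to 200 MB, or more if memory permits. Hive on Spark compares the threshold against the estimated in-memory size of the table, which is usually larger than the on-disk size that Hive on MR uses, so the same table needs a higher threshold to qualify.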
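The property takes its value in bytes, so a target expressed in megabytes has to be converted before it goes into a SET statement. A minimal sketch, assuming 1 MB = 1024 * 1024 bytes and using a hypothetical helper name `join_threshold_setting`:

```python
def join_threshold_setting(size_mb):
    """Build a Hive SET statement for the map-join threshold.

    Assumes 1 MB = 1024 * 1024 bytes; the helper is illustrative only.
    """
    size_bytes = size_mb * 1024 * 1024
    return f"SET hive.auto.convert.join.noconditionaltask.size={size_bytes};"


print(join_threshold_setting(200))
```

The resulting statement can be run at the start of a Hive session, or the same value can be placed in the cluster's Hive configuration.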

To merge small files, set hive.merge.sparkfiles to true (the Spark equivalent of hive.merge.mapredfiles on MR). The thresholds hive.merge.smallfiles.avgsize and hive.merge.size.per.task remain unchanged.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
