Hive Performance Tuning: Parallel Execution, Strict Mode, JVM Reuse, and Speculative Execution
This article explains Hive performance tuning techniques, including enabling parallel execution, configuring strict mode to prevent risky queries, reusing JVMs to reduce overhead, and using speculative execution to mitigate slow tasks, with configuration examples and practical considerations.
Below is a review of Hive performance tuning techniques, covering parallel execution, strict mode, JVM reuse, and speculative execution.
Parallel Execution
Enable task parallelism with the following settings:
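-- Enable parallel execution of tasks
set hive.exec.parallel=true;
-- Maximum degree of parallelism allowed for a single SQL statement; the default is 8
set hive.exec.parallel.thread.number=16;
Parallel execution helps only when the cluster has idle resources. On a busy cluster there is no spare capacity for the extra concurrent work, so it provides little advantage.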
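As an illustration of where this setting can matter, the sketch below shows a query whose two branches do not depend on each other, so Hive may be able to run their stages at the same time. The table names `orders_2023` and `orders_2024` are hypothetical, and whether the stages actually overlap depends on the plan Hive generates and on free cluster capacity.

```sql
-- Hypothetical tables: orders_2023 and orders_2024 are placeholders.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;

-- Each branch of the UNION ALL aggregates a different table, so neither
-- branch needs the other's output before it can start.
SELECT t.src, t.cnt
FROM (
  SELECT 'orders_2023' AS src, count(*) AS cnt FROM orders_2023
  UNION ALL
  SELECT 'orders_2024' AS src, count(*) AS cnt FROM orders_2024
) t;
```

Running `EXPLAIN` on the query lists its stage dependencies, which is one way to check whether there are independent stages to parallelize at all.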
Strict Mode
Hive provides a strict mode to prevent execution of high‑risk queries. Set hive.mapred.mode to strict to enable it.
<property>
<name>hive.mapred.mode</name>
<value>strict</value>
<description>
The mode in which the Hive operations are being performed.
In strict mode, some risky queries are not allowed to run. They include:
Cartesian Product.
No partition being picked up for a query.
Comparing bigints and strings.
Comparing bigints and doubles.
Orderby without limit.
</description>
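</property>
For partitioned tables, a query is rejected unless its WHERE clause contains a partition filter, which prevents an accidental scan of every partition.
Queries using ORDER BY must include a LIMIT clause. ORDER BY sends all rows through a single reducer to produce a total order, so an unbounded sort can keep that reducer running for a very long time.
Cartesian product queries are disallowed because Hive cannot optimize them the way relational databases do.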
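The three restrictions can be sketched with a pair of queries each: one that strict mode is expected to reject and a rewritten form that should pass. The tables here are hypothetical: `page_views` is assumed to be partitioned by a `dt` column, and `users` is assumed to be unpartitioned. The exact error text varies by Hive version, so none is shown.

```sql
set hive.mapred.mode=strict;

-- Expected to be rejected: partitioned table queried with no partition filter
SELECT user_id, url FROM page_views;

-- Expected to pass: the WHERE clause filters on the partition column
SELECT user_id, url FROM page_views WHERE dt = '2024-01-01';

-- Expected to be rejected: ORDER BY without LIMIT
SELECT user_id, url FROM page_views WHERE dt = '2024-01-01' ORDER BY user_id;

-- Expected to pass: the sort is bounded by LIMIT
SELECT user_id, url FROM page_views WHERE dt = '2024-01-01' ORDER BY user_id LIMIT 100;

-- Expected to be rejected: a join with no ON condition is a Cartesian product
SELECT v.url, u.name FROM page_views v JOIN users u WHERE v.dt = '2024-01-01';

-- Expected to pass: the join condition is stated in ON
SELECT v.url, u.name
FROM page_views v JOIN users u ON v.user_id = u.user_id
WHERE v.dt = '2024-01-01';
```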
JVM Reuse
Reusing JVM instances reduces the overhead of launching a new JVM for each map or reduce task, which is especially useful for jobs with many short‑lived tasks.
<property>
<name>mapreduce.job.jvm.numtasks</name>
<value>10</value>
<description>How many tasks to run per jvm. If set to -1, there is
no limit.
</description>
</property>
In a Hive session you can set the same limit using the older name of this property:
set mapred.job.reuse.jvm.num.tasks=10;
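Note that with JVM reuse enabled, the task slots a job has claimed stay reserved until the whole job finishes. If the job is unbalanced and a few reducers run much longer than the others, the remaining reserved slots sit idle and cannot be used by other jobs in the meantime.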
Speculative Execution
Speculative execution launches duplicate tasks for slow‑running map or reduce tasks, using the result of the task that finishes first.
Enable it in Hadoop’s mapred-site.xml:
<property>
<name>mapreduce.map.speculative</name>
<value>true</value>
<description>If true, then multiple instances of some map tasks
may be executed in parallel.</description>
</property>
<property>
<name>mapreduce.reduce.speculative</name>
<value>true</value>
<description>If true, then multiple instances of some reduce tasks
may be executed in parallel.</description>
</property>
Hive also provides its own setting:
<property>
<name>hive.mapred.reduce.tasks.speculative.execution</name>
<value>true</value>
<description>Whether speculative execution for reducers should be turned on. </description>
</property>
Whether to enable speculative execution depends on workload characteristics. Duplicate tasks consume extra cluster resources, so it can be disabled for latency‑sensitive jobs and enabled for large‑scale jobs where straggling tasks would otherwise delay completion.
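Because the choice depends on the workload, it can be convenient to override these properties for a single Hive session rather than cluster‑wide. The following is a minimal sketch, assuming the cluster configuration does not mark these properties as final (in which case session‑level overrides would be ignored):

```sql
-- Turn speculative execution off for the jobs launched from this session only.
set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;
set hive.mapred.reduce.tasks.speculative.execution=false;

-- Issuing "set" with just a property name prints its current value,
-- which is a quick way to confirm the override took effect.
set mapreduce.map.speculative;
```

Setting the values back to `true` in the same session restores the behavior for subsequent queries.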
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
