Big Data 11 min read

Hive SQL Optimization Techniques and Best Practices

This article provides a comprehensive guide to Hive SQL performance tuning, covering optimization goals, common pitfalls, execution flow, table and job settings, map, shuffle, reduce, and query-level improvements such as join, bucket join, group‑by, and count‑distinct optimizations.

Big Data Technology & Architecture

Apr 24, 2019

Hive SQL Optimization Techniques and Best Practices

Hive Optimization Goals

Achieve higher execution efficiency with limited resources.

Common Issues

Data skew

Map task count configuration

Reduce task count configuration

Other performance bottlenecks

Hive Execution Process

HQL → Job → Map/Reduce

Use explain [extended] hql to view the execution plan.

Example query: select col, count(1) from test2 group by col; Explain example: explain select col, count(1) from test2 group by col; Hive Table Optimization

Partitioning

Dynamic partition: set hive.exec.dynamic.partition=true; Dynamic mode: set hive.exec.dynamic.partition.mode=nonstrict; Bucketing

Enable bucketing: set hive.enforce.bucketing=true; Enable sorting: set hive.enforce.sorting=true; Data layout: keep similar data together.

Hive Job Optimization

Parallel execution

Enable parallelism: set hive.exec.parallel=true; Thread count: set hive.exec.parallel.thread.numbe=8; Local mode execution

Input size must be hive.exec.mode.local.auto.inputbytes.max (default 128 MB).

Map tasks must be hive.exec.mode.local.auto.tasks.max (default 4).

Reduce tasks should be 0 or 1.

Enable local mode: set hive.exec.mode.local.auto=true; Small file merging

Input merge:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

Output merge: set hive.merge.smallfiles.avgsize=256000000; and set hive.merge.size.per.task=64000000; JVM reuse

Reuse JVMs: set mapred.job.reuse.jvm.num.tasks=20; Data compression

Compress final output: set hive.exec.compress.output=true; Output codec:

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

Compression type: set mapred.output.compression.type=BLOCK; Intermediate compression: set hive.exec.compress.intermediate=true; Intermediate codec:

set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Hive Map Optimization

Map task count calculation (default, goal, split size, etc.)

Formula: compute_map_num = min(split_num, max(default_num, goal_num)) Adjustments:

Increase mapred.min.split.size to reduce map count.

Set a larger mapred.map.tasks for more maps.

Map-side aggregation: set hive.map.aggr=true; Speculative execution: mapred.map.tasks.apeculative.execution Hive Shuffle Optimization

Map side parameters: io.sort.mb, io.sort.spill.percent, min.num.spill.for.combine, io.sort.factor, io.sort.record.percent Reduce side parameters: mapred.reduce.parallel.copies, mapred.reduce.copy.backoff, io.sort.factor, mapred.job.shuffle.input.buffer.percent Hive Reduce Optimization

Reduce tasks needed for queries with GROUP BY, JOIN, DISTRIBUTE BY, etc.

Speculative execution: mapred.reduce.tasks.speculative.execution and hive.mapred.reduce.tasks.speculative.execution Reducer count formula: numRTasks = min[maxReducers, input.size/perReducer] maxReducers = hive.exec.reducers.max (default 999)

perReducer = hive.exec.reducers.bytes.per.reducer (default 1 GB)

Direct setting: set mapred.reduce.tasks=10; Hive Query Optimization

Join optimization

Enable map‑join when one table is small: set hive.auto.current.join=true; Map‑join hint: /*+mapjoin(A)*/ Skew join handling: set hive.optimize.skewjoin=true; and set hive.skewjoin.key=100000; Bucket join: tables must be bucketed on the join key with bucket counts in a multiple relationship.

Example DDL:

create table order(cid int,price float) clustered by(cid) into 32 buckets;

Example query: select price from order t join customer s on t.cid=s.id; Join rewrite example (filter push‑down):

Before:

select m.cid,u.id from order m join customer u on m.cid=u.id where m.dt='2013-12-12';

After:

select m.cid,u.id from (select cid from order where dt='2013-12-12') m join customer u on m.cid=u.id;

Group‑by optimization

Enable skew handling: set hive.groupby.skewindata=true; Adjust map‑aggregation interval: set hive.groupby.mapaggr.checkinterval=100000; COUNT DISTINCT optimization

Rewrite as

select count(1) from (select distinct id from tablename) tmp;

or use a UNION‑ALL approach for multiple distinct columns.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Optimization Big Data SQL Hive Hadoop

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.