Big Data 19 min read

Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, fine‑tuning join operations, and optimizing MapReduce parameters such as mapper/reducer counts, file merging, compression, JVM reuse, parallel execution, strict mode, and storage formats.

Big Data Technology & Architecture

Jun 24, 2019

Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning

Column Pruning and Partition Pruning

Read only the required columns and partitions to avoid full table scans. Example query:

select uid,event_type,record_data
from calendar_record_log
where pt_date >= 20190201 and pt_date <= 20190224
and status = 0;

Enable with hive.optimize.cp and hive.optimize.pruner (both true by default).

Predicate Pushdown

Push WHERE predicates as early as possible to reduce data volume. Example:

select a.uid,a.event_type,b.topic_id,b.title
from calendar_record_log a
left outer join (
  select uid,topic_id,title
  from forum_topic
  where pt_date = 20190224 and length(content) >= 100
) b on a.uid = b.uid
where a.pt_date = 20190224 and status = 0;

Controlled by hive.optimize.ppd (default true) and the PredicatePushDown optimizer.

Replace ORDER BY with SORT BY

Use SORT BY (optionally with DISTRIBUTE BY) to allow multiple reducers and avoid a single reducer bottleneck.

select uid,upload_time,event_type,record_data
from calendar_record_log
where pt_date >= 20190201 and pt_date <= 20190224
distribute by uid
sort by upload_time desc,event_type desc;

Use GROUP BY Instead of DISTINCT

For large datasets, rewrite COUNT(DISTINCT …) as a GROUP BY query, aware that it may launch two MR jobs.

select count(1) from (
  select uid from calendar_record_log
  where pt_date >= 20190101
  group by uid
) t;

To aggregate multiple columns with GROUP BY, use UNION ALL tricks as shown in the source.

Group By Configuration Adjustments

Map‑side pre‑aggregation : enable with hive.map.aggr (default true) and tune hive.groupby.mapaggr.checkinterval.

Skew handling : enable with hive.groupby.skewindata (default false) to split a skewed GROUP BY into two jobs.

Join Basic Optimizations

Build table (small table) placement : ensure the small table appears before the large table so Hive can load it into memory.

select a.event_type,a.event_code,a.event_desc,b.upload_time
from calendar_event_code a
inner join (
  select event_type,upload_time from calendar_record_log
  where pt_date = 20190225
) b on a.event_type = b.event_type;

Multiple joins with same key : Hive can combine them into a single MR job; different keys cause separate jobs (handled by CorrelationOptimizer).

Map join : enable with hive.auto.convert.join (default true) and tune hive.mapjoin.smalltable.filesize and hive.mapjoin.cache.numrows.

Bucket map join : for bucketed tables, use hive.optimize.bucketmapjoin.

Skew join : enable with hive.optimize.skewjoin (default false) and configure thresholds via hive.skewjoin.key and hive.skewjoin.mapjoin.map.tasks.

Handling Join Skew in SQL

Filter out null or meaningless keys, or randomize them to disperse skew. Example:

select a.uid,a.event_type,b.nickname,b.age
from (
  select (case when uid is null then cast(rand()*-10240 as int) else uid end) as uid,
         event_type
  from calendar_record_log
  where pt_date >= 20190201
) a
left outer join (
  select uid,nickname,age from user_info where status = 4
) b on a.uid = b.uid;

Other techniques include isolating skewed keys into temporary tables and casting data types to align join keys.

MapReduce Optimizations

Adjust mapper count : control via mapred.map.tasks, mapred.min.split.size, mapred.max.split.size, and block size calculations.

Adjust reducer count : set mapred.reduce.tasks directly or let Hive infer using hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max.

Merge small files : set input format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat and configure hive.merge.mapfiles, hive.merge.mapredfiles, hive.merge.size.per.task, etc.

Enable compression : set hive.exec.compress.intermediate and hive.exec.compress.output to true, choose codec org.apache.hadoop.io.compress.SnappyCodec, and select compression type (BLOCK or RECORD).

JVM reuse : set mapred.job.reuse.jvm.num.tasks to reuse a JVM across multiple tasks.

Parallel Execution and Local Mode

Enable parallel job execution with hive.exec.parallel (true) and tune thread count via hive.exec.parallel.thread.number. Activate local mode for tiny workloads by setting hive.exec.mode.local.auto to true and respecting input size and task limits.

Strict Mode

Enforce safety by setting hive.mapred.mode to strict, which blocks queries that lack partition filters, produce Cartesian joins, or use ORDER BY without LIMIT.

Choosing Appropriate Storage Formats

Specify storage format in CREATE TABLE using STORED AS. Common formats: TextFile, SequenceFile, RCFile, Avro, ORC, Parquet. Parquet and ORC are columnar and preferred for OLAP workloads; TextFile is used for simple staging.

Conclusion

The guide covers many Hive performance knobs; readers are encouraged to test and adapt them to their own workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Performance Tuning Hive sql-optimization MapReduce join optimization

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.