Big Data 13 min read

How to Optimize Compute Resource Governance in Data Warehouses with Spark & Hive

This article walks through practical steps for governing compute resources in a data warehouse, covering problem identification, strategic thinking, Spark and Hive tuning, small‑file handling, DQC improvement, high‑consumption task optimization, scheduling adjustments, and measurable performance gains.

Data Thinking Notes

Mar 22, 2023

How to Optimize Compute Resource Governance in Data Warehouses with Spark & Hive

Preface

The article starts from the practice of compute resource governance, aiming to clearly explain how to conduct such governance and apply it to other projects.

Problem Occurrence

Before governance (Dec 2022) the team identified several issues:

30+ high‑consumption tasks caused by data skew and extensive resource usage.

Over 2 million small files leading to excessive reads and performance loss.

Inefficient scheduling with most jobs running 2‑5 am, causing CPU saturation.

Invalid DQC (data quality monitoring) jobs and undersized monitoring resources.

Redundant or unused tasks reducing resource reuse.

Lack of tuning parameters and continued use of older MapReduce/Spark 2 engines.

Thinking and Action

The team prioritized actions by evaluating impact, cost, and difficulty, deciding to start with simple parameter tuning and engine migration, then address small files, DQC, high‑consumption tasks, scheduling, and finally decommission unused models and metrics.

Governance Actions

(1) Switch most tasks to Spark 3 and add tuning parameters

Enable Spark 3’s Adaptive Query Execution (AQE) and Z‑Order sorting to improve performance.

(2) Small‑file governance

Use the internal data‑governance platform to list tables with many small files and apply dynamic partition refresh with Spark 3, which can automatically merge small files.

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table xxx.xxx partition (ds)
select column, ds
from xxx.xxx;

For tables with few or no partitions, rebuild the table and refresh data.

Prevention measures:

Use Spark 3 engine to auto‑merge small files.

Reduce the number of Reduce tasks via parameters.

Apply Distribute By Rand to control data distribution.

Add small‑file merge parameters.

Process source‑extracted tables to prevent downstream small‑file accumulation.

(3) DQC governance

Identify and deactivate invalid DQC jobs, then increase DQC resources (Driver 2048 MB, 2 Executors, each 4096 MB) to cut runtime from >10 min to ~3 min.

(4) High‑consumption task tuning

Benchmark core tasks, ensure each task’s consumption stays below the average, and apply optimizations such as splitting overly joined tables, reducing resource allocations, adding Distribute By Rand, pruning unused columns/partitions, replacing Distinct with Group By, using sub‑queries, and leveraging Map Join.

(5) Scheduling optimization

Analyze ~600 tasks running 2‑5 am, separate core and non‑core jobs, shift non‑core jobs to off‑peak periods, and advance core job schedules to improve resource utilization.

(6) Task sinking and decommissioning

Move chimney‑model metrics into DWS for reuse and retire unused tasks after checking metadata lineage to avoid breaking downstream reports.

Governance Effects

Upgraded 137 Hive/Spark 2 tasks to Spark 3.1, achieving 43% faster execution, 41% lower CPU, and 46% lower memory usage.

Reduced small‑file count from 2.16 M to 0.67 M across 30+ tables.

Decommissioned 50+ invalid DQC jobs, cutting runtime to under 3 min.

Optimized 20+ tasks and retired 10+, saving 146 min total runtime, 8 M CPU seconds, and 26 M MB memory.

Rescheduled night‑time jobs, lowering 2‑5 am resource usage from >90% to ~50%.

Conclusion

Compute resource governance aims to reduce cost and increase efficiency, enabling more tasks with limited resources while standardizing development practices and fostering technical growth within the team.

Governance should be a continuous practice, with regular scans and scoring to provide actionable guidance for every task.

Parameter Details

Hive

set hive.auto.convert.join=true;

set hive.map.aggr=true;

set hive.groupby.skewindata=true;

set hive.merge.mapfiles=true;

set mapreduce.map.memory.mb=4096;

set mapreduce.reduce.memory.mb=4096;

set hive.exec.dynamic.partition.mode=nonstrict;

Spark

set spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY;

set spark.sql.adaptive.enabled=true;

set spark.sql.hive.convertInsertingPartitionedTable=false;

set spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=2048M;

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Hive Spark Compute Governance

Written by

Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.