How to Optimize Compute Resource Governance in Data Warehouses with Spark & Hive
This article walks through practical steps for governing compute resources in a data warehouse, covering problem identification, strategic thinking, Spark and Hive tuning, small‑file handling, DQC improvement, high‑consumption task optimization, scheduling adjustments, and measurable performance gains.
Preface
The article starts from the practice of compute resource governance, aiming to clearly explain how to conduct such governance and apply it to other projects.
Problem Occurrence
Before governance (Dec 2022) the team identified several issues:
30+ high‑consumption tasks caused by data skew and extensive resource usage.
Over 2 million small files leading to excessive reads and performance loss.
Inefficient scheduling with most jobs running 2‑5 am, causing CPU saturation.
Invalid DQC (data quality monitoring) jobs and undersized monitoring resources.
Redundant or unused tasks reducing resource reuse.
Lack of tuning parameters and continued use of older MapReduce/Spark 2 engines.
Thinking and Action
The team prioritized actions by evaluating impact, cost, and difficulty, deciding to start with simple parameter tuning and engine migration, then address small files, DQC, high‑consumption tasks, scheduling, and finally decommission unused models and metrics.
Governance Actions
(1) Switch most tasks to Spark 3 and add tuning parameters
Enable Spark 3’s Adaptive Query Execution (AQE) and Z‑Order sorting to improve performance.
(2) Small‑file governance
Use the internal data‑governance platform to list tables with many small files and apply dynamic partition refresh with Spark 3, which can automatically merge small files.
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table xxx.xxx partition (ds)
select column, ds
from xxx.xxx;For tables with few or no partitions, rebuild the table and refresh data.
Prevention measures:
Use Spark 3 engine to auto‑merge small files.
Reduce the number of Reduce tasks via parameters.
Apply Distribute By Rand to control data distribution.
Add small‑file merge parameters.
Process source‑extracted tables to prevent downstream small‑file accumulation.
(3) DQC governance
Identify and deactivate invalid DQC jobs, then increase DQC resources (Driver 2048 MB, 2 Executors, each 4096 MB) to cut runtime from >10 min to ~3 min.
(4) High‑consumption task tuning
Benchmark core tasks, ensure each task’s consumption stays below the average, and apply optimizations such as splitting overly joined tables, reducing resource allocations, adding Distribute By Rand, pruning unused columns/partitions, replacing Distinct with Group By, using sub‑queries, and leveraging Map Join.
(5) Scheduling optimization
Analyze ~600 tasks running 2‑5 am, separate core and non‑core jobs, shift non‑core jobs to off‑peak periods, and advance core job schedules to improve resource utilization.
(6) Task sinking and decommissioning
Move chimney‑model metrics into DWS for reuse and retire unused tasks after checking metadata lineage to avoid breaking downstream reports.
Governance Effects
Upgraded 137 Hive/Spark 2 tasks to Spark 3.1, achieving 43% faster execution, 41% lower CPU, and 46% lower memory usage.
Reduced small‑file count from 2.16 M to 0.67 M across 30+ tables.
Decommissioned 50+ invalid DQC jobs, cutting runtime to under 3 min.
Optimized 20+ tasks and retired 10+, saving 146 min total runtime, 8 M CPU seconds, and 26 M MB memory.
Rescheduled night‑time jobs, lowering 2‑5 am resource usage from >90% to ~50%.
Conclusion
Compute resource governance aims to reduce cost and increase efficiency, enabling more tasks with limited resources while standardizing development practices and fostering technical growth within the team.
Governance should be a continuous practice, with regular scans and scoring to provide actionable guidance for every task.
Parameter Details
Hive
set hive.auto.convert.join=true;
set hive.map.aggr=true;
set hive.groupby.skewindata=true;
set hive.merge.mapfiles=true;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set hive.exec.dynamic.partition.mode=nonstrict;
Spark
set spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY;
set spark.sql.adaptive.enabled=true;
set spark.sql.hive.convertInsertingPartitionedTable=false;
set spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=2048M;
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
