Big Data 9 min read

HiveSQL Classic Optimization Cases: Partitioning, Subset Decomposition, and Percentile Approximation Improvements

This article presents three HiveSQL optimization case studies—restructuring a large‑scale query with partitioned tables, breaking a complex window‑function query into smaller subsets with joins, and refactoring excessive PERCENTILE_APPROX calls—demonstrating how each change reduces execution time from hours to minutes and improves overall performance.

Big Data Technology & Architecture

Sep 24, 2020

HiveSQL Classic Optimization Cases: Partitioning, Subset Decomposition, and Percentile Approximation Improvements

HiveSQL Classic Optimization Case 1 : The original query scans over 770 million rows and takes more than an hour. By analyzing the grouping fields (dt, strategy, ab_group, source) the author creates a new external table partitioned by these columns, inserts the data, and runs the same aggregation, reducing runtime to about five minutes.

SELECT dt as DATA_DATE, STRATEGY, AB_GROUP, SOURCE,
       count(distinct case when lower(event) not like '%push%' and event!='corner_mark_show' then udid else null end) as DAU,
       count(case when event='client_show' then 1 else null end) as TOTAL_VSHOW,
       ...
FROM dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl
GROUP BY dt, strategy, ab_group, source;

Partition count analysis shows 2 016 partitions (dt=1, source=2, strategy=14, ab_group=72), which is acceptable.

CREATE EXTERNAL TABLE dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym(
  event STRING,
  udid STRING,
  vid STRING,
  duration STRING,
  body_source STRING,
  play_cnts STRING)
PARTITIONED BY (dt STRING, source STRING, strategy STRING, ab_group STRING);

INSERT INTO dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym
PARTITION(dt,source,strategy,ab_group)
SELECT event,udid,vid,duration,body_source,play_cnts,dt,source,strategy,ab_group
FROM dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;

Result verification confirms data consistency, and the new table query finishes in about five minutes.

HiveSQL Classic Optimization Case 2 : A complex SQL with many window functions fails after long execution. The solution splits the workload into three intermediate tables (zyt1, zyt2, zyt3) using simpler aggregations and then joins them to produce the final result, dramatically improving performance.

-- Create subset tables
CREATE TABLE bi_tmp.zyt1 PARTITIONED BY(event) AS
SELECT udid, MIN(time_local) AS first_time, MAX(time_local) AS last_time, event
FROM dwb_v8sp.event_column_info_new_hour
WHERE event IN ('app_start','app_exit','effective_play','share_succ','like','unlike','like_comment','unlike_comment','comment_success')
  AND dt BETWEEN '2019-03-01' AND '2019-03-03'
GROUP BY udid, event;

CREATE TABLE bi_tmp.zyt2 PARTITIONED BY(dt,event) AS
SELECT udid, COUNT(time_local) AS event_count_per_day, SUM(duration) AS event_duration_each_day, dt, event
FROM dwb_v8sp.event_column_info_new_hour
WHERE ...
GROUP BY udid, dt, event;

CREATE TABLE bi_tmp.zyt3 PARTITIONED BY(dt,event) AS
SELECT aid, imei, idfa, udid, duration, time_local, hour, source, dt, event
FROM dwb_v8sp.event_column_info_new_hour
WHERE ...;

-- Assemble final table
CREATE TABLE bi_tmp.aloha_UserLoyalty_190301_190303 AS
SELECT t3.aid, t3.imei, t3.idfa, t3.udid, t3.event, t3.duration, t3.dt, t3.time_local, t3.hour, t3.source,
       t1.first_time, t1.last_time, t2.event_count_per_day, t2.event_duration_each_day
FROM bi_tmp.zyt1 t1
JOIN bi_tmp.zyt2 t2 ON t1.event=t2.event AND t1.udid=t2.udid
JOIN bi_tmp.zyt3 t3 ON t2.dt=t3.dt AND t2.event=t3.event AND t2.udid=t3.udid;

Validation shows the row counts match the original query, confirming correctness.

HiveSQL Classic Optimization Case 3 : The original query heavily uses the PERCENTILE_APPROX function, causing severe slowdown. By recognizing that the function accepts array arguments, the author rewrites the query to compute percentiles in a single pass, achieving more than a four‑fold speedup.

-- Original (slow) query snippet
SELECT PERCENTILE_APPROX(metric1, 0.5) AS p50_metric1,
       PERCENTILE_APPROX(metric2, 0.5) AS p50_metric2,
       ...
FROM large_table;

-- Optimized (fast) query snippet
SELECT PERCENTILE_APPROX(array(metric1, metric2, ...), 0.5) AS p50_all
FROM large_table;

Performance tests before and after optimization are illustrated with screenshots, confirming the improvement.

The article concludes with a copyright notice and a call for readers to like, share, and follow the "import_bigdata" WeChat public account.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Optimization Big Data Hive partitioning SQL Performance HiveSQL

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.