10 Advanced Hive SQL Use Cases: Windows, Skew, JSON, and More
This article presents ten practical Hive SQL scenarios—including window functions for ranking, LAG for time‑interval analysis, random‑salt techniques to mitigate data skew, dynamic partition writes, JSON parsing with UDFs, retention calculations, consecutive‑login detection, regex‑based path analysis, CUBE multi‑dimensional aggregation, and ORC storage optimizations—each accompanied by optimization tips and complete code examples.
Case 1: User Behavior Ranking (Window Functions)
Optimization points:
Handling ties: use RANK() (leaves gaps after tied rows) or DENSE_RANK() (no gaps) when tied rows should share a rank; ROW_NUMBER() breaks ties arbitrarily.
Performance: add dt to the PARTITION BY clause to rank within each day and keep each per-window sort small.
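To see how the ranking functions differ on ties, here is a small sketch over the same user_behavior table; the tied visit counts in the comments are hypothetical:

```sql
-- visit_count:  30, 20, 20, 10
-- ROW_NUMBER(): 1,  2,  3,  4   (ties broken arbitrarily)
-- RANK():       1,  2,  2,  4   (gap after the tie)
-- DENSE_RANK(): 1,  2,  2,  3   (no gap)
SELECT
user_id,
visit_count,
RANK() OVER (ORDER BY visit_count DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY visit_count DESC) AS dense_rnk
FROM (
SELECT user_id, COUNT(*) AS visit_count
FROM user_behavior
GROUP BY user_id
) t;
```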
SELECT
user_id,
dt,
visit_count,
RANK() OVER (PARTITION BY user_id ORDER BY visit_count DESC) AS rnk -- 'rank' is a reserved word in Hive
FROM (
SELECT
user_id,
dt,
COUNT(*) AS visit_count
FROM user_behavior
WHERE dt BETWEEN '2025-04-01' AND '2025-04-07'
GROUP BY user_id, dt
) t;
Case 2: Time‑Interval Analysis (LAG Function)
Optimization points:
NULL handling: use COALESCE to replace the NULL that LAG(login_time) returns for a user's first login.
Date validation: ensure login_time is of DATE or TIMESTAMP type (DATEDIFF also accepts strings, but only in 'yyyy-MM-dd' format).
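The same pattern also works forward in time; a sketch using LEAD on the same user_login_log table to measure the gap until the next login:

```sql
SELECT
user_id,
login_time,
LEAD(login_time) OVER (PARTITION BY user_id ORDER BY login_time) AS next_login,
-- NULL on each user's last login, by symmetry with the LAG case
DATEDIFF(LEAD(login_time) OVER (PARTITION BY user_id ORDER BY login_time), login_time) AS days_to_next
FROM user_login_log;
```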
SELECT
user_id,
login_time,
COALESCE(LAG(login_time) OVER (PARTITION BY user_id ORDER BY login_time), login_time) AS prev_login,
DATEDIFF(login_time, LAG(login_time) OVER (PARTITION BY user_id ORDER BY login_time)) AS days_diff
FROM user_login_log;
Case 3: Data Skew Optimization (Random Salt Method)
Supplementary notes:
Salt quantity: RAND()*10 generates ten possible salts; adjust the count to the distribution of the skewed key.
Parameter tuning: setting hive.optimize.skewjoin=true additionally lets Hive handle skewed join keys automatically.
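Before salting, it is worth confirming which keys are actually hot; a minimal probe over the same big_table:

```sql
-- The handful of keys that dominate this output are the ones worth salting.
SELECT skew_key, COUNT(*) AS cnt
FROM big_table
GROUP BY skew_key
ORDER BY cnt DESC
LIMIT 10;
```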
SET hive.optimize.skewjoin=true;
-- Salt the big table's key randomly, then replicate the other side
-- once per salt value so every salted key can find its match.
SELECT /*+ MAPJOIN(b) */
a.order_id,
b.user_id
FROM (
SELECT
order_id,
CONCAT(skew_key, '_', CAST(RAND() * 10 AS INT)) AS skew_key_salt
FROM big_table
) a
JOIN (
SELECT
user_id,
CONCAT(skew_key, '_', s.salt) AS skew_key_salt
FROM skewed_table
LATERAL VIEW EXPLODE(SPLIT('0,1,2,3,4,5,6,7,8,9', ',')) s AS salt
) b
ON a.skew_key_salt = b.skew_key_salt;
Case 4: Dynamic Partition Write
Common error fixes:
Partition field check: ensure the partition column dt appears last in the SELECT list, as dynamic partition columns must come last.
Partition type conversion: if the source timestamp is a string, convert it with DATE_FORMAT before inserting.
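Static and dynamic partitions can also be mixed in one statement, with the static ones listed first; a sketch assuming a hypothetical variant of dwd_user_log that is additionally partitioned by country:

```sql
-- country is fixed for the whole insert; dt is resolved per row.
INSERT INTO TABLE dwd_user_log PARTITION (country = 'CN', dt)
SELECT
user_id,
action_type,
FROM_UNIXTIME(event_time) AS event_time,
DATE_FORMAT(FROM_UNIXTIME(event_time), 'yyyy-MM-dd') AS dt
FROM ods_raw_log;
```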
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE dwd_user_log PARTITION(dt)
SELECT
user_id,
action_type,
FROM_UNIXTIME(event_time) AS event_time,
DATE_FORMAT(FROM_UNIXTIME(event_time), 'yyyy-MM-dd') AS dt
FROM ods_raw_log;
Case 5: JSON Data Parsing (UDF Application)
Error correction:
Function name: replace the incorrect GET_JSON_JSON_OBJECT with the proper GET_JSON_OBJECT.
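When several keys are needed from the same JSON string, JSON_TUPLE parses it once instead of once per GET_JSON_OBJECT call; note it only reads top‑level keys, so nested paths like $.device.model still need GET_JSON_OBJECT on the extracted string:

```sql
SELECT
u.user_id,
GET_JSON_OBJECT(j.device, '$.model') AS device_model,
GET_JSON_OBJECT(j.location, '$.city') AS city
FROM user_profiles u
LATERAL VIEW JSON_TUPLE(u.extra_info, 'device', 'location') j AS device, location;
```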
SELECT
user_id,
GET_JSON_OBJECT(extra_info, '$.device.model') AS device_model,
GET_JSON_OBJECT(extra_info, '$.location.city') AS city
FROM user_profiles;
Case 6: User Retention Rate Calculation
Logic enhancements:
Multi‑day retention: the query can be extended to compute N‑day retention (e.g., 3‑day, 7‑day).
Date range verification: ensure the source table user_behavior contains sufficient historical data.
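The multi‑day extension mentioned above can be written by joining the full activity history once and bucketing the day offset with CASE; a sketch on the same tables:

```sql
WITH first_day AS (
SELECT user_id, MIN(dt) AS first_dt
FROM user_behavior
GROUP BY user_id
)
SELECT
a.first_dt,
COUNT(DISTINCT a.user_id) AS new_users,
COUNT(DISTINCT CASE WHEN b.dt = DATE_ADD(a.first_dt, 1) THEN b.user_id END) AS d1_retained,
COUNT(DISTINCT CASE WHEN b.dt = DATE_ADD(a.first_dt, 3) THEN b.user_id END) AS d3_retained,
COUNT(DISTINCT CASE WHEN b.dt = DATE_ADD(a.first_dt, 7) THEN b.user_id END) AS d7_retained
FROM first_day a
LEFT JOIN user_behavior b ON a.user_id = b.user_id
GROUP BY a.first_dt;
```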
WITH first_day AS (
SELECT
user_id,
MIN(dt) AS first_dt
FROM user_behavior
GROUP BY user_id
)
SELECT
a.first_dt,
COUNT(DISTINCT a.user_id) AS new_users,
COUNT(DISTINCT b.user_id) AS retained_users,
ROUND(COUNT(DISTINCT b.user_id) / COUNT(DISTINCT a.user_id) * 100, 2) AS retention_rate
FROM first_day a
LEFT JOIN user_behavior b
ON a.user_id = b.user_id
AND b.dt = DATE_ADD(a.first_dt, 7)
GROUP BY a.first_dt;
Case 7: Consecutive Login Flag
Principle explanation:
Grouping logic: rows where login_date minus row_number yields the same anchor date belong to one continuous date sequence.
Consecutive days calculation: the derived column consecutive_days gives the length of that continuous streak.
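A worked trace of the date‑minus‑row‑number trick, with hypothetical login dates:

```sql
-- login_date   seq   DATE_SUB(login_date, seq)
-- 2025-04-01    1    2025-03-31   } same anchor date:
-- 2025-04-02    2    2025-03-31   } one 3-day streak
-- 2025-04-03    3    2025-03-31   }
-- 2025-04-05    4    2025-04-01     new anchor date: streak resets
```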
SELECT
user_id,
login_date,
CASE WHEN consecutive_days >= 5 THEN 1 ELSE 0 END AS is_continuous
FROM (
SELECT
user_id,
login_date,
COUNT(*) OVER (PARTITION BY user_id, DATE_SUB(login_date, seq)) AS consecutive_days
FROM (
SELECT
user_id,
login_date,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS seq
FROM login_records
) t
) t2;
Case 8: Path Analysis (Regular Expressions)
Regex optimizations:
Path normalization: apply LOWER(path) to make the analysis case‑insensitive.
Path filtering: exclude empty or invalid paths before aggregation.
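The host can be pulled out the same way as a separate dimension; a sketch on the same click_logs table:

```sql
SELECT
REGEXP_EXTRACT(url, '^https?://([^/]+)', 1) AS host, -- capture group 1 = domain
LOWER(REGEXP_EXTRACT(url, '^https?://[^/]+/([^?#]*)', 1)) AS path
FROM click_logs
WHERE dt = '2025-04-15';
```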
SELECT
path,
COUNT(*) AS cnt
FROM (
SELECT
LOWER(REGEXP_EXTRACT(url, '^https?://[^/]+/([^?#]*)', 1)) AS path
FROM click_logs
WHERE dt = '2025-04-15'
) t
WHERE path IS NOT NULL AND path != '' -- the alias cannot be referenced in the subquery's own WHERE clause
GROUP BY path
ORDER BY cnt DESC
LIMIT 10;
Case 9: Multi‑Dimensional Aggregation (CUBE)
Result interpretation: CUBE generates all possible combinations of the listed dimensions (gender, age_group, city), including single‑dimension totals, two‑dimension subtotals, the full three‑dimension cross‑tab, and a grand total where all dimensions are NULL.
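To tell a subtotal's NULL apart from a genuinely NULL dimension value, Hive exposes the GROUPING__ID virtual column alongside CUBE (its bit encoding changed in Hive 2.3, so check the version in use):

```sql
SELECT
gender,
age_group,
city,
GROUPING__ID AS grp_id, -- bitmask telling which dimensions are aggregated away
SUM(amount) AS total_amount
FROM user_orders
GROUP BY gender, age_group, city WITH CUBE;
```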
SELECT
gender,
age_group,
city,
SUM(amount) AS total_amount
FROM user_orders
GROUP BY gender, age_group, city WITH CUBE; -- Hive's documented CUBE syntax
Case 10: Storage Optimization (ORC + Bloom Filter)
Parameter supplement:
Stripe size: orc.stripe.size=67108864 (64 MB, the default).
Bloom filter usage: accelerates point‑filter queries such as WHERE user_id = 'A'.
CREATE TABLE optimized_orders (
order_id STRING,
user_id STRING,
amount DOUBLE
) STORED AS ORC
TBLPROPERTIES (
'orc.bloom.filter.columns'='user_id',
'orc.compress'='SNAPPY',
'orc.row.index.stride'='1000' -- finest allowed stride for point lookups; values below 1000 disable the row index (default 10000)
);
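The bloom filter only pays off when ORC predicate pushdown is enabled; a sketch of the point lookup it accelerates:

```sql
SET hive.optimize.index.filter=true; -- let Hive use ORC row-group indexes and bloom filters
-- Stripes and row groups whose bloom filter rules out 'A' are skipped entirely.
SELECT order_id, amount
FROM optimized_orders
WHERE user_id = 'A';
```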
SET hive.merge.mapredfiles=true; -- enable merging of reduce-side output files (MR engine)
SET hive.merge.smallfiles.avgsize=16777216; -- merge when average output file size is below 16 MB
INSERT INTO optimized_orders
SELECT * FROM raw_orders;
Big Data Tech Team