10 Advanced Hive SQL Use Cases: Windows, Skew, JSON, and More
This article presents ten practical Hive SQL scenarios—including window functions for ranking, LAG for time‑interval analysis, random‑salt techniques to mitigate data skew, dynamic partition writes, JSON parsing with UDFs, retention calculations, consecutive‑login detection, regex‑based path analysis, CUBE multi‑dimensional aggregation, and ORC storage optimizations—each accompanied by optimization tips and complete code examples.
Case 1: User Behavior Ranking (Window Functions)
Optimization points:
Handling ties: use RANK() (leaves gaps after tied rows) or DENSE_RANK() (no gaps) when tied rows should share a rank; ROW_NUMBER() breaks ties arbitrarily.
Performance: add dt to the PARTITION BY clause to rank within each day and keep each per-window sort small.
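To see how the ranking functions differ on ties, here is a small sketch over the same user_behavior table; the tied visit counts in the comments are hypothetical:

```sql
-- visit_count:  30, 20, 20, 10
-- ROW_NUMBER(): 1,  2,  3,  4   (ties broken arbitrarily)
-- RANK():       1,  2,  2,  4   (gap after the tie)
-- DENSE_RANK(): 1,  2,  2,  3   (no gap)
SELECT
user_id,
visit_count,
RANK() OVER (ORDER BY visit_count DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY visit_count DESC) AS dense_rnk
FROM (
SELECT user_id, COUNT(*) AS visit_count
FROM user_behavior
GROUP BY user_id
) t;
```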
SELECT
user_id,
dt,
visit_count,
RANK() OVER (PARTITION BY user_id ORDER BY visit_count DESC) AS rnk -- 'rank' is a reserved word in Hive
FROM (
SELECT
user_id,
dt,
COUNT(*) AS visit_count
FROM user_behavior
WHERE dt BETWEEN '2025-04-01' AND '2025-04-07'
GROUP BY user_id, dt
) t;
Case 2: Time‑Interval Analysis (LAG Function)
Optimization points:
NULL handling: use COALESCE to replace the NULL that LAG(login_time) returns for a user's first login.
Date validation: ensure login_time is of DATE or TIMESTAMP type (DATEDIFF also accepts strings, but only in 'yyyy-MM-dd' format).
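The same pattern also works forward in time; a sketch using LEAD on the same user_login_log table to measure the gap until the next login:

```sql
SELECT
user_id,
login_time,
LEAD(login_time) OVER (PARTITION BY user_id ORDER BY login_time) AS next_login,
-- NULL on each user's last login, by symmetry with the LAG case
DATEDIFF(LEAD(login_time) OVER (PARTITION BY user_id ORDER BY login_time), login_time) AS days_to_next
FROM user_login_log;
```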
SELECT
user_id,
login_time,
COALESCE(LAG(login_time) OVER (PARTITION BY user_id ORDER BY login_time), login_time) AS prev_login,
DATEDIFF(login_time, LAG(login_time) OVER (PARTITION BY user_id ORDER BY login_time)) AS days_diff
FROM user_login_log;
Case 3: Data Skew Optimization (Random Salt Method)
Supplementary notes:
Salt quantity: RAND()*10 generates ten possible salts; adjust the count to the distribution of the skewed key.
Parameter tuning: setting hive.optimize.skewjoin=true additionally lets Hive handle skewed join keys automatically.
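Before salting, it is worth confirming which keys are actually hot; a minimal probe over the same big_table:

```sql
-- The handful of keys that dominate this output are the ones worth salting.
SELECT skew_key, COUNT(*) AS cnt
FROM big_table
GROUP BY skew_key
ORDER BY cnt DESC
LIMIT 10;
```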
SET hive.optimize.skewjoin=true;
-- Salt the big table's key randomly, then replicate the other side
-- once per salt value so every salted key can find its match.
SELECT /*+ MAPJOIN(b) */
a.order_id,
b.user_id
FROM (
SELECT
order_id,
CONCAT(skew_key, '_', CAST(RAND() * 10 AS INT)) AS skew_key_salt
FROM big_table
) a
JOIN (
SELECT
user_id,
CONCAT(skew_key, '_', s.salt) AS skew_key_salt
FROM skewed_table
LATERAL VIEW EXPLODE(SPLIT('0,1,2,3,4,5,6,7,8,9', ',')) s AS salt
) b
ON a.skew_key_salt = b.skew_key_salt;
Case 4: Dynamic Partition Write
Common error fixes:
Partition field check: ensure the partition column dt appears last in the SELECT list, as dynamic partition columns must come last.
Partition type conversion: if the source timestamp is a string, convert it with DATE_FORMAT before inserting.
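Static and dynamic partitions can also be mixed in one statement, with the static ones listed first; a sketch assuming a hypothetical variant of dwd_user_log that is additionally partitioned by country:

```sql
-- country is fixed for the whole insert; dt is resolved per row.
INSERT INTO TABLE dwd_user_log PARTITION (country = 'CN', dt)
SELECT
user_id,
action_type,
FROM_UNIXTIME(event_time) AS event_time,
DATE_FORMAT(FROM_UNIXTIME(event_time), 'yyyy-MM-dd') AS dt
FROM ods_raw_log;
```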
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE dwd_user_log PARTITION(dt)
SELECT
user_id,
action_type,
FROM_UNIXTIME(event_time) AS event_time,
DATE_FORMAT(FROM_UNIXTIME(event_time), 'yyyy-MM-dd') AS dt
FROM ods_raw_log;
Case 5: JSON Data Parsing (UDF Application)
Error correction:
Function name: replace the incorrect GET_JSON_JSON_OBJECT with the proper GET_JSON_OBJECT.
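When several keys are needed from the same JSON string, JSON_TUPLE parses it once instead of once per GET_JSON_OBJECT call; note it only reads top‑level keys, so nested paths like $.device.model still need GET_JSON_OBJECT on the extracted string:

```sql
SELECT
u.user_id,
GET_JSON_OBJECT(j.device, '$.model') AS device_model,
GET_JSON_OBJECT(j.location, '$.city') AS city
FROM user_profiles u
LATERAL VIEW JSON_TUPLE(u.extra_info, 'device', 'location') j AS device, location;
```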
SELECT
user_id,
GET_JSON_OBJECT(extra_info, '$.device.model') AS device_model,
GET_JSON_OBJECT(extra_info, '$.location.city') AS city
FROM user_profiles;
Case 6: User Retention Rate Calculation
Logic enhancements:
Multi‑day retention: the query can be extended to compute N‑day retention (e.g., 3‑day, 7‑day).
Date range verification: ensure the source table user_behavior contains sufficient historical data.
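The multi‑day extension mentioned above can be written by joining the full activity history once and bucketing the day offset with CASE; a sketch on the same tables:

```sql
WITH first_day AS (
SELECT user_id, MIN(dt) AS first_dt
FROM user_behavior
GROUP BY user_id
)
SELECT
a.first_dt,
COUNT(DISTINCT a.user_id) AS new_users,
COUNT(DISTINCT CASE WHEN b.dt = DATE_ADD(a.first_dt, 1) THEN b.user_id END) AS d1_retained,
COUNT(DISTINCT CASE WHEN b.dt = DATE_ADD(a.first_dt, 3) THEN b.user_id END) AS d3_retained,
COUNT(DISTINCT CASE WHEN b.dt = DATE_ADD(a.first_dt, 7) THEN b.user_id END) AS d7_retained
FROM first_day a
LEFT JOIN user_behavior b ON a.user_id = b.user_id
GROUP BY a.first_dt;
```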
WITH first_day AS (
SELECT
user_id,
MIN(dt) AS first_dt
FROM user_behavior
GROUP BY user_id
)
SELECT
a.first_dt,
COUNT(DISTINCT a.user_id) AS new_users,
COUNT(DISTINCT b.user_id) AS retained_users,
ROUND(COUNT(DISTINCT b.user_id) / COUNT(DISTINCT a.user_id) * 100, 2) AS retention_rate
FROM first_day a
LEFT JOIN user_behavior b
ON a.user_id = b.user_id
AND b.dt = DATE_ADD(a.first_dt, 7)
GROUP BY a.first_dt;
Case 7: Consecutive Login Flag
Principle explanation:
Grouping logic: rows where login_date minus row_number yields the same anchor date belong to one continuous date sequence.
Consecutive days calculation: the derived column consecutive_days gives the length of that continuous streak.
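A worked trace of the date‑minus‑row‑number trick, with hypothetical login dates:

```sql
-- login_date   seq   DATE_SUB(login_date, seq)
-- 2025-04-01    1    2025-03-31   } same anchor date:
-- 2025-04-02    2    2025-03-31   } one 3-day streak
-- 2025-04-03    3    2025-03-31   }
-- 2025-04-05    4    2025-04-01     new anchor date: streak resets
```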
SELECT
user_id,
login_date,
CASE WHEN consecutive_days >= 5 THEN 1 ELSE 0 END AS is_continuous
FROM (
SELECT
user_id,
login_date,
COUNT(*) OVER (PARTITION BY user_id, DATE_SUB(login_date, seq)) AS consecutive_days
FROM (
SELECT
user_id,
login_date,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS seq
FROM login_records
) t
) t2;
Case 8: Path Analysis (Regular Expressions)
Regex optimizations:
Path normalization: apply LOWER(path) to make the analysis case‑insensitive.
Path filtering: exclude empty or invalid paths before aggregation.
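The host can be pulled out the same way as a separate dimension; a sketch on the same click_logs table:

```sql
SELECT
REGEXP_EXTRACT(url, '^https?://([^/]+)', 1) AS host, -- capture group 1 = domain
LOWER(REGEXP_EXTRACT(url, '^https?://[^/]+/([^?#]*)', 1)) AS path
FROM click_logs
WHERE dt = '2025-04-15';
```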
SELECT
path,
COUNT(*) AS cnt
FROM (
SELECT
LOWER(REGEXP_EXTRACT(url, '^https?://[^/]+/([^?#]*)', 1)) AS path
FROM click_logs
WHERE dt = '2025-04-15'
) t
WHERE path IS NOT NULL AND path != '' -- the alias cannot be referenced in the subquery's own WHERE clause
GROUP BY path
ORDER BY cnt DESC
LIMIT 10;
Case 9: Multi‑Dimensional Aggregation (CUBE)
Result interpretation: CUBE generates all possible combinations of the listed dimensions (gender, age_group, city), including single‑dimension totals, two‑dimension subtotals, the full three‑dimension cross‑tab, and a grand total where all dimensions are NULL.
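To tell a subtotal's NULL apart from a genuinely NULL dimension value, Hive exposes the GROUPING__ID virtual column alongside CUBE (its bit encoding changed in Hive 2.3, so check the version in use):

```sql
SELECT
gender,
age_group,
city,
GROUPING__ID AS grp_id, -- bitmask telling which dimensions are aggregated away
SUM(amount) AS total_amount
FROM user_orders
GROUP BY gender, age_group, city WITH CUBE;
```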
SELECT
gender,
age_group,
city,
SUM(amount) AS total_amount
FROM user_orders
GROUP BY gender, age_group, city WITH CUBE; -- Hive's documented CUBE syntax
Case 10: Storage Optimization (ORC + Bloom Filter)
Parameter supplement:
Stripe size: orc.stripe.size=67108864 (64 MB, the default).
Bloom filter usage: accelerates point‑filter queries such as WHERE user_id = 'A'.
CREATE TABLE optimized_orders (
order_id STRING,
user_id STRING,
amount DOUBLE
) STORED AS ORC
TBLPROPERTIES (
'orc.bloom.filter.columns'='user_id',
'orc.compress'='SNAPPY',
'orc.row.index.stride'='1000' -- finest allowed stride for point lookups; values below 1000 disable the row index (default 10000)
);
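The bloom filter only pays off when ORC predicate pushdown is enabled; a sketch of the point lookup it accelerates:

```sql
SET hive.optimize.index.filter=true; -- let Hive use ORC row-group indexes and bloom filters
-- Stripes and row groups whose bloom filter rules out 'A' are skipped entirely.
SELECT order_id, amount
FROM optimized_orders
WHERE user_id = 'A';
```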
SET hive.merge.mapredfiles=true; -- enable merging of reduce-side output files (MR engine)
SET hive.merge.smallfiles.avgsize=16777216; -- merge when average output file size is below 16 MB
INSERT INTO optimized_orders
SELECT * FROM raw_orders;
Big Data Tech Team