Big Data 10 min read

Implementing Data Cubes in Hive Using WITH CUBE, GROUPING SETS, and WITH ROLLUP

This article demonstrates how to build multi‑dimensional data cubes on JD's big‑data platform using Hive, comparing UNION ALL with the more concise WITH CUBE, GROUPING SETS, and WITH ROLLUP functions, and discusses practical pitfalls and optimization tips.

JD Tech
JD Tech
JD Tech
Implementing Data Cubes in Hive Using WITH CUBE, GROUPING SETS, and WITH ROLLUP

Based on JD Group's big‑data platform, the article explains how to implement data cubes with Hive, covering general multi‑dimensional batch analysis techniques, performance tuning, and production deployment.

It first describes the data‑cube concept, noting that a naïve UNION ALL approach can lead to redundant code, while Hive's high‑order functions WITH CUBE , GROUPING SETS , and WITH ROLLUP achieve the same result with simpler, maintainable queries.

Sample data is prepared by creating a table and inserting example rows:

CREATE TABLE tmp.tmp_hivecube_test (
  `province` string COMMENT '省份',
  `city` string COMMENT '城市',
  `population` int COMMENT '人口数量'
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS ORC tblproperties (
  'orc.compress' = 'SNAPPY'
);
INSERT INTO tmp.tmp_hivecube_test SELECT '北京市','大兴区',260 UNION ALL SELECT '北京市','通州区',300;

Using UNION ALL, the data cube is built by manually enumerating all dimension combinations:

SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city UNION ALL
SELECT province, NULL AS city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province UNION ALL
SELECT NULL AS province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY city UNION ALL
SELECT NULL AS province, NULL AS city, SUM(population) FROM tmp.tmp_hivecube_test;

The same result can be obtained with a single WITH CUBE statement:

SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city WITH CUBE;

When the required dimensions are known, GROUPING SETS allows flexible configuration of specific aggregates:

SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city GROUPING SETS ((province), (province, city));

For hierarchical aggregations, WITH ROLLUP generates cumulative totals from the most detailed level up to the grand total:

SELECT province, city, SUM(population) FROM tmp.tmp_hivecube_test GROUP BY province, city WITH ROLLUP;

The article also highlights a common pitfall: Hive's built‑in grouping_id defaults to 0, causing filters like grouping_id IN (1,3) to return no rows. Disabling predicate push‑down with SET hive.optimize.ppd = false; resolves the issue.

In summary, the author recommends using WITH CUBE with grouping_id for uncertain dimension combinations, GROUPING SETS when dimensions are fixed to save storage, and WITH ROLLUP for clear hierarchical aggregations.

big dataSQLHiverollupData CubeGROUPING SETSWITH CUBE
JD Tech
Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.