Big Data 19 min read

Mastering HiveCube: Efficient Multi‑Dimensional Aggregation with Grouping Sets

This article explains how HiveCube can replace traditional development for multi‑dimensional aggregation in a data‑warehouse, covering background, theory of cube, with‑cube/rollup/grouping‑sets syntax, grouping_id handling, practical implementation tips, performance tuning, and a comparison with conventional methods.

Youzan Coder

Jul 1, 2020

Mastering HiveCube: Efficient Multi‑Dimensional Aggregation with Grouping Sets

1. Introduction

Multi‑dimensional analysis is a common downstream application of data‑warehouse systems. The underlying data is stored in cubes that contain aggregated results at various granularities, supporting roll‑up and drill‑down operations. This article focuses on the challenges and experiences of replacing a traditional development model with HiveCube for multi‑dimensional aggregation.

2. Background

In the first half of the year, a project was launched to provide Youzan merchants with a self‑service data extraction feature. Merchants need to preview and download metrics across many dimensions and granularities (store, product, channel, hourly, daily, weekly, monthly, etc.). Although Kylin was initially considered, limited machine resources led to the adoption of HiveCube for more flexible and cost‑effective cube generation.

3. Theory

A cube is a multi‑dimensional data model built on facts and dimensions. HiveCube can generate cubes using three methods: WITH CUBE, WITH ROLLUP, and GROUPING SETS, all of which are supported by SparkSQL.

3.1 WITH CUBE

The WITH CUBE syntax produces aggregation results for every combination of the listed dimensions.

-- with cube syntax
select dim1, dim2, count(*)
from t1
group by dim1, dim2 with cube;

-- equivalent regular syntax
select dim1, dim2, count(*) from t1 group by dim1, dim2
union all
select dim1, null, count(*) from t1 group by dim1, null
union all
select null, dim2, count(*) from t1 group by null, dim2
union all
select null, null, count(*) from t1;

3.2 WITH ROLLUP

The WITH ROLLUP syntax is order‑sensitive and returns aggregation rows only for the first grouping column hierarchy.

-- with rollup syntax
select dim1, dim2, count(*)
from t1
group by dim1, dim2 with rollup;

-- equivalent regular syntax
select dim1, dim2, count(*) from t1 group by dim1, dim2
union all
select dim1, null, count(*) from t1 group by dim1, null
union all
select null, null, count(*) from t1;

3.3 GROUPING SETS

The most flexible method, allowing explicit configuration of required aggregation column combinations.

select dim1, dim2, count(*)
from t1
group by dim1, dim2 grouping sets ((dim1), (dim1, dim2));

Grouping sets can replace the previous two syntaxes:

GROUP BY a, b, c WITH CUBE
is equivalent to
GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ());

GROUP BY a, b, c WITH ROLLUP
is equivalent to
GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ());

3.4 GROUPING__ID Function

The grouping__id function tags each aggregation level with a unique identifier, enabling easy retrieval of specific granularity data from a cube.

Example data and query:

select dim1, dim2, count(*), grouping__id
from t1
group by dim1, dim2 grouping sets ((dim1, dim2), (dim1), (dim2), ());

The result shows grouping__id values 0, 1, 2, 3 for the four granularities, allowing queries such as WHERE grouping__id = 0 to fetch the (dim1, dim2) aggregation.

3.5 GROUPING__ID Algorithm

The function returns a bit vector indicating the presence of each column in the aggregation. Columns used in the GROUP BY are marked with ‘0’, unused columns with ‘1’. The order of columns determines the bit positions (higher‑order bits for columns nearer the start of the GROUP BY list). The binary vector is then converted to a decimal identifier.

Since Hive 1.0, the grouping__id algorithm matches Spark’s implementation. Before Hive 1.0, differences could cause downstream failures when tasks fall back from Spark to MapReduce, so the method is rarely used in production.

4. Practice

4.1 Implementing a Custom group_id

To avoid platform‑specific differences in grouping__id, a custom group_id can be built by concatenating dimension values that are not NULL.

concat_ws(':',
    case when dim1 is not null then 'dim1' else null end,
    case when dim2 is not null then 'dim2' else null end) as group_id

Querying the (dim1, dim2) aggregation then uses WHERE group_id = 'dim1:dim2', eliminating cross‑platform inconsistencies.

4.2 Integrating Summary Tables with Cube

Example grouping sets for store‑product‑day and store‑product‑channel‑day summaries:

grouping sets (
    (shop_id, goods_id, placed_order_date),
    (shop_id, goods_id, log_src_channel, placed_order_date)
)

Corresponding queries:

-- store‑product‑day
from tmp_cube
where group_id = 'shop_id:goods_id:placed_order_date';

-- store‑product‑channel‑day
from tmp_cube
where group_id = 'shop_id:goods_id:log_src_channel:placed_order_date';

4.3 Cube Data Compression

To reduce storage, time dimensions can be collapsed into flags (e.g., is_placed_1day, is_placed_30day). The cube then produces only the needed rows, dramatically shrinking data volume.

select
    shop_id,
    goods_id,
    case when datediff('${DP_1_DAYS_AGO_Y_m_d}', substr(placed_order_time,1,10)) <= 0
              and datediff('${DP_0_DAYS_AGO_Y_m_d}', substr(placed_order_time,1,10)) > 0 then 1 else 0 end as is_placed_1day,
    ...
from tmp_table
where to_date(placed_order_time) >= '${DP_FIRST_DAY_PRE_MONTH_Y_m_d}'
  and to_date(placed_order_time) < '${DP_0_DAYS_AGO_Y_m_d}'
group by ...
grouping sets (
    (shop_id, goods_id, is_placed_1day, placed_order_hour),
    (shop_id, goods_id, is_placed_1day),
    (shop_id, goods_id, is_placed_weekly),
    (shop_id, goods_id, is_placed_monthly)
);

4.4 Re‑processing (刷数) in Cube

For large‑scale historical back‑fills, copy the production Cube code, adapt it to write results to a temporary table, validate, then insert into downstream summary tables via dynamic partitioning. Cube is a fast aggregation engine, not a data‑retention store.

4.5 HiveCube Parameters

The default maximum number of grouping combinations is 30. It can be increased, e.g., to 100, by setting:

set hive.new.job.grouping.set.cardinality = 100;

4.6 Performance

Benchmark results show HiveCube’s execution time under various data volumes and operator configurations (image omitted for brevity).

4.7 HiveCube vs. Traditional Methods

Comparison across six dimensions:

Code development efficiency

Release convenience

Maintainability

Single‑point failure impact

Re‑processing cost

Resource consumption

Overall, HiveCube offers significant development productivity gains, though it introduces a larger failure surface that can be mitigated with monitoring and stress‑testing.

4.8 Scaling HiveCube

When the number of grouping dimensions grows to hundreds, a single Cube may become untenable. Splitting the Cube into multiple tasks, each handling a subset of grouping sets, distributes load and keeps execution time reasonable.

5. Outlook

Future plans include standardizing the public summary layer of the data‑warehouse using HiveCube, consolidating aggregation logic across business domains, and allowing data‑platform engineers to focus on star‑schema design rather than repetitive cube code.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance Big Data SQL Data Warehouse Hive Cube Grouping Sets

Written by

Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.