How Hash Cluster Tables Slash Shuffle Costs in MaxCompute Pipelines
This article explains how building hash cluster tables in MaxCompute can compress pre‑sorted data, enable shuffle removal, and dramatically reduce execution time and resource consumption for conversion attribution tasks.
Overview
Ant Group's conversion attribution task initially ran for over two hours. A series of optimizations, most notably creating hash cluster tables and manually intervening to remove shuffles, accounted for a large portion of the improvement.
Purpose of Hash Cluster Tables
- Store pre-sorted data compactly: rows are bucket-sorted on the cluster key, and columns with many duplicate values compress especially well once sorted.
- Enable downstream shuffle removal: when a downstream operator distributes data by the same bucket-key strategy, the expensive shuffle step can be eliminated, improving performance.
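As a sketch of the second point, assuming both sides of a join have already been clustered into the same number of buckets on `user_id` (the table names here are illustrative, not from the actual pipeline):

```sql
-- Both tables are hash clustered on user_id into the same bucket count,
-- so the optimizer can join matching buckets directly, without a shuffle.
-- access_log_clustered and click_log_clustered are hypothetical names.
SELECT a.user_id, a.log_id, b.log_id AS click_log_id
FROM   access_log_clustered a
JOIN   click_log_clustered  b
ON     a.user_id = b.user_id;
```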
Introduction
The conversion attribution task processes three source tables: access logs (A), click logs (B), and event data (C), each partitioned into 4096 hash buckets. Data is grouped by user and processed through UDFs to produce per-user attribution results, which are then expanded and joined back with the event and log data.
Optimization Process
After merging multiple tasks into a single script, the click and access tables were created as hash cluster tables using the following definition:
```sql
CLUSTERED BY (user_id ASC)
SORTED BY (user_id ASC, log_id ASC)
INTO 4096 BUCKETS
```

The overall execution graph became much simpler, reducing both the number of tasks and the runtime. An attempt to hash-cluster the event table by writing it with `DISTRIBUTE BY user_id SORT BY scene_type, order_id` together with `set odps.sql.reducer.instances=4096;` showed no downstream effect, as this path is not yet supported.
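For context, a complete table definition following this pattern might look like the sketch below; the table and column names (other than `user_id` and `log_id`) are assumptions, not taken from the actual pipeline:

```sql
-- Illustrative DDL for a MaxCompute hash cluster table; names are hypothetical.
CREATE TABLE IF NOT EXISTS click_log_clustered
(
    user_id  BIGINT,
    log_id   BIGINT,
    click_ts DATETIME
)
CLUSTERED BY (user_id)
SORTED BY (user_id ASC, log_id ASC)
INTO 4096 BUCKETS;

-- Rewriting the source data through this table stores it bucketed and
-- sorted, which is what enables both compression and shuffle removal.
INSERT OVERWRITE TABLE click_log_clustered
SELECT user_id, log_id, click_ts
FROM   click_log_raw;
```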
Inserting event data into a cluster table added about three minutes to the pipeline but further streamlined execution, as illustrated in the final execution plan.
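This extra materialization step could be sketched as follows, assuming an event table bucketed the same way as the log tables (all names are illustrative):

```sql
-- One extra write (about three minutes in this pipeline), in exchange
-- for a shuffle-free downstream join. All names are hypothetical.
CREATE TABLE IF NOT EXISTS event_clustered
(
    user_id    BIGINT,
    order_id   BIGINT,
    scene_type STRING
)
CLUSTERED BY (user_id)
SORTED BY (user_id ASC)
INTO 4096 BUCKETS;

INSERT OVERWRITE TABLE event_clustered
SELECT user_id, order_id, scene_type
FROM   event_raw;
```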
Intermediate State
After consolidating the tasks, the two main tasks (m3 and m4) each ran 4096 instances with user-based bucket distribution, which in theory should have allowed shuffle removal. In practice, MaxCompute retained the shuffles because certain data properties were lost through the complex intermediate operations.
Final State
Further refactoring reduced nested SQL, eliminated unnecessary map‑joins, and replaced a small blacklist table with a fixed value, resulting in a single main task with no shuffles.
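The blacklist simplification might look like the sketch below; the ids and table names are placeholders, and in the real task the inlined values would come from the small, rarely changing blacklist:

```sql
-- Before (sketch): a map-join against a tiny blacklist table, e.g.
--   SELECT /*+ MAPJOIN(bl) */ t.* FROM logs t
--   LEFT ANTI JOIN blacklist bl ON t.user_id = bl.user_id;
-- After: the few blacklisted ids are inlined as literals,
-- removing one join stage from the plan entirely.
SELECT user_id, log_id
FROM   access_log_clustered
WHERE  user_id NOT IN (1001, 1002, 1003);
```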
Execution Result
The optimized pipeline runs in under 20 minutes, roughly half the previous duration, while consuming fewer compute units and less memory, and it delivers the conversion attribution results 20 minutes earlier.
Conclusion
When creating hash cluster tables, carefully choose bucket keys; not all join keys need to be bucketed, and storage compression should be considered.
Understanding MaxCompute’s optimization strategies requires practical testing beyond documentation.
MaxCompute already provides intelligent optimizations, and further automation would be beneficial.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.