How Hash Cluster Tables Slash Shuffle Costs in MaxCompute Pipelines
This article explains how building hash cluster tables in MaxCompute can compress pre‑sorted data, enable shuffle removal, and dramatically reduce execution time and resource consumption for conversion attribution tasks.
Overview
Ant Group's conversion attribution task initially ran for over two hours. A series of optimizations, most notably creating hash cluster tables and manually intervening to remove shuffles, accounted for a large portion of the improvement.
Purpose of Hash Cluster Tables
- Store pre-sorted data compactly: rows are bucket-sorted on the cluster key, and columns with many duplicate values compress especially well once sorted.
- Enable downstream shuffle removal: when a downstream operator distributes data by the same bucket-key strategy, the expensive shuffle step can be eliminated, improving performance.
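As a sketch of the second point, assuming both sides of a join have already been clustered into the same number of buckets on `user_id` (the table names here are illustrative, not from the actual pipeline):

```sql
-- Both tables are hash clustered on user_id into the same bucket count,
-- so the optimizer can join matching buckets directly, without a shuffle.
-- access_log_clustered and click_log_clustered are hypothetical names.
SELECT a.user_id, a.log_id, b.log_id AS click_log_id
FROM   access_log_clustered a
JOIN   click_log_clustered  b
ON     a.user_id = b.user_id;
```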
Introduction
The conversion attribution task processes three source tables: access logs (A), click logs (B), and event data (C), each partitioned into 4096 hash buckets. Data is grouped by user and processed through UDFs to produce per-user attribution results, which are then expanded and joined back with the event and log data.
Optimization Process
After merging multiple tasks into a single script, the click and access tables were created as hash cluster tables using the following definition:
```sql
CLUSTERED BY (user_id ASC)
SORTED BY (user_id ASC, log_id ASC)
INTO 4096 BUCKETS
```

The overall execution graph became much simpler, reducing both the number of tasks and the runtime. An attempt to hash-cluster the event table by writing it with `DISTRIBUTE BY user_id SORT BY scene_type, order_id` together with `set odps.sql.reducer.instances=4096;` showed no downstream effect, as this path is not yet supported.
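For context, a complete table definition following this pattern might look like the sketch below; the table and column names (other than `user_id` and `log_id`) are assumptions, not taken from the actual pipeline:

```sql
-- Illustrative DDL for a MaxCompute hash cluster table; names are hypothetical.
CREATE TABLE IF NOT EXISTS click_log_clustered
(
    user_id  BIGINT,
    log_id   BIGINT,
    click_ts DATETIME
)
CLUSTERED BY (user_id)
SORTED BY (user_id ASC, log_id ASC)
INTO 4096 BUCKETS;

-- Rewriting the source data through this table stores it bucketed and
-- sorted, which is what enables both compression and shuffle removal.
INSERT OVERWRITE TABLE click_log_clustered
SELECT user_id, log_id, click_ts
FROM   click_log_raw;
```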
Inserting event data into a cluster table added about three minutes to the pipeline but further streamlined execution, as illustrated in the final execution plan.
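This extra materialization step could be sketched as follows, assuming an event table bucketed the same way as the log tables (all names are illustrative):

```sql
-- One extra write (about three minutes in this pipeline), in exchange
-- for a shuffle-free downstream join. All names are hypothetical.
CREATE TABLE IF NOT EXISTS event_clustered
(
    user_id    BIGINT,
    order_id   BIGINT,
    scene_type STRING
)
CLUSTERED BY (user_id)
SORTED BY (user_id ASC)
INTO 4096 BUCKETS;

INSERT OVERWRITE TABLE event_clustered
SELECT user_id, order_id, scene_type
FROM   event_raw;
```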
Intermediate State
After consolidating the tasks, the two main tasks (m3 and m4) each ran 4096 instances with user-based bucket distribution, which in theory should have allowed shuffle removal. In practice, MaxCompute retained the shuffles because certain data properties were lost through the complex intermediate operations.
Final State
Further refactoring reduced nested SQL, eliminated unnecessary map‑joins, and replaced a small blacklist table with a fixed value, resulting in a single main task with no shuffles.
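The blacklist simplification might look like the sketch below; the ids and table names are placeholders, and in the real task the inlined values would come from the small, rarely changing blacklist:

```sql
-- Before (sketch): a map-join against a tiny blacklist table, e.g.
--   SELECT /*+ MAPJOIN(bl) */ t.* FROM logs t
--   LEFT ANTI JOIN blacklist bl ON t.user_id = bl.user_id;
-- After: the few blacklisted ids are inlined as literals,
-- removing one join stage from the plan entirely.
SELECT user_id, log_id
FROM   access_log_clustered
WHERE  user_id NOT IN (1001, 1002, 1003);
```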
Execution Result
The optimized pipeline runs in under 20 minutes, roughly half the previous duration, while consuming fewer compute units and less memory, and it delivers the conversion attribution results 20 minutes earlier.
Conclusion
When creating hash cluster tables, carefully choose bucket keys; not all join keys need to be bucketed, and storage compression should be considered.
Understanding MaxCompute’s optimization strategies requires practical testing beyond documentation.
MaxCompute already provides intelligent optimizations, and further automation would be beneficial.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.