MaxCompute’s Complex Type Overhaul Boosts Performance Beyond BigQuery
The article examines a real-world migration of a major Southeast Asian tech group from Google BigQuery to Alibaba Cloud MaxCompute, highlighting the challenges of complex data types, the columnar storage and execution engine redesigns, and the resulting performance gains that often surpass BigQuery.
This article, part of a series, follows the real migration of a leading Southeast Asian technology group (referred to as GoTerra) from Google BigQuery to Alibaba Cloud MaxCompute, and dissects the key challenges and technical innovations in handling complex data types.
Business Background and Pain Points
With the rapid growth of big data and AI, global data volume has reached hundreds of zettabytes (ZB), and semi-structured data (e.g., JSON) and nested data (e.g., protobuf) account for over 50% of it. Processing such data with traditional approaches incurs high storage costs and low compute efficiency. BigQuery and MaxCompute are both leading commercial data platforms, and efficient handling of complex types remains a point of competition between them.
During GoTerra's migration from BigQuery to MaxCompute, matching or surpassing BigQuery's functionality and performance for complex types was critical to migrating jobs smoothly and keeping resource costs down.
Technical Status
Three typical complex data types are introduced:
Array: a collection of same-type elements (e.g., array<bigint>), accessed by index, used for lists such as product catalogs.
Map: a key-value collection (e.g., map<bigint, string>), used for dynamic attributes.
Struct: a composite type with multiple fields (e.g., struct<id:bigint, value:string>), used for user profiles, order info, etc.
MaxCompute currently supports complex types via columnar storage (AliORC) and row‑based computation, but performance gaps remain compared to BigQuery.
Columnar Complex‑Type Computation Refactor
The optimization is split into two phases:
Operators are refined to reduce unnecessary copying and computation on row‑based complex structures.
Row results are transformed into a columnar layout and operators are adapted to columnar computation.
Key improvements include shallow‑copy optimizations for most operators (expressions, aggregates, joins, windows), achieving up to 100× speedups in some cases.
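The shallow-copy idea can be illustrated with a minimal Python sketch. It is not MaxCompute's implementation: the function names `project_deep` and `project_shallow` and the sample rows are hypothetical, and the sketch assumes downstream operators treat their inputs as immutable, which is what makes sharing references safe.

```python
import copy

def project_deep(rows, column):
    """Row-based projection that deep-copies every complex value."""
    return [copy.deepcopy(row[column]) for row in rows]

def project_shallow(rows, column):
    """Projection that forwards references instead of copying.
    Safe only if later operators never mutate their inputs (assumption)."""
    return [row[column] for row in rows]

# Hypothetical rows shaped like src(a bigint, b array<struct<c, d>>).
rows = [
    {"a": 1, "b": [{"c": 10, "d": "x"}, {"c": 20, "d": "test"}]},
    {"a": 2, "b": [{"c": 30, "d": "y"}]},
]

deep = project_deep(rows, "b")
shallow = project_shallow(rows, "b")

same_values = deep == shallow                # both projections agree on content
shallow_shares = shallow[0] is rows[0]["b"]  # no new nested objects were built
deep_shares = deep[0] is rows[0]["b"]        # the deep copy allocated new ones
```

Both projections return equal values; the difference is that the deep version rebuilds every nested struct, a cost that grows with the size of each array.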
Complex‑Type Columnar Memory Structure
After refactoring, complex‑type data no longer stores each row separately. Instead, a columnar (Arrow‑like) layout stores sub‑elements of a batch contiguously, reducing auxiliary metadata and improving memory access efficiency.
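As a rough illustration of such a layout, the sketch below flattens a batch of array<struct<c:bigint, d:string>> values into one offsets list plus one contiguous list per struct field, in the spirit of Arrow's list layout. The names `to_columnar` and `row_at` are hypothetical, and null handling (a validity bitmap in Arrow) is omitted for brevity.

```python
def to_columnar(batch):
    """Flatten a batch of array<struct<c, d>> rows into offsets plus
    one contiguous buffer per struct field."""
    offsets = [0]
    c_values, d_values = [], []
    for row in batch:
        for elem in row:
            c_values.append(elem["c"])
            d_values.append(elem["d"])
        # offsets[i + 1] marks where row i ends in the child buffers
        offsets.append(len(c_values))
    return {"offsets": offsets, "c": c_values, "d": d_values}

def row_at(col, i):
    """Rebuild row i from the columnar buffers."""
    start, end = col["offsets"][i], col["offsets"][i + 1]
    return [{"c": c, "d": d}
            for c, d in zip(col["c"][start:end], col["d"][start:end])]

batch = [
    [{"c": 10, "d": "x"}, {"c": 20, "d": "test"}],
    [],                                   # an empty array costs one offset entry
    [{"c": 30, "d": "y"}],
]
col = to_columnar(batch)
```

The whole batch needs one offsets list and two value buffers, rather than one container object per row and per struct, which is where the reduction in auxiliary metadata comes from.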
Operator Adaptation to Columnar Complex Types
Three execution modes are defined:
Data Pass-Through: data is read columnar, not modified, and written columnar, achieving zero-copy transfer.
Data Append: operators output results directly into columnar structures without further modification.
Data Modification: random updates are not suitable for columnar layout and fall back to row-based processing.
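The three modes can be sketched on a toy array<bigint> column. This is one reading of the description above, not the engine's code: `ListColumn` and the three functions are hypothetical names, and the modification example uses an insertion that changes a row's length, since that is a case where every later offset would have to shift.

```python
class ListColumn:
    """Minimal Arrow-like array<bigint> column: offsets plus flat values."""
    def __init__(self, offsets=None, values=None):
        self.offsets = offsets if offsets is not None else [0]
        self.values = values if values is not None else []

    def append_row(self, elems):
        self.values.extend(elems)
        self.offsets.append(len(self.values))

    def to_rows(self):
        return [self.values[s:e]
                for s, e in zip(self.offsets, self.offsets[1:])]

def pass_through(col):
    # Data pass-through: forward the same column object, copying nothing.
    return col

def append_doubled(col):
    # Data append: write each output row straight into a new columnar buffer.
    out = ListColumn()
    for s, e in zip(col.offsets, col.offsets[1:]):
        out.append_row(v * 2 for v in col.values[s:e])
    return out

def insert_element(col, row, idx, value):
    # Data modification: fall back to rows, update, then rebuild the column.
    rows = col.to_rows()
    rows[row].insert(idx, value)
    out = ListColumn()
    for r in rows:
        out.append_row(r)
    return out

col = ListColumn([0, 3, 3, 5], [1, 2, 3, 4, 5])
passed = pass_through(col)
doubled = append_doubled(col)
modified = insert_element(col, 1, 0, 99)
```

Pass-through and append never materialize per-row objects; only the modification path pays for the round trip through row form.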
Unnest‑with‑Subquery Framework Refactor
BigQuery's UNNEST(array) is used frequently. MaxCompute lacks native support for it, so such queries are translated to LATERAL VIEW + EXPLODE. With complex subqueries, this translation produces heavy plans with repeated table reads and redundant unnest operations. Consider the following example:
create table src(a bigint, b array<struct<c:bigint, d:string>>);
select
  (select max(c) from unnest(b)),
  a + 100,
  (select collect_list(d) from unnest(b) where d = 'test')
from src;
The original plan suffers from multiple source reads, repeated unnest, and excessive joins. The refactor introduces a new CorrelatedJoin operator and internal sub-plan structures to read the source once, push required columns to subqueries, merge identical unnest trees, and eliminate unnecessary shuffles.
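To make the intended semantics concrete, the following Python sketch evaluates the example query in a single pass: each source row is read once, b is unnested once, and both correlated subqueries consume the same elements. It only emulates the query's result and is not the CorrelatedJoin operator itself; the function name `evaluate` and the sample data are hypothetical, and returning an empty list from collect_list when no element matches is an assumption.

```python
def evaluate(src):
    """Emulate:
         select (select max(c) from unnest(b)),
                a + 100,
                (select collect_list(d) from unnest(b) where d = 'test')
         from src
    """
    results = []
    for row in src:                      # the source is read once
        elems = row["b"]                 # b is unnested once and shared
        cs = [e["c"] for e in elems]
        max_c = max(cs) if cs else None  # max over no rows yields NULL
        # Assumption: collect_list over no matching rows gives an empty list.
        tests = [e["d"] for e in elems if e["d"] == "test"]
        results.append((max_c, row["a"] + 100, tests))
    return results

# Hypothetical contents for src(a bigint, b array<struct<c:bigint, d:string>>).
src = [
    {"a": 1, "b": [{"c": 10, "d": "x"}, {"c": 20, "d": "test"}]},
    {"a": 2, "b": [{"c": 30, "d": "y"}]},
    {"a": 3, "b": []},
]
result = evaluate(src)
```

Because both subqueries are answered from the same per-row element list, nothing has to be joined back to the source row by key, which is the kind of join and shuffle the refactor is described as removing.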
SQL Execution Layer Optimizations
The physical plan is adapted to the new operators, implementing the CorrelatedJoin and internal sub‑plan execution framework to handle complex types efficiently.
Deployment Effects and Business Value
Performance case studies show:
Columnar complex‑type processing reduced a stage from >5 minutes to 31 seconds (≈10× speedup).
The unnest-with-subquery refactor improved overall job performance by more than 80%, with the optimized operators themselves running more than 3× faster.
GoTerra processes over 40,000 complex-type tables and more than 200,000 SQL jobs per day. After the optimizations, most SQL jobs see performance gains ranging from 20% to 10×, saving more than 2,000 CPU cores per day and enabling a smooth, cost-effective migration from BigQuery to MaxCompute.
Future work will extend columnar complex‑type optimizations across all MaxCompute scenarios, benefiting millions of jobs and further reducing compute costs.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
