
MaxCompute’s Complex Type Overhaul Boosts Performance Beyond BigQuery

The article examines a real-world migration of a major Southeast Asian tech group from Google BigQuery to Alibaba Cloud MaxCompute, highlighting the challenges of complex data types, the columnar storage and execution engine redesigns, and the resulting performance gains that often surpass BigQuery.

Alibaba Cloud Big Data AI Platform

This article, part of a series, follows the real migration of a leading Southeast Asian technology group (referred to here as GoTerra) from Google BigQuery to Alibaba Cloud MaxCompute, and dissects the key challenges and technical innovations in handling complex data types.

Business Background and Pain Points

With the rapid growth of big data and AI, global data volume has grown past hundreds of zettabytes (ZB), and semi‑structured data (e.g., JSON) and nested data (e.g., protobuf) account for more than 50% of it. Traditional processing of such data suffers from high storage costs and low compute efficiency. BigQuery and MaxCompute are both leading commercial data platforms, and handling complex types efficiently remains an area where they compete.

During GoTerra’s migration from BigQuery to MaxCompute, matching or surpassing BigQuery’s functionality and performance on complex types was critical, both for migrating jobs smoothly and for keeping resource costs down.

Technical Status

Three typical complex data types are introduced:

Array: a collection of same‑type elements (e.g., array<bigint>), accessed by index, used for lists such as product catalogs.

Map: a key‑value collection (e.g., map<bigint, string>), used for dynamic attributes.

Struct: a composite type with multiple fields (e.g., struct<id:bigint, value:string>), used for user profiles, order info, etc.

Before this work, MaxCompute stored complex types in columnar form (AliORC) but computed on them row by row, which left performance gaps compared to BigQuery.

Columnar Complex‑Type Computation Refactor

The optimization is split into two phases:

First, operators are refined to reduce unnecessary copying and computation on row‑based complex structures.

Second, row results are transformed into a columnar layout and operators are adapted to columnar computation.

Key improvements include shallow‑copy optimizations for most operators (expressions, aggregates, joins, windows), achieving up to 100× speedups in some cases.
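The idea behind the shallow‑copy optimization can be sketched in Python. This is an illustration only: project_deep and project_shallow are hypothetical names for a row‑based projection operator, not MaxCompute internals, and the sample rows are made up.

```python
import copy

def project_deep(rows, column):
    # Deep-copies every nested value, even though the operator never modifies it.
    return [copy.deepcopy(row[column]) for row in rows]

def project_shallow(rows, column):
    # Passes references to the existing nested values; no element is duplicated.
    return [row[column] for row in rows]

# Hypothetical rows shaped like (a bigint, b array<struct<c, d>>).
rows = [
    {"a": 1, "b": [{"c": 10, "d": "x"}, {"c": 20, "d": "test"}]},
    {"a": 2, "b": []},
]

deep = project_deep(rows, "b")
shallow = project_shallow(rows, "b")

# Both versions yield equal values; only the shallow one shares storage with the input.
print(deep == shallow)             # True
print(shallow[0] is rows[0]["b"])  # True
print(deep[0] is rows[0]["b"])     # False
```

Sharing is only safe when the operator does not mutate the nested value, which is why it suits read‑only paths such as expressions that merely forward a column.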

Complex‑Type Columnar Memory Structure

After refactoring, complex‑type data no longer stores each row separately. Instead, a columnar (Arrow‑like) layout stores sub‑elements of a batch contiguously, reducing auxiliary metadata and improving memory access efficiency.
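As a rough illustration of such a layout, the Python sketch below flattens a batch of array<bigint> rows into one contiguous values buffer plus per‑row offsets and a validity list, in the style of Arrow's list layout. The source does not describe MaxCompute's exact in‑memory format, so the function names and the three‑buffer shape are assumptions.

```python
def to_columnar(rows):
    """Flatten a batch of array<bigint> rows into (offsets, values, validity)."""
    offsets = [0]      # offsets[i]..offsets[i+1] delimits row i inside values
    values = []        # all elements of the batch, stored contiguously
    validity = []      # False marks a NULL row (as opposed to an empty array)
    for row in rows:
        if row is None:
            validity.append(False)
        else:
            validity.append(True)
            values.extend(row)
        offsets.append(len(values))
    return offsets, values, validity

def get_row(offsets, values, validity, i):
    """Reconstruct row i from the columnar buffers."""
    if not validity[i]:
        return None
    return values[offsets[i]:offsets[i + 1]]

batch = [[1, 2, 3], [], None, [4, 5]]
offsets, values, validity = to_columnar(batch)

print(offsets)   # [0, 3, 3, 3, 5]
print(values)    # [1, 2, 3, 4, 5]
print(validity)  # [True, True, False, True]
```

Per row, the only metadata left is one offset and one validity flag, and a scan over all elements touches a single contiguous buffer.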

Operator Adaptation to Columnar Complex Types

Three execution modes are defined:

Data Pass‑Through: data is read in columnar form, left unmodified, and written in columnar form, achieving zero‑copy transfer.

Data Append: operators write their results directly into columnar structures, which need no further modification.

Data Modification: random updates do not suit a columnar layout, so these cases fall back to row‑based processing.
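The three modes above can be sketched on a toy columnar array column. ArrayColumn, pass_through, and modify_via_rows are hypothetical names chosen for illustration; the sketch shows why appends fit the layout and why in‑place edits do not, and it is not the engine's actual implementation.

```python
class ArrayColumn:
    """Hypothetical columnar array<bigint> batch: shared values plus per-row offsets."""

    def __init__(self):
        self.offsets = [0]
        self.values = []

    def __len__(self):
        return len(self.offsets) - 1

    def row(self, i):
        return self.values[self.offsets[i]:self.offsets[i + 1]]

    def append(self, elements):
        # Data Append: a new row only grows the tail of both buffers.
        self.values.extend(elements)
        self.offsets.append(len(self.values))

def pass_through(column):
    # Data Pass-Through: hand the same buffers downstream; nothing is copied.
    return column

def modify_via_rows(column, i, new_elements):
    # Data Modification: changing the length of row i would shift every later
    # offset, so this sketch falls back to rows, edits them, and rebuilds.
    rows = [column.row(k) for k in range(len(column))]
    rows[i] = list(new_elements)
    rebuilt = ArrayColumn()
    for r in rows:
        rebuilt.append(r)
    return rebuilt

col = ArrayColumn()
col.append([1, 2])
col.append([3])

same = pass_through(col)
changed = modify_via_rows(col, 0, [9, 9, 9])

print(same is col)                      # True
print(changed.offsets, changed.values)  # [0, 3, 4] [9, 9, 9, 3]
```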

Unnest‑with‑Subquery Framework Refactor

BigQuery’s UNNEST(array) is used frequently. MaxCompute lacked native support for it, so such queries were translated to LATERAL VIEW + EXPLODE. With complex subqueries, this translation produces heavy plans that read the source table repeatedly and perform redundant unnest operations, as in the following example:

create table src(a bigint, b array<struct<c:bigint, d:string>>);
select
  (select max(c) from unnest(b)),
  a+100,
  (select collect_list(d) from unnest(b) where d='test')
from src;

The original plan suffers from multiple source reads, repeated unnest, and excessive joins. The refactor introduces a new CorrelatedJoin operator and internal sub‑plan structures to read the source once, push required columns to subqueries, merge identical unnest trees, and eliminate unnecessary shuffles.
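The difference between the two plan shapes can be modeled in Python for the sample query above. This is a simplified sketch, not the optimizer's output: naive_plan and correlated_plan are hypothetical names, the join step of the original plan is reduced to zipping results by row position, the naive version is assumed to scan the source once per output expression, and returning an empty list when no element matches d='test' is an assumption about collect_list.

```python
def naive_plan(src):
    """One source read and one unnest per subquery, then stitch results by row."""
    stats = {"source_reads": 0, "unnests": 0}

    # Subquery 1: select max(c) from unnest(b)
    stats["source_reads"] += 1
    max_c = []
    for _a, b in src:
        stats["unnests"] += 1
        cs = [e["c"] for e in b]
        max_c.append(max(cs) if cs else None)

    # Subquery 2: select collect_list(d) from unnest(b) where d='test'
    stats["source_reads"] += 1
    tests = []
    for _a, b in src:
        stats["unnests"] += 1
        tests.append([e["d"] for e in b if e["d"] == "test"])

    # Main query column: a+100
    stats["source_reads"] += 1
    plus = [a + 100 for a, _b in src]

    return list(zip(max_c, plus, tests)), stats

def correlated_plan(src):
    """Read the source once and unnest b once per row for both subqueries."""
    stats = {"source_reads": 1, "unnests": 0}
    out = []
    for a, b in src:
        stats["unnests"] += 1
        max_c, tests = None, []
        for e in b:
            if max_c is None or e["c"] > max_c:
                max_c = e["c"]
            if e["d"] == "test":
                tests.append(e["d"])
        out.append((max_c, a + 100, tests))
    return out, stats

# Made-up rows for src(a bigint, b array<struct<c:bigint, d:string>>).
src = [
    (1, [{"c": 10, "d": "x"}, {"c": 20, "d": "test"}]),
    (2, []),
]

naive_rows, naive_stats = naive_plan(src)
corr_rows, corr_stats = correlated_plan(src)

print(naive_rows == corr_rows)  # True
print(naive_stats)              # {'source_reads': 3, 'unnests': 4}
print(corr_stats)               # {'source_reads': 1, 'unnests': 2}
```

Both functions return the same rows; the single‑pass version does so with one scan and half the unnest work, and it needs no join to realign subquery results with their parent rows.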

SQL Execution Layer Optimizations

The physical plan is adapted to the new operators, implementing the CorrelatedJoin and internal sub‑plan execution framework to handle complex types efficiently.

Deployment Effects and Business Value

Performance case studies show:

Columnar complex‑type processing reduced a stage from >5 minutes to 31 seconds (≈10× speedup).

The unnest‑with‑subquery refactor improved overall job performance by more than 80%, with the optimized operators themselves running more than 3× faster.

GoTerra processes over 4 × 10⁴ complex‑type tables and more than 2 × 10⁵ SQL jobs per day. After the optimizations, most SQL jobs run between 20% and 10× faster, saving more than 2,000 CPU cores per day and enabling a smooth, cost‑effective migration from BigQuery to MaxCompute.

Future work will extend columnar complex‑type optimizations across all MaxCompute scenarios, benefiting millions of jobs and further reducing compute costs.
