Databases 14 min read

How AnalyticDB MySQL Achieves Elastic Scaling with Compute‑Storage Separation

This article explains how AnalyticDB MySQL’s elastic mode separates compute and storage to improve scalability, describes industry trends, details its architecture layers, presents performance test results on TPCH Q1, and outlines future enhancements for cloud‑native data warehousing.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How AnalyticDB MySQL Achieves Elastic Scaling with Compute‑Storage Separation

Background

With the rise of the cloud era, databases have embraced cloud-native architectures. Various database systems (OLTP, OLAP, NoSQL) run on public clouds such as AWS, Azure, and Alibaba Cloud, including open‑source MySQL, PostgreSQL, MongoDB, as well as proprietary offerings like SQL Server, Oracle, Aurora, Redshift, PolarDB, AnalyticDB, and AzureSQL. Some are merely hosted on cloud VMs (Cloud Hosting), while others have evolved to true cloud‑native designs that leverage elastic, serverless, and data‑sharing capabilities.

Why Separate Compute and Storage

Massive Parallel Processing (MPP) is the dominant architecture for OLAP databases. In MPP, each node combines CPU, memory, and disk, processing its own data partition without resource contention, which yields good parallel performance. However, tightly coupling compute and storage makes it difficult to satisfy diverse workloads: data‑ingestion tasks consume high I/O and network bandwidth but little CPU, while complex queries demand heavy CPU. Scaling CPU in a coupled architecture forces data reshuffling, consuming additional network and CPU resources, limiting elasticity.

Separating compute from storage allows independent sizing of each layer, enabling rapid scaling or shrinking of compute resources without costly data movement, and lets each layer exploit its own strengths.

Industry Trends

1. Redshift – AWS’s popular data warehouse uses MPP and has added elastic features. In November 2018 it introduced Elastic Resize, reducing scaling time, followed by Elastic Resize Scheduling in 2019. In December 2019 Redshift launched the RA3 architecture, separating compute (SSD cache) from storage (S3), providing independent elasticity.

2. Snowflake – From its inception Snowflake adopted a compute‑storage separation model across clouds. Storage resides in object stores (S3, Azure Blob), while compute runs in virtual warehouses (clusters of EC2 instances). This design enables per‑workload scaling, auto‑suspend/resume, and a pay‑as‑you‑go experience.

AnalyticDB Elastic Mode

AnalyticDB MySQL initially followed a traditional MPP design. In May 2020 it introduced an elastic mode based on compute‑storage separation, comprising three layers: an access layer (MySQL‑compatible, handling permissions, optimizer, metadata, and query scheduling), a compute layer, and a storage layer.

Storage Layer

The storage layer handles real‑time writes, index building, scanning, and predicate push‑down (filter, column pruning, partition pruning). Data remains organized with MPP‑style sharding (hash/random), enabling strong consistency for writes and shard‑level concurrent reads. It also provides integrated hot‑cold tiered storage: hot tables reside on local SSD, cold tables on underlying DFS, with automatic migration between tiers.

Compute Layer

The compute layer consists of multiple compute nodes that receive physical execution plans from the access layer, translate them into operators, and execute using a vectorized model. Data flows between operators in batches (typically thousands of rows) stored column‑wise. A JIT module generates native code for expressions, sorting, and type comparisons, caching compiled patterns to reduce overhead for interactive queries.

Execution Plan

In the separated architecture, a new Resharding operator loads data from storage. Data is transferred in columnar batches (≤32 MB per request). The optimizer leverages the storage layer’s pre‑partitioning to minimize unnecessary repartitioning during joins and aggregations, and pushes down filters that can be satisfied by storage‑layer indexes, reducing data movement.

Dynamic Partition Redistribution

Resharding and Scan operators redistribute partitions according to three principles:

Partitions from the same storage node are spread across different compute nodes.

For a single query, identical partitions from different tables are co‑located on the same compute node.

Across different queries, the same partition is randomly assigned to compute nodes.

Unlike Snowflake and Redshift, AnalyticDB does not bind partitions to fixed compute nodes; compute nodes rely on storage‑layer SSD and memory caches for fast access, and dynamic redistribution mitigates hotspot and data‑skew issues.

Data Loading Optimizations

Merge network connections to reduce round‑trips for small queries, lowering latency.

Compress batch data using columnar encoding to save bandwidth and improve Resharding throughput.

Asynchronous reads: network module pre‑loads data into buffers, allowing the Resharding operator to consume data while CPU and I/O run in parallel.

Performance Test

We evaluated the impact of compute‑storage separation on AnalyticDB’s query throughput using TPCH Q1 (a single‑table aggregation) on a 1 TB dataset. The test measured average execution time under single‑threaded sequential execution.

Test Environment

Instance 1 (non‑separated): 4 storage nodes handling both scanning and computation (4 × 3 × 8‑core).

Instance 2 (elastic mode): 4 storage nodes + 6 compute nodes (storage: 4 × 3 × 8‑core; compute: 6 × 16‑core).

Test Scenario

TPCH Q1 selects and aggregates columns from the lineitem table with a date filter, transferring roughly 260 GB from storage to compute.

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '120' day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

Test Data

Mode

TPCH Q1

Storage Node CPU

Compute Node CPU

Non‑separated

83 s

98%

Elastic

81 s

19.5%

97%

Test Conclusion

The elastic mode achieved slightly better execution time. Although separating compute and storage adds network and serialization overhead, the total CPU consumption (19.5% + 97% = 116.5%) is only 1.19× the non‑separated mode’s 98%, indicating that the extra cost is modest. As long as the storage layer provides sufficient throughput, compute‑storage separation does not degrade query throughput and offers greater elasticity.

Summary

Building on AnalyticDB’s elastic mode, future work will deepen elasticity through compute resource pooling, on‑demand scaling, and rapid storage‑layer expansion using shared storage, further reducing customer costs while meeting cloud data‑warehouse demands.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Data WarehouseAnalyticDBelastic scalingCompute-Storage Separation
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.