
How MaxCompute’s New Offline‑Near‑Real‑Time Architecture Revolutionizes Big Data Workloads

This article explains how MaxCompute’s integrated offline‑and‑near‑real‑time architecture, built on Delta Table, solves complex big‑data scenarios by providing unified storage, ACID transactions, upsert, time‑travel, automatic data‑file governance and low‑latency query capabilities while reducing cost and operational complexity.

Alibaba Cloud Developer

May 27, 2024


Business Background and Current Situation

Traditional big‑data workloads on MaxCompute handle large‑scale batch processing well, but emerging use cases demand low‑latency, near‑real‑time pipelines and incremental processing, exposing limitations of pure offline or Lambda architectures.

Limitations of Existing Solutions

Three typical approaches each fall short. Offline batch alone cannot meet low‑latency requirements, a real‑time engine alone is costly for large‑scale workloads, and a Lambda architecture duplicates data across two systems and adds operational complexity.

New Integrated Architecture

MaxCompute now offers an offline‑and‑near‑real‑time unified architecture using Delta Table, which supports both batch and incremental workloads, provides ACID transactions, upsert, time‑travel, and automatic data‑file optimization (auto‑sort, auto‑merge, auto‑compact, auto‑clean).

Delta Table Basics

Delta Table stores data in a unified format that enables minute‑level incremental writes and large‑scale batch queries. Creating a table requires a primary key and the transactional property set to true.

create table dt (pk bigint not null primary key, val string) tblproperties ("transactional"="true");

Key Table Properties

Important properties include write.bucket.num (controls bucket count per partition) and acid.data.retain.hours (time‑travel retention). Proper tuning balances storage cost, query performance, and concurrency.

create table dt (pk bigint not null primary key, val string) tblproperties ("transactional"="true", "write.bucket.num"="32", "acid.data.retain.hours"="48");
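
The effect of write.bucket.num can be pictured with a toy router. The sketch below assumes each row is assigned to a bucket by hashing its primary key modulo the bucket count. The hash function (zlib.crc32) and the helper names bucket_for and bucket_histogram are illustrative assumptions, not MaxCompute internals.

```python
import zlib
from collections import Counter


def bucket_for(pk: int, bucket_num: int) -> int:
    """Map a primary key to a bucket index.

    Assumption: a CRC32 hash stands in for whatever hash the engine really uses.
    """
    return zlib.crc32(str(pk).encode("utf-8")) % bucket_num


def bucket_histogram(pks, bucket_num: int) -> Counter:
    """Count how many keys land in each bucket."""
    return Counter(bucket_for(pk, bucket_num) for pk in pks)


if __name__ == "__main__":
    # Compare how 10,000 keys spread over 8 buckets versus 32 buckets.
    for n in (8, 32):
        hist = bucket_histogram(range(10_000), n)
        print(n, "buckets, largest bucket holds", max(hist.values()), "keys")
```

Under this model, a larger bucket count spreads the same keys over more, smaller units, which is the trade-off between concurrency and file count that the tuning advice above refers to.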

Schema Evolution

Delta Table supports adding or dropping columns via alter table statements while preserving historical versions.

alter table dt add columns (val2 string);
alter table dt drop columns val;
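
One way to think about schema evolution is that stored rows are read through whichever schema is current. The following is a minimal sketch under that assumption; the function project and the schema lists are hypothetical and say nothing about how MaxCompute stores or rewrites files.

```python
def project(row: dict, schema: list) -> dict:
    """Read a stored row through a schema.

    Columns missing from the stored row come back as None;
    stored columns that are no longer in the schema are hidden.
    """
    return {col: row.get(col) for col in schema}


# A row written before any ALTER TABLE statement ran.
old_row = {"pk": 1, "val": "a"}

schema_after_add = ["pk", "val", "val2"]   # after: add columns (val2 string)
schema_after_drop = ["pk", "val2"]         # after: drop columns val

if __name__ == "__main__":
    print(project(old_row, schema_after_add))
    print(project(old_row, schema_after_drop))
```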

Automatic Data Governance

The service automatically performs:

Auto Sort – converts incoming rows to columnar format.

Auto Merge – merges small files without removing historical records.

Auto Partial Compact – compacts files and removes intermediate states for older versions.

Auto Clean – deletes obsolete files.
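
The difference between Auto Merge and Auto Partial Compact can be sketched with a toy model. It assumes each file is a list of (pk, commit_ts, value) records; the function names and the cutoff_ts parameter are hypothetical, and the real service works on columnar files rather than Python lists.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, int, str]  # (pk, commit_ts, value)


def merge_small_files(files: List[List[Record]]) -> List[Record]:
    """Combine small files into one sorted file. Every record is kept."""
    return sorted(rec for f in files for rec in f)


def partial_compact(records: List[Record], cutoff_ts: int) -> List[Record]:
    """Collapse records at or before cutoff_ts to the latest one per pk.

    Records newer than cutoff_ts are left untouched, so recent history
    stays available for time-travel in this model.
    """
    latest_old: Dict[int, Record] = {}
    recent: List[Record] = []
    for pk, ts, val in records:
        if ts <= cutoff_ts:
            if pk not in latest_old or ts > latest_old[pk][1]:
                latest_old[pk] = (pk, ts, val)
        else:
            recent.append((pk, ts, val))
    return sorted(list(latest_old.values()) + recent)


if __name__ == "__main__":
    files = [[(1, 10, "a")], [(1, 20, "b"), (2, 20, "x")], [(1, 30, "c")]]
    merged = merge_small_files(files)
    print(merged)
    print(partial_compact(merged, cutoff_ts=20))
```

In this model, merging only reduces the number of files, while partial compaction also reduces the number of records by dropping intermediate states that fall outside the recent window.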

Upsert and Partial‑Column Updates

Minute‑level upsert pipelines write directly to Delta Table, achieving 5‑10 minute latency without complex ETL. Partial‑column updates can be performed via sequential INSERT INTO statements or via the Flink connector.

create table dt (pk bigint not null primary key, val1 string, val2 string, val3 string) tblproperties ("transactional"="true");
insert into dt (pk, val1) select pk, val1 from table1;
insert into dt (pk, val2) select pk, val2 from table2;
insert into dt (pk, val3) select pk, val3 from table3;
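
The intended result of the three inserts is one complete row per primary key. A minimal sketch of that merge behaviour, assuming an in-memory dict keyed by pk (the helper upsert_partial is hypothetical and is not a MaxCompute API):

```python
COLUMNS = ("val1", "val2", "val3")


def upsert_partial(table: dict, pk: int, **cols) -> None:
    """Insert the row if pk is new; otherwise overwrite only the supplied columns."""
    unknown = set(cols) - set(COLUMNS)
    if unknown:
        raise KeyError(f"unknown columns: {sorted(unknown)}")
    row = table.setdefault(pk, {c: None for c in COLUMNS})
    row.update(cols)


if __name__ == "__main__":
    dt = {}
    upsert_partial(dt, 1, val1="a")   # plays the role of the insert from table1
    upsert_partial(dt, 1, val2="b")   # plays the role of the insert from table2
    upsert_partial(dt, 1, val3="c")   # plays the role of the insert from table3
    print(dt)
```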

SQL DML and Query Capabilities

MaxCompute supports full DQL/DML syntax, including upsert, merge, delete, and time‑travel queries. Time‑travel allows querying historical snapshots using timestamps or commit IDs.

-- query data as of a specific timestamp
select * from dt timestamp as of '2024-04-01 01:00:00';

-- query data as it was 5 minutes (300 seconds) ago
select * from dt timestamp as of current_timestamp() - 300;

-- query data as of the second most recent commit
select * from dt timestamp as of get_latest_timestamp('dt', 2);
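
Conceptually, a time-travel query replays the commit history up to the requested point. The sketch below is a toy model of that idea, assuming an in-memory commit log; the class VersionedTable and its methods are hypothetical names chosen to mirror the SQL above, not part of any MaxCompute SDK.

```python
from bisect import bisect_right


class VersionedTable:
    """Toy commit log: each commit is (timestamp, {pk: value})."""

    def __init__(self):
        self._commits = []  # kept in ascending timestamp order

    def commit(self, ts: int, changes: dict) -> None:
        """Append a commit; timestamps must be strictly increasing."""
        if self._commits and ts <= self._commits[-1][0]:
            raise ValueError("commit timestamps must increase")
        self._commits.append((ts, dict(changes)))

    def as_of(self, ts: int) -> dict:
        """Snapshot built by replaying every commit with timestamp <= ts."""
        idx = bisect_right([c[0] for c in self._commits], ts)
        snapshot = {}
        for _, changes in self._commits[:idx]:
            snapshot.update(changes)
        return snapshot

    def latest_timestamp(self, n: int = 1) -> int:
        """Timestamp of the n-th most recent commit (1 = latest)."""
        return self._commits[-n][0]


if __name__ == "__main__":
    t = VersionedTable()
    t.commit(100, {1: "a"})
    t.commit(200, {1: "b", 2: "x"})
    t.commit(300, {2: "y"})
    print(t.as_of(250))
    print(t.as_of(t.latest_timestamp(2)))
```

In the real service, how far back such a query can reach is bounded by acid.data.retain.hours; the toy model keeps all history.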

Incremental Query Support

Two syntaxes are available: explicit timestamp/version range and automatic version tracking for streaming use cases.

-- explicit range
select * from dt timestamp between '2024-04-01 01:00:00' and '2024-04-01 01:10:00';

-- automatic streaming
create stream dt_stream on table dt;
insert into dt values (1,'a'), (2,'b');
insert overwrite dest select * from dt_stream;
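
The two styles differ in who tracks the read position: the caller supplies the range explicitly, or the stream object remembers it. A minimal sketch of both, assuming an in-memory change log; ChangeLog and StreamReader are hypothetical names, and treating the range as open on the left and closed on the right is an assumption of this model.

```python
class ChangeLog:
    """Toy append-only log of (commit_ts, rows)."""

    def __init__(self):
        self.commits = []

    def append(self, ts, rows):
        self.commits.append((ts, list(rows)))

    def between(self, start, end):
        """Rows committed in (start, end]. The interval bounds are an assumption."""
        return [r for ts, rows in self.commits if start < ts <= end for r in rows]


class StreamReader:
    """Remembers the last timestamp consumed, so each read returns only new rows."""

    def __init__(self, log: ChangeLog):
        self.log = log
        self.offset = float("-inf")

    def read(self):
        if not self.log.commits:
            return []
        end = self.log.commits[-1][0]
        rows = self.log.between(self.offset, end)
        self.offset = end  # advance so the next read skips these rows
        return rows


if __name__ == "__main__":
    log = ChangeLog()
    reader = StreamReader(log)
    log.append(10, [(1, "a"), (2, "b")])
    print(reader.read())   # first read sees both rows
    print(reader.read())   # nothing new since the last read
```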

PK Point‑Lookup Optimization

Queries filtering on the primary key benefit from bucket pruning, file pruning, and block pruning, dramatically reducing I/O.

select * from dt where pk = 1;
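
The three pruning levels can be illustrated with a toy layout. The sketch assumes buckets are chosen by pk modulo the bucket count, each file records its min and max key, and keys inside a file are sorted so a binary search can stand in for block pruning. All names here (make_file, lookup) are hypothetical.

```python
from bisect import bisect_left


def make_file(rows: dict) -> dict:
    """Build a toy file: sorted keys, aligned values, and min/max statistics."""
    keys = sorted(rows)
    return {"keys": keys, "vals": [rows[k] for k in keys],
            "min": keys[0], "max": keys[-1]}


def lookup(buckets: dict, pk: int, bucket_num: int):
    """Return (value, files_opened) for a primary-key lookup."""
    files = buckets.get(pk % bucket_num, [])      # bucket pruning (modulo stands in for a hash)
    opened = 0
    for f in files:
        if not (f["min"] <= pk <= f["max"]):      # file pruning via min/max statistics
            continue
        opened += 1
        i = bisect_left(f["keys"], pk)            # binary search stands in for block pruning
        if i < len(f["keys"]) and f["keys"][i] == pk:
            return f["vals"][i], opened
    return None, opened


if __name__ == "__main__":
    buckets = {
        0: [make_file({2: "v2", 4: "v4"})],
        1: [make_file({1: "v1", 3: "v3", 5: "v5"}),
            make_file({7: "v7", 9: "v9", 11: "v11"})],
    }
    print(lookup(buckets, 9, 2))
```

In this layout a lookup for pk = 9 touches one of three files, and a lookup for a key outside every file's range opens none.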

SQL Optimizer Enhancements

Because data is bucketed and sorted by PK, the optimizer can eliminate distinct, shuffle, and sort operators, enabling faster joins and aggregations.

select * from (select distinct pk from dt_t1) t join (select distinct pk from dt_t2) t2 on t.pk = t2.pk;
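
For the query above, primary keys are already unique and sorted, so the distinct and sort steps add nothing and the join can proceed as a single merge pass. A minimal sketch of that merge, assuming both inputs arrive as ascending, duplicate-free key lists (as they would within one aligned bucket pair in this model); merge_join_keys is a hypothetical helper, not the optimizer's actual operator.

```python
def merge_join_keys(left: list, right: list) -> list:
    """Inner-join two ascending, duplicate-free key lists in one pass.

    No sorting, deduplication, or repartitioning happens here; the inputs
    are assumed to be in the required order already.
    """
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


if __name__ == "__main__":
    print(merge_join_keys([1, 3, 5, 7], [3, 4, 5, 8]))
```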

Real‑Time Database Sync

MaxCompute now supports direct minute‑level upsert of whole‑database change streams, replacing the previous multi‑step ETL process.
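
Applying a whole-database change stream amounts to routing each event to its table and upserting or deleting by primary key. The sketch below is a toy model of that loop; the event shape (table, op, pk, row) and the function apply_changes are assumptions made for illustration, not the format used by any MaxCompute sync tool.

```python
def apply_changes(tables: dict, events: list) -> dict:
    """Apply change events in order.

    Each event is (table_name, op, pk, row), where op is
    "insert", "update", or "delete". Inserts and updates both
    resolve to an upsert on the primary key.
    """
    for table_name, op, pk, row in events:
        target = tables.setdefault(table_name, {})
        if op == "delete":
            target.pop(pk, None)
        elif op in ("insert", "update"):
            target[pk] = row
        else:
            raise ValueError(f"unknown operation: {op}")
    return tables


if __name__ == "__main__":
    events = [
        ("orders", "insert", 1, {"amt": 10}),
        ("orders", "update", 1, {"amt": 12}),
        ("users", "insert", 7, {"name": "x"}),
        ("users", "delete", 7, None),
    ]
    print(apply_changes({}, events))
```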

Advantages

Low cost for near‑real‑time and incremental workloads.

Unified storage, metadata, and compute engine.

Full SQL support with ACID guarantees.

Highly optimized data‑import tools.

Seamless migration from existing MaxCompute workloads.

Automatic data governance ensures stability and performance.

Fully managed, zero‑setup deployment.

Production Status and Future Roadmap

Over 100 projects and 700+ Delta tables are in production on Alibaba Cloud. Upcoming features include CDC read/write, incremental materialized views, sub‑second visibility, and deeper data‑service optimizations.
