How MaxCompute’s New Offline‑Near‑Real‑Time Architecture Revolutionizes Big Data Workloads
This article explains how MaxCompute’s integrated offline‑and‑near‑real‑time architecture, built on Delta Table, solves complex big‑data scenarios by providing unified storage, ACID transactions, upsert, time‑travel, automatic data‑file governance and low‑latency query capabilities while reducing cost and operational complexity.
Business Background and Current Situation
Traditional big‑data workloads on MaxCompute handle large‑scale batch processing well, but emerging use cases demand low‑latency, near‑real‑time pipelines and incremental processing, exposing limitations of pure offline or Lambda architectures.
Limitations of Existing Solutions
Three typical approaches—offline batch only, real‑time engine only, or Lambda architecture—suffer from high cost, poor latency, data duplication, and operational complexity.
New Integrated Architecture
MaxCompute now offers an offline‑and‑near‑real‑time unified architecture using Delta Table, which supports both batch and incremental workloads, provides ACID transactions, upsert, time‑travel, and automatic data‑file optimization (auto‑sort, auto‑merge, auto‑compact, auto‑clean).
Delta Table Basics
Delta Table stores data in a unified format that enables minute‑level incremental writes and large‑scale batch queries. Creating a table requires a primary key and the transactional property set to true.
create table dt (pk bigint not null primary key, val string) tblproperties ("transactional"="true");
Key Table Properties
Important properties include write.bucket.num (controls bucket count per partition) and acid.data.retain.hours (time‑travel retention). Proper tuning balances storage cost, query performance, and concurrency.
create table dt (pk bigint not null primary key, val string) tblproperties ("transactional"="true", "write.bucket.num"="32", "acid.data.retain.hours"="48");
Schema Evolution
Delta Table supports adding or dropping columns via alter table statements while preserving historical versions.
alter table dt add columns (val2 string);
alter table dt drop columns val;
Automatic Data Governance
The service automatically performs:
Auto Sort – converts incoming rows to columnar format.
Auto Merge – merges small files without removing historical records.
Auto Partial Compact – compacts files and removes intermediate states for older versions.
Auto Clean – deletes obsolete files.
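The difference between Auto Merge and Auto Partial Compact can be pictured with a toy model. The sketch below is not MaxCompute's implementation: the `Record` type, the function names, and the `retain_from` version boundary are all hypothetical, chosen only to show that merging keeps every historical record while compaction drops intermediate states older than the retention boundary.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    pk: int
    value: str
    version: int  # commit version that wrote this row (hypothetical field)


def auto_merge(files: list[list[Record]]) -> list[Record]:
    """Combine many small files into one, keeping every historical record."""
    merged = [rec for f in files for rec in f]
    return sorted(merged, key=lambda r: (r.pk, r.version))


def partial_compact(records: list[Record], retain_from: int) -> list[Record]:
    """Drop intermediate states written before retain_from.

    Per key, keep the newest record older than retain_from (the base state)
    plus every record at or after retain_from, so snapshots inside the
    retention window can still be rebuilt.
    """
    base: dict[int, Record] = {}
    recent: list[Record] = []
    for rec in records:
        if rec.version >= retain_from:
            recent.append(rec)
        elif rec.pk not in base or rec.version > base[rec.pk].version:
            base[rec.pk] = rec
    return sorted(list(base.values()) + recent, key=lambda r: (r.pk, r.version))


small_files = [
    [Record(1, "a", 1)],
    [Record(1, "b", 2), Record(2, "x", 2)],
    [Record(1, "c", 3)],
    [Record(2, "y", 4)],
]
merged = auto_merge(small_files)                     # 5 records, nothing lost
compacted = partial_compact(merged, retain_from=3)   # (1, "a", 1) is dropped
```

In this model Auto Clean would simply be the deletion of the four input files once `merged` or `compacted` has replaced them.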
Upsert and Partial‑Column Updates
Minute‑level upsert pipelines write directly to Delta Table, achieving 5‑10 minute latency without complex ETL. Partial‑column updates can be performed via sequential INSERT INTO statements or via the Flink connector.
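As a rough illustration of what partial-column updates mean, the Python sketch below models a table as a dictionary keyed by primary key. It assumes that columns omitted from a write keep their stored value and that a new key starts with empty values for unspecified columns; the helper name `upsert_partial` is hypothetical and the code only mirrors the semantics, not the storage engine.

```python
COLUMNS = ("val1", "val2", "val3")


def upsert_partial(table, rows):
    """Merge each row into the table by primary key.

    Columns absent from a row keep their stored value; a new key starts
    with None in every column that the row does not specify.
    """
    for row in rows:
        current = table.setdefault(row["pk"], dict.fromkeys(COLUMNS))
        for col in COLUMNS:
            if col in row:
                current[col] = row[col]
    return table


dt = {}
upsert_partial(dt, [{"pk": 1, "val1": "a"}, {"pk": 2, "val1": "b"}])   # first source
upsert_partial(dt, [{"pk": 1, "val2": "c"}])                           # second source
upsert_partial(dt, [{"pk": 2, "val3": "d"}, {"pk": 1, "val1": "a2"}])  # third source
```

After the three writes, key 1 holds `val1="a2"` and `val2="c"`, while key 2 holds `val1="b"` and `val3="d"`; no write had to supply a full row.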
createtable dt (pk bigint notnull primarykey, val1 string, val2 string, val3 string) tblproperties ("transactional"="true");</code><code>insert into dt (pk, val1) select pk, val1 from table1;</code><code>insert into dt (pk, val2) select pk, val2 from table2;</code><code>insert into dt (pk, val3) select pk, val3 from table3;SQL DML and Query Capabilities
MaxCompute supports full DQL/DML syntax, including upsert, merge, delete, and time‑travel queries. Time‑travel allows querying historical snapshots using timestamps or commit IDs.
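Conceptually, a time-travel query rebuilds the table state by replaying commits up to the requested point. The following sketch is a simplified model under that assumption, with integer timestamps and a hypothetical in-memory commit log; `snapshot_as_of` and `nth_latest_timestamp` are illustrative names, the latter playing the role that `get_latest_timestamp` plays in SQL.

```python
def snapshot_as_of(commits, ts):
    """Return {pk: value} as visible at time ts by replaying commits up to ts."""
    state = {}
    for commit_ts, changes in sorted(commits, key=lambda c: c[0]):
        if commit_ts > ts:
            break
        state.update(changes)  # later commits overwrite earlier values per key
    return state


def nth_latest_timestamp(commits, n):
    """Timestamp of the n-th most recent commit (n=1 is the latest)."""
    return sorted((c[0] for c in commits), reverse=True)[n - 1]


# Hypothetical commit log: (commit timestamp, {pk: value}).
commits = [
    (100, {1: "a"}),
    (200, {2: "b"}),
    (300, {1: "a2"}),
]

at_250 = snapshot_as_of(commits, 250)
second_latest = snapshot_as_of(commits, nth_latest_timestamp(commits, 2))
latest = snapshot_as_of(commits, 300)
```

Here `at_250` and `second_latest` both show key 1 with its original value, because the update committed at 300 is not yet visible.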
-- query data as of a specific timestamp
select * from dt timestamp as of '2024-04-01 01:00:00';
-- query data as of 5 minutes ago
select * from dt timestamp as of current_timestamp() - 300;
-- query data as of the second-latest commit
select * from dt timestamp as of get_latest_timestamp('dt', 2);
Incremental Query Support
Two syntaxes are available: explicit timestamp/version range and automatic version tracking for streaming use cases.
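The two modes differ in who tracks the position in the commit history. The sketch below is a toy model, not the service's behavior: it assumes the explicit range is open on the left and closed on the right, and that a stream advances its offset every time it is read. The `Stream` class and `changes_between` function are hypothetical names.

```python
def changes_between(commits, start, end):
    """Explicit range: rows from commits with start < ts <= end (assumed bounds)."""
    return [row for ts, rows in commits if start < ts <= end for row in rows]


class Stream:
    """Remembers the last commit consumed so each read returns only new rows."""

    def __init__(self, commits):
        self._commits = commits  # shared, append-only list of (ts, rows)
        self._offset = 0

    def read(self):
        new = self._commits[self._offset:]
        self._offset = len(self._commits)  # advance past everything just read
        return [row for _, rows in new for row in rows]


commits = [(100, [(1, "a")])]
stream = Stream(commits)

first = stream.read()                        # everything committed so far
commits.append((200, [(2, "b"), (3, "c")]))  # a new write arrives
second = stream.read()                       # only the new rows
third = stream.read()                        # nothing new
ranged = changes_between(commits, 100, 200)  # caller supplies the bounds
```

With the explicit range the caller must store the last processed timestamp somewhere; with the stream that bookkeeping lives in the stream object.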
-- explicit range
select * from dt timestamp between '2024-04-01 01:00:00' and '2024-04-01 01:10:00';
-- automatic version tracking through a stream
create stream dt_stream on table dt;
insert into dt values (1,'a'), (2,'b');
insert overwrite table dest select * from dt_stream;
PK Point‑Lookup Optimization
Queries filtering on the primary key benefit from bucket pruning, file pruning, and block pruning, dramatically reducing I/O.
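The three pruning levels can be sketched as successive narrowing steps. This is a simplified model with assumed details: a modulo stands in for the real hash function, files are fixed-size slices of a sorted bucket, and blocks are fixed-size slices of a file. The constants and function names are hypothetical.

```python
from bisect import bisect_right

NUM_BUCKETS = 4   # stand-in for write.bucket.num
BLOCK_SIZE = 2    # rows per block in this toy layout


def bucket_of(pk):
    return pk % NUM_BUCKETS  # stand-in for the real hash function


def build_table(rows, file_size=4):
    """Place rows in buckets, sort each bucket by pk, and slice it into files."""
    buckets = {}
    for b in range(NUM_BUCKETS):
        in_bucket = sorted(r for r in rows if bucket_of(r[0]) == b)
        buckets[b] = [in_bucket[i:i + file_size]
                      for i in range(0, len(in_bucket), file_size)]
    return buckets


def point_lookup(buckets, pk):
    """Return (value, rows_scanned) for pk while skipping irrelevant data."""
    scanned = 0
    for f in buckets[bucket_of(pk)]:              # bucket pruning
        if not (f[0][0] <= pk <= f[-1][0]):       # file pruning by min/max pk
            continue
        blocks = [f[i:i + BLOCK_SIZE] for i in range(0, len(f), BLOCK_SIZE)]
        first_keys = [blk[0][0] for blk in blocks]
        idx = bisect_right(first_keys, pk) - 1    # block pruning
        for key, value in blocks[idx]:
            scanned += 1
            if key == pk:
                return value, scanned
    return None, scanned


table = build_table([(pk, f"v{pk}") for pk in range(64)])
hit = point_lookup(table, 25)     # ("v25", 1): one row read out of 64
miss = point_lookup(table, 100)   # (None, 0): every file is pruned
```

The lookup for key 25 reads a single row because only one bucket, one file, and one block can contain it.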
select * from dt where pk = 1;
SQL Optimizer Enhancements
Because data is bucketed and sorted by PK, the optimizer can eliminate distinct, shuffle, and sort operators, enabling faster joins and aggregations.
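To see why these operators become unnecessary, consider joining two key lists that are already sorted and, being primary keys, already unique. The sketch below contrasts a general plan with a single-pass merge join; it illustrates the idea only, assumes both inputs belong to the same bucket, and uses hypothetical function names.

```python
def naive_join(left, right):
    """General plan: deduplicate both sides, intersect, then sort the result."""
    return sorted(set(left) & set(right))


def merge_join_sorted_unique(left, right):
    """Inner join of two key lists that are already sorted and duplicate-free.

    No distinct and no sort step is needed: a single forward pass over both
    inputs finds every matching key.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


# Keys as they would be stored within one bucket: sorted and unique.
t1_keys = [1, 3, 5, 7, 9]
t2_keys = [3, 4, 5, 9, 10]

joined = merge_join_sorted_unique(t1_keys, t2_keys)
```

Both functions return the same keys, but the merge join never builds a hash set or sorts anything, which is the saving the optimizer exploits.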
select * from (select distinct pk from dt_t1) t join (select distinct pk from dt_t2) t2 on t.pk = t2.pk;
Real‑Time Database Sync
MaxCompute now supports direct minute‑level upsert of whole‑database change streams, replacing the previous multi‑step ETL process.
Advantages
Low cost for near‑real‑time and incremental workloads.
Unified storage, metadata, and compute engine.
Full SQL support with ACID guarantees.
Highly optimized data‑import tools.
Seamless migration from existing MaxCompute workloads.
Automatic data governance ensures stability and performance.
Fully managed, zero‑setup deployment.
Production Status and Future Roadmap
Over 100 projects and 700+ Delta tables are in production on Alibaba Cloud. Upcoming features include CDC read/write, incremental materialized views, sub‑second visibility, and deeper data‑service optimizations.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.