MaxCompute’s Integrated Offline & Near‑Real‑Time Architecture: Transaction Table 2.0 Explained
This article explains MaxCompute’s new integrated offline‑and‑near‑real‑time architecture, Transaction Table 2.0, detailing its unified storage and compute design, automatic data governance, schema evolution, upsert and time‑travel capabilities, and how it simplifies complex big‑data pipelines while delivering minute‑level latency and lower costs.
As data processing scenarios become more complex, big‑data platforms need both massive storage and efficient batch and near‑real‑time processing. This article introduces MaxCompute’s new offline‑and‑near‑real‑time integrated architecture, Transaction Table 2.0 (TT2), which provides unified storage and compute for both batch and incremental workloads.
Business Background and Current Situation
Traditional large‑scale batch processing can be satisfied by MaxCompute alone, but emerging requirements for low‑latency incremental and near‑real‑time pipelines expose limitations of single‑engine or Lambda architectures.
Three conventional solutions are described:
Pure offline batch with T+1 latency, increasing business logic complexity and storage cost.
A single real‑time engine, which incurs high resource cost and limited scalability for massive batch workloads.
Lambda architecture combining batch and streaming, which suffers from data inconsistency, duplicated storage, architectural complexity, and long development cycles.
New Integrated Architecture
MaxCompute has designed an integrated offline‑and‑near‑real‑time architecture that supports Transaction Table 2.0. It offers minute‑level latency for incremental writes, ACID transaction isolation, automatic small‑file merging, and features such as Upsert, Time‑Travel, and incremental queries, while retaining batch processing efficiency.
Key components include a unified table format, automatic data governance (Auto Sort, Auto Merge, Auto Partial Compact, Auto Clean), and a metadata service that guarantees snapshot isolation.
Table Creation and Key Properties
Example DDL:
create table tt2 (pk bigint not null primary key, val string) tblproperties ("transactional"="true");
Important table properties:
write.bucket.num – controls the number of buckets per partition or per non-partitioned table, affecting write and read parallelism as well as file size.
acid.data.retain.hours – defines the time-travel retention window (default 1 day, maximum 7 days).
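As an illustrative sketch, both properties can be set in the same tblproperties clause as "transactional"; the bucket count and retention values below are arbitrary examples and the table name tt2_tuned is hypothetical:
create table tt2_tuned (pk bigint not null primary key, val string)
tblproperties (
    "transactional"="true",
    "write.bucket.num"="16",          -- example bucket count; tune for parallelism and target file size
    "acid.data.retain.hours"="72"     -- example 3-day time-travel window (maximum is 7 days)
);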
Schema Evolution
alter table tt2 add columns (val2 string);
alter table tt2 drop columns val;
Automatic Data Governance
Four services run automatically:
Auto Sort – converts row‑store Avro files to column‑store AliORC, saving storage and improving read efficiency.
Auto Merge – merges small files without discarding historical records, aiding time‑travel queries.
Auto Partial Compact – compacts files older than the time‑travel window, removing intermediate states and reducing storage cost.
Auto Clean – deletes obsolete files, freeing storage.
Write‑Path Scenarios
Minute‑level upsert pipelines using MaxCompute’s Flink connector achieve 5‑10 minute latency without complex ETL. Partial‑column upserts can be performed via SQL INSERT statements or the Flink connector.
create table tt2 (pk bigint not null primary key, val1 string, val2 string, val3 string) tblproperties ("transactional"="true");
insert into tt2 (pk, val1) select pk, val1 from table1;
insert into tt2 (pk, val2) select pk, val2 from table2;
insert into tt2 (pk, val3) select pk, val3 from table3;
SQL DML / Upsert Batch
All DML statements (INSERT, UPDATE, DELETE, MERGE) are supported. An upsert can be expressed with a plain INSERT because the engine merges rows by primary key, which simplifies pipelines and reduces I/O; a few illustrative statements follow.
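A minimal sketch of the other DML forms using standard syntax; tt2_delta is a hypothetical staging table with the same schema as tt2, not something from the original article:
-- update a single column for one primary key
update tt2 set val1 = 'new value' where pk = 100;
-- delete rows by primary key
delete from tt2 where pk = 200;
-- apply a change set from the hypothetical staging table tt2_delta
merge into tt2 t
using tt2_delta d
on t.pk = d.pk
when matched then update set t.val1 = d.val1, t.val2 = d.val2, t.val3 = d.val3
when not matched then insert values (d.pk, d.val1, d.val2, d.val3);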
Time‑Travel Queries
-- query data as of a specific timestamp
select * from tt2 timestamp as of '2024-04-01 01:00:00';
-- query data as of 5 minutes ago
select * from tt2 timestamp as of current_timestamp() - 300;
-- query data as of the second-most recent commit
select * from tt2 timestamp as of get_latest_timestamp('tt2', 2);
Incremental Queries
-- incremental data between two timestamps
select * from tt2 timestamp between '2024-04-01 01:00:00' and '2024-04-01 01:10:00';
-- incremental data committed between roughly 10 and 5 minutes ago
select * from tt2 timestamp between current_timestamp() - 601 and current_timestamp() - 300;
-- incremental data from the most recent commit
select * from tt2 timestamp between get_latest_timestamp('tt2', 2) and get_latest_timestamp('tt2');
PK Point‑Lookup Optimization
Bucket, file, and block pruning dramatically reduce I/O when querying by primary key, often cutting resource consumption by orders of magnitude.
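For example, a simple point lookup on the primary key of the tt2 table defined earlier (the key value 12345 is arbitrary):
-- bucket, file, and block pruning limit the scan to the data that can contain pk = 12345
select val1, val2, val3 from tt2 where pk = 12345;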
Query Plan Optimizations
Because data are bucketed and sorted by PK, the optimizer can eliminate distinct, shuffle, and sort operators, enabling faster joins and overall performance gains of more than 2× in many cases.
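As a rough sketch of the effect (tt2_delta is the hypothetical staging table from the DML example, assumed to be bucketed on the same primary key):
-- both sides are already bucketed and sorted by pk, so the optimizer can skip shuffle and sort
select t.pk, t.val1, d.val2
from tt2 t
join tt2_delta d on t.pk = d.pk;
-- distinct on the primary key can be eliminated outright, since pk values are unique
select distinct pk from tt2;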
Real‑Time Full‑Database Sync
Instead of traditional hourly ETL, the new architecture allows minute‑level upsert of whole‑database change streams directly into TT2, reducing latency and cost while keeping only a single table per synced source table rather than separate incremental and full copies.
Future Roadmap
CDC read/write support.
Incremental materialized views.
Second‑level data visibility.
Further governance and query performance enhancements.
More details are available in the official MaxCompute documentation.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.