How MaxCompute’s Lakehouse Architecture Enables Near‑Real‑Time Incremental Processing

This article details the lakehouse evolution of Alibaba Cloud MaxCompute: its unified storage‑metadata‑compute design, the Transactional Table 2.0 format, near‑real‑time incremental ingestion, clustering and compaction services, transaction handling, TimeTravel and incremental queries, and the planned roadmap for big‑data workloads.

Alibaba Cloud Big Data AI Platform

Jun 27, 2023

MaxCompute Lakehouse Evolution

MaxCompute, the large‑scale data processing platform developed in house by Alibaba Cloud, has evolved from a pure data‑warehouse solution into a unified lakehouse architecture that supports both offline batch processing and near‑real‑time incremental processing.

Unified Storage‑Metadata‑Compute Design

The new architecture integrates storage, metadata, and compute engines, delivering low storage cost, efficient data‑file management, and high query performance while supporting ACID transactions, TimeTravel, and upsert capabilities.

Transactional Table 2.0 (TT2)

TT2 introduces a unified table type that combines traditional table features with incremental processing support. Declaring a primary key and setting the transactional=true property gives a table upsert, ACID, and snapshot‑isolation capabilities. Two further properties tune its behavior: write.bucket.num controls write concurrency, and acid.data.retain.hours controls how long historical data is retained.
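As an illustration of the properties named above, the following sketch assembles a CREATE TABLE statement in Python. The table name, column names, bucket count, and retention value are hypothetical, and the statement shape (a PRIMARY KEY clause followed by TBLPROPERTIES) is an assumption that should be checked against the official MaxCompute DDL reference before use.

```python
def build_tt2_ddl(table, columns, primary_key, bucket_num=16, retain_hours=24):
    """Render a CREATE TABLE statement for a hypothetical TT2 table.

    columns is a list of (name, type) pairs; primary_key is a list of names.
    """
    col_defs = []
    for name, dtype in columns:
        # Primary-key columns are declared NOT NULL in this sketch.
        suffix = " NOT NULL" if name in primary_key else ""
        col_defs.append(f"{name} {dtype}{suffix}")
    pk_list = ", ".join(primary_key)
    col_defs.append(f"PRIMARY KEY ({pk_list})")

    # The three property names come from the article; the values are examples.
    props = {
        "transactional": "true",
        "write.bucket.num": str(bucket_num),
        "acid.data.retain.hours": str(retain_hours),
    }
    prop_sql = ", ".join(f'"{k}"="{v}"' for k, v in props.items())
    body = ",\n  ".join(col_defs)
    return f"CREATE TABLE {table} (\n  {body}\n) TBLPROPERTIES ({prop_sql});"


ddl = build_tt2_ddl(
    "orders_tt2",
    [("order_id", "BIGINT"), ("status", "STRING")],
    ["order_id"],
)
print(ddl)
```

Generating the statement from a function keeps the three properties in one place, which may help when several tables share the same bucket and retention settings.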

Data Ingestion Ecosystem

Several tools support high‑throughput ingestion into TT2 tables, including the MaxCompute Flink Connector, DataWorks data integration, and the SDKs. They write through the Tunnel Server, which accepts concurrent writes with minute‑level latency and supports both upsert and delete record formats.

Near‑Real‑Time Incremental Architecture

The architecture consists of five modules: data ingestion, compute engine, data‑optimization service, metadata management, and data‑file organization. Existing MaxCompute components are reused where possible, while new services handle incremental workloads efficiently.

Core Modules

Data Ingestion: Supports full‑load and minute‑level incremental imports via the Flink Connector, DataWorks, and other tools, writing data to bucketed files through the Tunnel Server.

Compute Engine: Extends the native SQL engine to parse and execute incremental DDL/DML, including TimeTravel and upsert operations.

Data‑Optimization Service: Provides automatic small‑file clustering and compaction to merge delta files, reduce file count, and improve I/O efficiency.

Metadata Management: Uses MVCC (multi‑version concurrency control) for snapshot isolation and OCC (optimistic concurrency control) to detect write conflicts across all operations.

Data‑File Organization: Defines base files for batch reads and delta files for incremental writes, with bucket‑based partitioning and columnar compression.
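The bucketed write path mentioned in the modules above can be pictured with a small Python sketch. It assumes that records are routed to buckets by a hash of the primary key, which the article does not state explicitly; the CRC32 hash, the `id` field, and the function names are illustrative choices, not MaxCompute's actual routing.

```python
import zlib
from collections import defaultdict


def bucket_for(primary_key, bucket_num):
    """Map a primary key to a bucket index with a stable hash (assumed scheme)."""
    return zlib.crc32(str(primary_key).encode("utf-8")) % bucket_num


def route(records, bucket_num):
    """Group records by bucket so each bucket can be written independently."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_for(rec["id"], bucket_num)].append(rec)
    return dict(buckets)


records = [{"id": i, "status": "new"} for i in range(10)]
routed = route(records, bucket_num=4)
for index in sorted(routed):
    print(index, [r["id"] for r in routed[index]])
```

Because a given key always lands in the same bucket, all versions of a row end up in the same set of files, so writers working on different buckets do not touch each other's data.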

Clustering and Compaction

Clustering merges small delta files into larger ones based on file size and bucket index; it reduces the file count without changing the recorded history. Compaction merges base and delta files to eliminate intermediate update/delete states, producing new base files for fast snapshot queries.
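The difference between the two operations can be sketched in Python with a toy model. The `Record` type, the row-count threshold, and the single-bucket simplification are assumptions made for illustration; the real service works on files and takes file size and bucket index into account.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    key: int
    value: Optional[str]  # None marks a delete
    version: int


def cluster(delta_files, target_rows):
    """Concatenate adjacent small delta files; every record is preserved."""
    merged, current = [], []
    for delta in delta_files:
        current.extend(delta)
        if len(current) >= target_rows:
            merged.append(current)
            current = []
    if current:
        merged.append(current)
    return merged


def compact(base, delta_files):
    """Fold deltas into the base, keeping only the latest state per key."""
    records = list(base) + [r for delta in delta_files for r in delta]
    latest = {}
    for rec in sorted(records, key=lambda r: r.version):
        latest[rec.key] = rec  # later versions overwrite earlier ones
    live = (r for r in latest.values() if r.value is not None)
    return sorted(live, key=lambda r: r.key)


deltas = [
    [Record(1, "a", 1)],
    [Record(1, "b", 2), Record(2, "x", 3)],
    [Record(2, None, 4)],
    [Record(3, "c", 5)],
]
clustered = cluster(deltas, target_rows=2)
new_base = compact([], deltas)
print(len(clustered), new_base)
```

In this model clustering turns four files into two while keeping all five records, whereas compaction leaves only the final rows for keys 1 and 3, since key 2 was deleted.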

Transaction Management

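All write and optimization operations are coordinated by the Meta Service. It uses MVCC (multi‑version concurrency control) to give each operation an isolated snapshot, and OCC (optimistic concurrency control) to detect conflicts at commit time rather than locking up front; conflicts and retries are handled transparently.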
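A minimal Python sketch of optimistic commit validation follows. The `MetaService` class here is a toy stand-in, not the real service: it assumes that a conflict means two commits touched the same bucket after the writer's snapshot was taken, which is one plausible rule among several.

```python
class ConflictError(Exception):
    """Raised when a commit overlaps with work committed after its snapshot."""


class MetaService:
    def __init__(self):
        self.commits = []  # commits[i] holds the buckets touched by version i + 1

    @property
    def version(self):
        return len(self.commits)

    def visible(self, snapshot):
        """MVCC view: a reader pinned at `snapshot` sees only earlier commits."""
        return self.commits[:snapshot]

    def commit(self, snapshot, touched):
        """OCC check: fail if any commit after `snapshot` touched the same buckets."""
        touched = frozenset(touched)
        for later in self.commits[snapshot:]:
            if later & touched:
                raise ConflictError(f"buckets {sorted(later & touched)} changed")
        self.commits.append(touched)
        return self.version


meta = MetaService()
snapshot = meta.version              # three writers start from version 0
v1 = meta.commit(snapshot, {1})      # first writer commits bucket 1
v2 = meta.commit(snapshot, {2})      # disjoint bucket, so no conflict
try:
    meta.commit(snapshot, {1})       # stale snapshot, overlapping bucket
    conflicted = False
except ConflictError:
    conflicted = True
v3 = meta.commit(meta.version, {1})  # retry against a fresh snapshot
print(v1, v2, conflicted, v3)
```

No writer blocks another here: validation happens only at commit time, and a failed writer simply retries from a newer snapshot.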

TimeTravel and Incremental Queries

TT2 enables TimeTravel queries that retrieve historical versions by locating the nearest base file and merging subsequent delta files. Incremental queries read only delta files within a specified time window, ignoring files generated by clustering or compaction.
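Both query types can be sketched over a toy file catalog in Python. The `DataFile` fields, the integer versions standing in for timestamps, and the `origin` tag used to skip files produced by clustering or compaction are assumptions for illustration; "nearest base file" is read here as the latest base at or before the requested version.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    kind: str      # "base" or "delta"
    version: int   # commit version the file covers up to
    origin: str    # "write", "cluster", or "compact"
    rows: dict     # key -> value; None marks a delete


def time_travel(files, as_of):
    """Rebuild the table state at `as_of` from a base file plus later deltas."""
    bases = [f for f in files if f.kind == "base" and f.version <= as_of]
    base = max(bases, key=lambda f: f.version, default=None)
    state = dict(base.rows) if base else {}
    start = base.version if base else 0
    deltas = [f for f in files
              if f.kind == "delta" and f.origin == "write"
              and start < f.version <= as_of]
    for delta in sorted(deltas, key=lambda f: f.version):
        for key, value in delta.rows.items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state


def incremental(files, after, upto):
    """Return changes written in (after, upto], skipping optimizer output."""
    changes = []
    for f in sorted(files, key=lambda f: f.version):
        if f.kind == "delta" and f.origin == "write" and after < f.version <= upto:
            changes.extend((f.version, k, v) for k, v in f.rows.items())
    return changes


files = [
    DataFile("delta", 1, "write", {1: "a"}),
    DataFile("delta", 2, "write", {1: "b", 2: "x"}),
    DataFile("base", 2, "compact", {1: "b", 2: "x"}),
    DataFile("delta", 3, "write", {2: None, 3: "c"}),
]
print(time_travel(files, 1))
print(time_travel(files, 3))
print(incremental(files, 1, 3))
```

The incremental query skips the compacted base file because it only restates changes that the original delta files already carried.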

Historical Data Retention

The acid.data.retain.hours property controls how long historical versions are kept; older data is automatically purged, with a manual purge command available for special cases.
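The retention rule can be sketched in Python as a cutoff computed from the configured hours. The function name and the sample timestamps are hypothetical, and the choice to always keep the latest version regardless of age is an assumption of this sketch, not a documented behavior.

```python
from datetime import datetime, timedelta


def purgeable_versions(commit_times, now, retain_hours):
    """Return versions older than the retention window, never the latest one."""
    cutoff = now - timedelta(hours=retain_hours)
    latest = max(commit_times)  # highest version number
    return sorted(
        version
        for version, committed_at in commit_times.items()
        if committed_at < cutoff and version != latest
    )


now = datetime(2023, 6, 27, 12, 0)
commit_times = {
    1: now - timedelta(hours=48),
    2: now - timedelta(hours=30),
    3: now - timedelta(hours=2),
}
expired = purgeable_versions(commit_times, now, retain_hours=24)
print(expired)
```

With a 24-hour window, versions 1 and 2 fall outside the window and become eligible for purging, which also means TimeTravel queries can no longer reach them.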

Data Ingestion Tools

DataWorks data integration for full and incremental sync.

MaxCompute Flink Connector for near‑real‑time upsert.

MaxCompute MMA for large‑scale Hive migration.

Alibaba Cloud Real‑Time Compute Flink Connector.

MaxCompute SDK (not recommended for production).

MaxCompute SQL batch import.

Key Features

Unified storage, metadata, and compute.

Full SQL support with upsert, TimeTravel, and incremental syntax.

Optimized data ingestion tools for complex scenarios.

Seamless integration with existing MaxCompute workloads.

Automatic data‑file management for stability and performance.

Fully managed, ready‑to‑use service with no extra integration cost.

Application Practice and Future Roadmap

The integrated architecture addresses the pain points of Lambda architectures, reducing redundant computation and storage costs while delivering minute‑level latency for incremental workloads. Planned enhancements include richer SQL features, expanded ingestion tools, automated minute‑level pipeline scheduling, further query optimizations, and broader ecosystem integration.

