
Design and Architecture of MaxCompute Lakehouse Near‑Real‑Time Incremental Processing

This article explains the evolution of Alibaba Cloud's MaxCompute platform into a lakehouse architecture that supports near‑real‑time incremental processing, detailing its development history, core design of transactional tables, five‑module technical stack, data ingestion methods, optimization services, transaction management, query capabilities, ecosystem integration, practical applications, future roadmap, and common user questions.

DataFunTalk

MaxCompute, Alibaba Cloud's self‑developed massive data processing platform, has evolved from a traditional offline data warehouse into a unified lakehouse that offers near‑real‑time incremental processing capabilities.

The platform’s development progressed from focusing solely on formatted data warehousing to supporting diverse external data sources through an external table mechanism, ultimately enabling a lakehouse solution that breaks data silos.

The near‑real‑time incremental processing architecture consists of five key modules: data ingestion, compute engine, data‑optimization service, metadata management, and data‑file organization, all built on a unified storage and compute foundation.

Core to the design is Transactional Table 2.0 (TT2), which requires a primary key and the transactional table property, and provides ACID guarantees, upsert, time‑travel queries, bucketed storage, and columnar compression for efficient reads and writes.
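As a minimal sketch, a TT2 table can be declared by marking the primary key NOT NULL and setting the transactional property; the table and column names below are illustrative, and the write.bucket.num property (controlling the bucket count for bucketed storage) is an assumption based on MaxCompute's documented table properties.

```sql
-- Illustrative TT2 table: primary key is mandatory and must be NOT NULL,
-- and the table must carry the transactional property.
CREATE TABLE orders_tt2 (
  order_id BIGINT NOT NULL,
  status   STRING,
  amount   DOUBLE,
  PRIMARY KEY (order_id)
)
TBLPROPERTIES (
  "transactional"    = "true",  -- enables ACID / upsert / time travel
  "write.bucket.num" = "16"     -- assumed property: bucket count for bucketed storage
);
```

With the primary key in place, repeated writes of the same order_id behave as upserts rather than appends.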

Data ingestion supports both batch and minute‑level incremental loads via tools such as the MaxCompute Flink Connector, DataWorks integration, MMA, SDK, and SQL, all leveraging the Tunnel Server for high‑concurrency writes.

The compute engine includes the MC SQL engine that parses, optimizes, and executes DDL/DML/DQL statements, with ongoing integration of Spark for extended capabilities.

Data‑optimization services, implemented by the internal Storage Service, provide clustering (small‑file merging) and compaction (merging base and delta files). Both services operate at bucket granularity and coordinate with the Meta Service for transactional consistency.
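Clustering and compaction normally run automatically in the Storage Service, but a sketch of triggering compaction by hand is shown below; the exact statement form is an assumption modeled on MaxCompute's ALTER TABLE syntax, and the table name is illustrative.

```sql
-- Hypothetical sketch: manually request a major compaction, which merges
-- base and delta files for each bucket of the table. In normal operation
-- the Storage Service schedules this automatically at bucket granularity.
ALTER TABLE orders_tt2 COMPACT MAJOR;
```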

Transaction management uses an MVCC model with optimistic concurrency control, offering snapshot isolation and conflict detection; helper functions like get_latest_timestamp and get_latest_version simplify time‑travel queries.
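The helper functions named in the article can be sketched in time‑travel queries as follows; the table name is illustrative, and the TIMESTAMP AS OF / VERSION AS OF clause forms are assumptions based on MaxCompute's documented time‑travel syntax.

```sql
-- Read the table as of its most recent commit timestamp.
SELECT * FROM orders_tt2
TIMESTAMP AS OF get_latest_timestamp('orders_tt2');

-- Read the table as of its most recent committed version.
SELECT * FROM orders_tt2
VERSION AS OF get_latest_version('orders_tt2');
```

Under snapshot isolation, each query sees a consistent snapshot at the chosen commit point regardless of concurrent writers.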

Incremental queries employ a BETWEEN … syntax that reads only relevant delta files within the specified time window, automatically excluding files generated by clustering or compaction.
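A hedged sketch of such an incremental query is below; the table name and timestamps are illustrative, and the TIMESTAMP BETWEEN clause form is an assumption based on MaxCompute's documented incremental‑query syntax.

```sql
-- Read only the delta data committed within the ten-minute window.
-- Files produced by clustering or compaction are excluded automatically,
-- so no row is returned twice.
SELECT * FROM orders_tt2
TIMESTAMP BETWEEN '2024-01-01 00:00:00' AND '2024-01-01 00:10:00';
```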

Historical data retention is configurable via acid.data.retain.hours, with a purge command for manual cleanup; the ecosystem includes integration tools such as DataWorks, Flink Connector, MMA, and MaxCompute SQL.
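The retention setting and manual cleanup can be sketched as below; the property name comes from the article, while the ALTER TABLE and PURGE statement forms are assumptions and the table name is illustrative.

```sql
-- Keep historical (time-travel) data for 48 hours before it becomes
-- eligible for cleanup.
ALTER TABLE orders_tt2 SET TBLPROPERTIES ("acid.data.retain.hours" = "48");

-- Hypothetical sketch: manually purge expired historical data instead of
-- waiting for the background cleanup.
PURGE TABLE orders_tt2;
```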

Key advantages of the new architecture are unified storage/metadata/compute, low storage cost, high query performance, automatic file management, seamless migration from existing MaxCompute workloads, and a fully managed, plug‑and‑play experience.

Practical deployments demonstrate how the architecture resolves Lambda‑style pain points, reduces redundant computation and storage, and offers minute‑level latency while maintaining batch efficiency; future plans include richer SQL features, schema evolution, more ingestion tools, automated minute‑level pipeline scheduling, performance tuning, and broader third‑party engine support.

The Q&A section addresses bucket‑number recommendations, commit versus compaction frequencies, the need for specialized incremental optimizers, and the rollout strategy, emphasizing that users can adopt the new architecture transparently by creating TT2 tables.

Tags: real-time, Big Data, SQL, MaxCompute, data lake, Lakehouse, Incremental Processing
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
