
MaxCompute Incremental Update, Processing Architecture, and Intelligent Data Warehouse Optimizations

This article presents a comprehensive overview of MaxCompute's incremental update and processing architecture, the design of intelligent materialized views, and the engine's adaptive execution optimizations, detailing the integrated near‑real‑time and batch pipelines, transactional table 2.0, and practical Q&A.

DataFunTalk

The presentation, titled "MaxCompute Incremental Update and Processing Architecture and New Intelligent Data Warehouse Optimizations," is divided into three major parts: the design of the incremental update and processing architecture, the evolution and design of intelligent materialized views, and adaptive advancements in the SQL engine.

1. Incremental Update and Processing Architecture – For most business scenarios, minute‑level or hour‑level incremental processing is sufficient, while second‑level real‑time processing requires separate streaming systems. A Lambda‑style architecture combines MaxCompute offline batch processing with a real‑time incremental layer, but suffers from redundancy and cost issues. The proposed integrated architecture unifies data ingestion, compute engine, storage service, metadata service, and file organization, reducing duplication and improving latency.

The integrated pipeline supports both full‑load and near‑real‑time incremental ingestion via tools such as Flink Connector and DataWorks, leveraging MaxCompute's Tunnel service. It provides upsert and delete interfaces, commit semantics with read‑commit isolation, and supports both batch and real‑time writes.
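The upsert/delete interface and read-committed visibility described above can be sketched as follows. This is a minimal illustrative model, not the actual MaxCompute Tunnel API; the class and method names (`KVTable`, `commit`, `read`) are assumptions for the sake of the example.

```python
# Illustrative sketch of primary-key upsert/delete with atomic commit and
# read-committed visibility. Names are hypothetical, not Tunnel interfaces.
class KVTable:
    def __init__(self):
        self._committed = {}   # pk -> row, visible to readers
        self._pending = []     # ops buffered in the open transaction

    def upsert(self, pk, row):
        self._pending.append(("upsert", pk, row))

    def delete(self, pk):
        self._pending.append(("delete", pk, None))

    def commit(self):
        # Apply buffered ops atomically; readers never see a partial batch.
        for op, pk, row in self._pending:
            if op == "upsert":
                self._committed[pk] = row
            else:
                self._committed.pop(pk, None)
        self._pending.clear()

    def read(self, pk):
        # Read-committed isolation: only committed data is visible.
        return self._committed.get(pk)

t = KVTable()
t.upsert(1, {"name": "a"})
print(t.read(1))   # None: the write is not yet committed
t.commit()
print(t.read(1))   # {'name': 'a'}
```

The point of the sketch is the visibility rule: writes buffered between commits are invisible to readers, which is what lets both batch and real-time writers share one table.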

2. Technical Architecture of the Integrated Pipeline – The architecture consists of five modules: data ingestion layer (supporting diverse sources and the Tunnel service), compute engine layer (MC‑developed SQL engine and third‑party engines handling timetravel and incremental scenarios), storage service layer (handling small‑file clustering, compaction, sorting), metadata service layer (managing transactions, versioning, timetravel), and file organization layer (managing base and delta file formats).

3. Transactional Table 2.0 (TT2) – TT2 introduces a new table type that enables primary‑key based upsert and ACID‑compatible transactions. It stores data in Base Files (compact, columnar) and Delta Files (incremental, timetravel‑enabled). TT2 supports bucketed storage for efficient writes and reads, and provides automatic clustering, compaction, and timetravel queries.
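Bucketed storage, which TT2 uses to parallelize writes and reads, can be sketched as routing each row to a bucket by a hash of its primary key. The bucket count and hash function below are illustrative choices, not MaxCompute's actual implementation.

```python
import zlib

# Sketch of TT2-style bucketed storage: each primary key is routed to a
# fixed bucket, so writers and readers can operate bucket-by-bucket.
# NUM_BUCKETS and the crc32 hash are illustrative assumptions.
NUM_BUCKETS = 4

def bucket_of(pk: str) -> int:
    return zlib.crc32(pk.encode()) % NUM_BUCKETS

buckets = {i: [] for i in range(NUM_BUCKETS)}
for pk in ["user1", "user2", "user3", "user4"]:
    buckets[bucket_of(pk)].append(pk)
```

Because the routing is deterministic, an upsert for a given key always lands in the same bucket, which keeps per-bucket clustering and compaction local.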

4. Near‑Real‑Time Incremental Write – Implemented via Flink Connector, DataWorks, and the Tunnel SDK, supporting minute‑level concurrent writes with upsert/delete semantics and atomic commit.

5. Batch Write – Achieved through an extended DML syntax that integrates compiler, optimizer, and runtime modifications to handle primary‑key deduplication, upsert construction, and transaction management.
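The primary-key deduplication step in batch upserts can be sketched as "last row wins" within a batch. The function and field names below are hypothetical; the real work happens inside the extended DML's compiler and runtime.

```python
# Sketch of batch-upsert deduplication: when one batch carries several
# rows for the same primary key, only the last occurrence survives.
def dedup_last_wins(rows, key="id"):
    latest = {}
    for row in rows:            # later rows overwrite earlier ones
        latest[row[key]] = row
    return list(latest.values())

batch = [
    {"id": 1, "v": "old"},
    {"id": 2, "v": "x"},
    {"id": 1, "v": "new"},
]
print(dedup_last_wins(batch))   # id 1 keeps v="new"
```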

6. Data Organization Optimization Services – Clustering merges small delta files, while Compaction merges delta files into base files, reducing storage overhead and improving query performance.
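The compaction step can be sketched as folding a sequence of delta files, oldest first, into the base file, after which the deltas become redundant. The dict-based file model and the `None` tombstone convention are illustrative assumptions.

```python
# Sketch of compaction: delta files (dicts of pk -> row, oldest to
# newest; None marks a delete) are folded into the base file.
def compact(base, deltas):
    merged = dict(base)
    for delta in deltas:                 # apply oldest first
        for pk, row in delta.items():
            if row is None:              # tombstone: drop the key
                merged.pop(pk, None)
            else:
                merged[pk] = row
    return merged

base = {1: "a", 2: "b"}
deltas = [{2: "b2", 3: "c"}, {1: None}]
print(compact(base, deltas))   # {2: 'b2', 3: 'c'}
```

After compaction the query engine reads one compact columnar base file instead of many small deltas, which is where the storage and query-performance gains come from.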

7. Timetravel and Incremental Query – Timetravel retrieves historical versions by combining the latest base file with subsequent delta files. Incremental queries use open‑closed time intervals to read only relevant delta files, distinguishing them from timetravel queries.
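The two read paths above can be sketched over a list of timestamped deltas: timetravel replays everything up to a point in time onto the base snapshot, while an incremental query returns only the deltas inside an open-closed interval. Timestamps and the file layout are illustrative.

```python
# Sketch of the two read paths over timestamped delta files.
def timetravel(base, deltas, t):
    # Reconstruct the table state as of time t: base plus all deltas <= t.
    state = dict(base)
    for ts, delta in deltas:
        if ts <= t:
            state.update(delta)
    return state

def incremental(deltas, t1, t2):
    # Open-closed interval (t1, t2]: only the changes, not the full state.
    return [delta for ts, delta in deltas if t1 < ts <= t2]

base = {1: "a"}
deltas = [(10, {2: "b"}), (20, {1: "a2"}), (30, {3: "c"})]
print(timetravel(base, deltas, 20))   # {1: 'a2', 2: 'b'}
print(incremental(deltas, 10, 30))    # deltas at ts 20 and 30 only
```

The open-closed interval is what makes chained incremental reads composable: the changes in (0, 10] followed by (10, 30] cover exactly (0, 30] with no gap or overlap.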

8. Feature Summary – The architecture offers unified storage, metadata, and compute; a full‑stack SQL syntax; deep‑customized data ingestion tools; seamless integration with existing MaxCompute workloads; automated file management; and a fully managed cloud service.

9. Intelligent Materialized View Evolution – Materialized views store pre‑computed results to accelerate queries. The system now supports partition‑penetration, query rewrite, automatic refresh, and a recommendation engine that analyzes job histories, extracts candidate sub‑expressions (favoring aggregates and joins), normalizes them, and evaluates resource impact to suggest optimal materialized views.
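The recommendation step can be sketched as counting normalized candidate sub-expressions across job histories and ranking them by frequency times estimated cost saved. The scoring formula, cost numbers, and expression strings below are illustrative assumptions, not MaxCompute's actual model.

```python
from collections import Counter

# Sketch of materialized-view recommendation: candidates that recur
# across many jobs and save the most resources rank highest.
def recommend(job_subexprs, cost_of, top_k=2):
    freq = Counter()
    for subexprs in job_subexprs:        # one list per historical job
        freq.update(set(subexprs))       # count each expr once per job
    scored = {e: n * cost_of[e] for e, n in freq.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

jobs = [
    ["agg(sales by day)", "join(orders,users)"],
    ["agg(sales by day)"],
    ["join(orders,users)", "filter(region)"],
]
costs = {"agg(sales by day)": 50, "join(orders,users)": 80, "filter(region)": 5}
print(recommend(jobs, costs))   # join and aggregate rank highest
```

This mirrors the stated preference for aggregates and joins: they dominate the ranking because they are both frequent and expensive, while cheap filters fall out of the top candidates.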

10. Adaptive Execution Optimizations – The MaxCompute SQL engine performs multi‑level adaptive optimization: (a) Adaptive execution across compiler, optimizer, and runtime based on static and dynamic statistics; (b) Dynamic DAG adjustments at the plan level (choosing between Shuffle Join and Map Join) and stage level (adjusting concurrency and handling data skew); (c) Worker‑level DAG decisions (selecting Hash Join vs. Merge Join based on data size); (d) Operator‑level adaptive choices (e.g., switching aggregation or sorting algorithms based on runtime characteristics).
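The runtime join selection described in (b) and (c) can be sketched as a decision made once actual input sizes are known: a small build side favors a broadcast/Map Join, two pre-sorted inputs favor a Merge Join, and everything else falls back to a shuffle Hash Join. The threshold and decision order are illustrative assumptions.

```python
# Sketch of adaptive join selection based on runtime statistics.
BROADCAST_LIMIT_MB = 25   # illustrative threshold, not MaxCompute's

def choose_join(left_mb, right_mb, left_sorted, right_sorted):
    if min(left_mb, right_mb) <= BROADCAST_LIMIT_MB:
        # Small side fits in memory on every worker: broadcast it.
        return "map (broadcast hash) join"
    if left_sorted and right_sorted:
        # Both inputs already ordered on the key: skip the hash table.
        return "merge join"
    return "shuffle hash join"

print(choose_join(10, 5000, False, False))   # small side -> map join
print(choose_join(800, 900, True, True))     # both sorted -> merge join
print(choose_join(800, 900, False, True))    # -> shuffle hash join
```

Deferring this choice until real sizes are observed is the essence of the adaptive DAG adjustment: the static plan keeps both options open, and the cheap one is picked per stage at run time.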

11. Q&A Highlights – Differences between materialized views and physical tables, lifecycle management of materialized views, and guidance on choosing between Hash Join and Merge Join based on table sizes and data ordering.

Tags: Big Data · Data Warehouse · MaxCompute · Incremental Update · Materialized View · Adaptive Execution
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
