Building a Lakehouse on Alibaba Cloud AnalyticDB (ADB) with Apache Hudi: Architecture, Challenges, and Practices
This article gives a technical overview of the Lakehouse edition of Alibaba Cloud AnalyticDB (ADB): its unified architecture, its key advantages, the challenges of ingesting billions of records with Apache Hudi, and the engineering solutions that enable a cost-effective, high-performance lake-to-warehouse platform. Those solutions cover Flink integration, hotspot mitigation, memory optimization, OSS throttling handling, concurrent write support, lifecycle management, and TableService.
Li Shaofeng, a database expert from Alibaba Cloud, introduces the ADB Lakehouse edition, which unifies data collection, storage, computation, management, and application layers, offering low‑cost, high‑throughput storage, elastic resources, and a unified metadata service.
The architecture provides five upgraded capabilities: a one‑click data pipeline (APS) for easy ingestion, Hudi‑based storage supporting both offline and online workloads, an enhanced XIHE/BSP SQL engine plus Spark for complex processing, a unified metadata and permission service, and AI‑enabled analytics via Spark.
Key advantages include resource and experience integration, enabling a single data copy to serve both batch and real‑time scenarios with elastic scaling and cost reduction.
When building a lakehouse with Hudi, the team faced challenges such as sustaining 4 GB/s of ingestion throughput, severe data skew, massive scan volumes, and the limited elasticity of traditional warehouses.
The solution architecture uses SLS as the source, Flink for real-time processing, and Hudi for storage. A coordinated commit protocol ensures exactly-once semantics, and execution falls back automatically between MPP and BSP modes.
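To make the idea of a coordinated commit concrete, here is a minimal sketch in Python. It is not the ADB or Hudi implementation: the class name CommitCoordinator, its methods, and the file names are hypothetical, and the sketch assumes the common pattern in which every writer reports the files it flushed for a checkpoint and the coordinator publishes them only once all writers have reported. A repeated completion notice for the same checkpoint (for example after a restart) is ignored, which is what keeps the result exactly-once.

```python
from typing import Dict, List


class CommitCoordinator:
    """Hypothetical coordinator: commits a checkpoint only when all writers acked."""

    def __init__(self, num_writers: int) -> None:
        self.num_writers = num_writers
        # checkpoint_id -> {writer_id: files flushed by that writer}
        self.pending: Dict[int, Dict[int, List[str]]] = {}
        # checkpoint_id -> files made visible to readers
        self.committed: Dict[int, List[str]] = {}

    def report(self, checkpoint_id: int, writer_id: int, files: List[str]) -> None:
        """Record a writer's flushed files; replays overwrite, late replays are dropped."""
        if checkpoint_id in self.committed:
            return
        self.pending.setdefault(checkpoint_id, {})[writer_id] = list(files)

    def on_checkpoint_complete(self, checkpoint_id: int) -> bool:
        """Commit once per checkpoint; return True only when a commit happened."""
        if checkpoint_id in self.committed:
            return False  # already committed, duplicate notification
        acks = self.pending.get(checkpoint_id, {})
        if len(acks) < self.num_writers:
            return False  # some writer has not flushed yet
        self.committed[checkpoint_id] = sorted(
            f for files in acks.values() for f in files
        )
        del self.pending[checkpoint_id]
        return True


coordinator = CommitCoordinator(num_writers=2)
coordinator.report(1, writer_id=0, files=["part-0.parquet"])
first_try = coordinator.on_checkpoint_complete(1)   # writer 1 is still missing
coordinator.report(1, writer_id=1, files=["part-1.parquet"])
second_try = coordinator.on_checkpoint_complete(1)  # all writers acked
third_try = coordinator.on_checkpoint_complete(1)   # duplicate notification
```

In this sketch data files only become visible through the committed map, so a failure between flush and commit leaves nothing half-published.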
To address hotspot and OOM issues, a hotspot‑shuffling mechanism, memory‑efficient Parquet writing, and OSS request optimizations (timeline‑based checkpoint metadata, SDK tweaks, and marker‑file caching) were implemented.
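The source does not spell out how the hotspot-shuffling mechanism works, so the following is only a generic illustration of the underlying idea: detect keys that dominate the stream and spread their records over several writer tasks, while all other keys keep a stable hash assignment. The function names, the 50% threshold, and the fanout value are assumptions made for this sketch.

```python
import zlib
from collections import Counter
from typing import Iterable, List, Set


def stable_hash(key: str) -> int:
    """Deterministic hash (unlike built-in hash(), which is salted per process)."""
    return zlib.crc32(key.encode("utf-8"))


def detect_hot_keys(keys: List[str], threshold: float) -> Set[str]:
    """Keys whose share of all records exceeds the threshold are treated as hot."""
    counts = Counter(keys)
    return {k for k, c in counts.items() if c / len(keys) > threshold}


def assign_task(key: str, seq: int, hot_keys: Set[str], num_tasks: int, fanout: int) -> int:
    """Hot keys rotate over `fanout` neighbouring tasks; other keys stay on one task."""
    base = stable_hash(key) % num_tasks
    if key in hot_keys:
        return (base + seq % fanout) % num_tasks
    return base


def task_loads(keys: Iterable[str], hot_keys: Set[str], num_tasks: int, fanout: int) -> List[int]:
    """Count how many records each writer task would receive."""
    loads = [0] * num_tasks
    seen: Counter = Counter()
    for key in keys:
        loads[assign_task(key, seen[key], hot_keys, num_tasks, fanout)] += 1
        seen[key] += 1
    return loads


# Skewed sample: one partition carries 80% of the records.
sample = ["dt=hot"] * 800 + [f"dt=cold{i}" for i in range(20) for _ in range(10)]
hot = detect_hot_keys(sample, threshold=0.5)
loads_plain = task_loads(sample, set(), num_tasks=8, fanout=4)
loads_shuffled = task_loads(sample, hot, num_tasks=8, fanout=4)
```

Comparing max(loads_plain) with max(loads_shuffled) shows the busiest task receiving far fewer records once the hot key is spread out. A real upsert pipeline would also need to decide how records with the same key are reconciled after being spread; that is outside this sketch.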
Concurrent write support was achieved by isolating checkpoint metadata, view storage, and table service instances, and by adding retry strategies for instant generation.
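A retry strategy for instant generation can be pictured as follows. This is a hedged sketch, not Hudi's code: the Timeline class, the new_instant function, and the timestamp strings are hypothetical. It assumes that instants are derived from a clock, that two concurrent writers can therefore produce the same value, and that the loser of the race simply backs off and tries again with a fresh value.

```python
import threading
import time
from typing import Callable, Set


class Timeline:
    """Hypothetical in-memory timeline with an atomic create-if-absent operation."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._instants: Set[str] = set()

    def try_create(self, instant: str) -> bool:
        """Reserve the instant; return False if another writer already holds it."""
        with self._lock:
            if instant in self._instants:
                return False
            self._instants.add(instant)
            return True


def new_instant(
    timeline: Timeline,
    clock: Callable[[], str],
    max_retries: int = 5,
    backoff_s: float = 0.001,
) -> str:
    """Generate an instant, retrying with exponential backoff on collision."""
    for attempt in range(max_retries):
        instant = clock()
        if timeline.try_create(instant):
            return instant
        time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("could not reserve a unique instant")


# Simulated clock: two writers read the same timestamp before it advances.
ticks = iter(["20240101000000000", "20240101000000000", "20240101000000001"])
timeline = Timeline()
writer_a = new_instant(timeline, clock=lambda: next(ticks))
writer_b = new_instant(timeline, clock=lambda: next(ticks))
```

Here writer_b collides on its first attempt and succeeds on the retry, so the two writers end up with distinct instants.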
Lifecycle management features allow partition‑level data retention based on size, count, or expiration time, supporting concurrent policy updates.
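The three retention criteria can be illustrated with a small policy evaluator. The names Partition, RetentionPolicy, and partitions_to_expire are hypothetical, and the sketch assumes one plausible reading of the feature: partitions are ranked newest first, and once any limit (count, total size, or age) is hit, that partition and every older one are expired.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Partition:
    name: str
    size_bytes: int
    created_ts: int  # epoch seconds


@dataclass(frozen=True)
class RetentionPolicy:
    max_partitions: Optional[int] = None    # keep at most this many partitions
    max_total_bytes: Optional[int] = None   # keep at most this much data
    max_age_seconds: Optional[int] = None   # drop partitions older than this


def partitions_to_expire(
    partitions: List[Partition], policy: RetentionPolicy, now_ts: int
) -> List[str]:
    """Return names of partitions to drop, newest first, under the given policy."""
    newest_first = sorted(partitions, key=lambda p: p.created_ts, reverse=True)
    expired: List[str] = []
    kept_count = 0
    kept_bytes = 0
    cut = False  # once a limit is hit, all older partitions expire as well
    for p in newest_first:
        too_old = (
            policy.max_age_seconds is not None
            and now_ts - p.created_ts > policy.max_age_seconds
        )
        too_many = (
            policy.max_partitions is not None and kept_count >= policy.max_partitions
        )
        too_big = (
            policy.max_total_bytes is not None
            and kept_bytes + p.size_bytes > policy.max_total_bytes
        )
        if cut or too_old or too_many or too_big:
            cut = True
            expired.append(p.name)
        else:
            kept_count += 1
            kept_bytes += p.size_bytes
    return expired


parts = [
    Partition("d1", 10, 100),
    Partition("d2", 10, 200),
    Partition("d3", 10, 300),
    Partition("d4", 10, 400),
]
by_count = partitions_to_expire(parts, RetentionPolicy(max_partitions=2), now_ts=400)
by_size = partitions_to_expire(parts, RetentionPolicy(max_total_bytes=30), now_ts=400)
by_age = partitions_to_expire(parts, RetentionPolicy(max_age_seconds=150), now_ts=400)
```

Because the policy is an immutable value passed in on each evaluation, swapping in an updated policy between runs does not interfere with an evaluation already in progress.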
The independent TableService handles background compaction, commit cleanup, and asynchronous clustering, improving query performance by over 40% while keeping table state size under control.
In practice, the ADB Lakehouse combines APS for low‑cost, low‑latency ingestion, Spark for batch and ML workloads, and zero‑ETL integration that lets users query ADB tables directly from Spark without data movement.
Overall, the ADB Lakehouse provides a cloud‑native, one‑stop data analysis platform that bridges the gap between data lakes and warehouses, delivering high performance, elasticity, and cost efficiency.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
