Building a Cloud‑Native Lakehouse with Apache Iceberg and Amoro
This article introduces the background of lakehouse architecture, the challenges it faces, and cloud‑native solutions to them; explains Apache Iceberg’s open table format and its cloud‑native features; details Amoro’s management and self‑optimizing capabilities; walks through three real‑world cloud migration cases; and outlines future development plans.
The article begins with an overview of the evolution from traditional data warehouses to data lakes and finally to integrated lakehouse solutions, highlighting the need for low‑cost storage, support for semi‑structured data, and flexible compute architectures.
It then describes the main characteristics of lakehouse systems, including low‑cost storage, structured data processing with schema evolution, open compute architectures supporting batch, streaming, and graph workloads, and standardized metrics and catalog services.
Apache Iceberg is presented as an open table format that uses decentralized metadata, decouples from HDFS, and provides a standard catalog interface and REST catalog API, making it well‑suited for cloud‑native deployments.
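Because the REST catalog is just an HTTP endpoint, pointing an engine at it is a matter of configuration. As a hedged sketch, the following shows how a Spark SQL session might be wired to an Iceberg REST catalog; the catalog name (`lakehouse`), endpoint URL, and bucket path are illustrative, while the configuration keys themselves come from Iceberg’s Spark catalog properties:

```shell
# Illustrative: attach a Spark SQL session to an Iceberg REST catalog.
# The catalog name, URI, and warehouse location below are placeholders.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lakehouse.type=rest \
  --conf spark.sql.catalog.lakehouse.uri=http://rest-catalog.example.com:8181 \
  --conf spark.sql.catalog.lakehouse.warehouse=s3://my-bucket/warehouse
```

Any engine that speaks the same REST catalog API can share this metadata without depending on HDFS or a Hive Metastore, which is what makes the format attractive for cloud‑native deployments.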
Amoro is introduced as a lakehouse management platform built on top of open table formats like Iceberg, offering catalog services, self‑optimizing mechanisms, and plug‑in optimizer containers that support local, Flink, Yarn, and Kubernetes clusters.
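To make the plug‑in container idea concrete, the sketch below shows what an AMS configuration with several optimizer containers might look like. This is an assumption‑laden illustration of the documented pattern, not a verbatim config: the implementation class names, file paths, and container names should all be checked against the Amoro release you deploy.

```yaml
# Illustrative "containers" section of an AMS config.yaml;
# class names and paths are assumptions following Amoro's documented pattern.
containers:
  - name: localContainer
    container-impl: org.apache.amoro.server.manager.LocalOptimizerContainer
  - name: flinkContainer
    container-impl: org.apache.amoro.server.manager.FlinkOptimizerContainer
    properties:
      flink-home: /opt/flink
  - name: kubernetesContainer
    container-impl: org.apache.amoro.server.manager.KubernetesOptimizerContainer
    properties:
      kube-config-path: ~/.kube/config
```

Optimizer groups are then bound to one of these containers, so the same self‑optimizing logic can run on a local process, a Flink or Yarn cluster, or Kubernetes without changing table‑side configuration.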
The self‑optimizing feature automatically classifies data files as fragments or segments, then runs minor, major, and full optimizing to merge small files, reduce delete files, and improve query performance, while managing resources through optimizer groups and quotas.
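The fragment/segment split described above can be sketched in a few lines of Python. The default values mirror Amoro’s documented `self-optimizing.target-size` (128 MB) and `self-optimizing.fragment-ratio` (8) table properties; the function name and structure are illustrative, not Amoro’s actual implementation:

```python
# Hedged sketch of Amoro-style file classification (names are illustrative).
# Files smaller than target-size / fragment-ratio are "fragments": small files
# that minor optimizing will merge; larger files are "segments".

TARGET_SIZE = 128 * 1024 * 1024          # self-optimizing.target-size (default 128 MB)
FRAGMENT_RATIO = 8                       # self-optimizing.fragment-ratio (default 8)
FRAGMENT_THRESHOLD = TARGET_SIZE // FRAGMENT_RATIO   # 16 MB with the defaults


def classify(file_size_bytes: int) -> str:
    """Classify one data file by size: merge candidate vs. already well-sized."""
    return "fragment" if file_size_bytes < FRAGMENT_THRESHOLD else "segment"


if __name__ == "__main__":
    sizes = [1 * 1024 * 1024, 12 * 1024 * 1024, 64 * 1024 * 1024]
    for size in sizes:
        print(f"{size // (1024 * 1024)} MB -> {classify(size)}")
```

Under these defaults a 1 MB or 12 MB file is a fragment while a 64 MB file is a segment; the scheduler can then decide per optimizer group how much quota to spend merging the fragment set.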
Three cloud‑native lakehouse case studies are detailed: (1) a migration of a Hive‑based system to AWS using Spark SQL, S3, Alluxio, Iceberg, and Amoro; (2) an external company building a lakehouse on AWS S3 with Iceberg, Glue, EMR, and Amoro; (3) a solution using Amoro AMS as the metadata center with the Iceberg REST Catalog on AWS EKS.
Finally, the article outlines Amoro’s future roadmap, including support for additional lake formats (Paimon, Hudi), dynamic scheduling of self‑optimizing tasks, standard command‑line tools for data access, and a unified permission model integrating with Ranger, AWS, and other cloud providers.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.