Building a Cloud‑Native Lakehouse with Apache Iceberg and Amoro
This article introduces the background of lakehouse architecture, the challenges it faces, and cloud‑native solutions to them; explains Apache Iceberg’s open table format and its cloud‑native features; details Amoro’s management and self‑optimizing capabilities; walks through three real‑world cloud migration cases; and outlines future development plans.
The article begins with an overview of the evolution from traditional data warehouses to data lakes and finally to integrated lakehouse solutions, highlighting the need for low‑cost storage, support for semi‑structured data, and flexible compute architectures.
It then describes the main characteristics of lakehouse systems, including low‑cost storage, structured data processing with schema evolution, open compute architectures supporting batch, streaming, and graph workloads, and standardized metrics and catalog services.
Apache Iceberg is presented as an open table format that uses decentralized metadata, decouples from HDFS, and provides a standard catalog interface and REST catalog API, making it well‑suited for cloud‑native deployments.
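Because the REST catalog is just an HTTP endpoint, pointing an engine at it is a matter of configuration. As a hedged sketch, the following shows how a Spark SQL session might be wired to an Iceberg REST catalog; the catalog name (`lakehouse`), endpoint URL, and bucket path are illustrative, while the configuration keys themselves come from Iceberg’s Spark catalog properties:

```shell
# Illustrative: attach a Spark SQL session to an Iceberg REST catalog.
# The catalog name, URI, and warehouse location below are placeholders.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lakehouse.type=rest \
  --conf spark.sql.catalog.lakehouse.uri=http://rest-catalog.example.com:8181 \
  --conf spark.sql.catalog.lakehouse.warehouse=s3://my-bucket/warehouse
```

Any engine that speaks the same REST catalog API can share this metadata without depending on HDFS or a Hive Metastore, which is what makes the format attractive for cloud‑native deployments.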
Amoro is introduced as a lakehouse management platform built on top of open table formats like Iceberg, offering catalog services, self‑optimizing mechanisms, and plug‑in optimizer containers that support local, Flink, Yarn, and Kubernetes clusters.
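To make the plug‑in container idea concrete, the sketch below shows what an AMS configuration with several optimizer containers might look like. This is an assumption‑laden illustration of the documented pattern, not a verbatim config: the implementation class names, file paths, and container names should all be checked against the Amoro release you deploy.

```yaml
# Illustrative "containers" section of an AMS config.yaml;
# class names and paths are assumptions following Amoro's documented pattern.
containers:
  - name: localContainer
    container-impl: org.apache.amoro.server.manager.LocalOptimizerContainer
  - name: flinkContainer
    container-impl: org.apache.amoro.server.manager.FlinkOptimizerContainer
    properties:
      flink-home: /opt/flink
  - name: kubernetesContainer
    container-impl: org.apache.amoro.server.manager.KubernetesOptimizerContainer
    properties:
      kube-config-path: ~/.kube/config
```

Optimizer groups are then bound to one of these containers, so the same self‑optimizing logic can run on a local process, a Flink or Yarn cluster, or Kubernetes without changing table‑side configuration.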
The self‑optimizing feature automatically classifies data files as fragments or segments, then runs minor, major, and full optimizing to merge small files, reduce delete files, and improve query performance, while managing resources through optimizer groups and quotas.
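The fragment/segment split described above can be sketched in a few lines of Python. The default values mirror Amoro’s documented `self-optimizing.target-size` (128 MB) and `self-optimizing.fragment-ratio` (8) table properties; the function name and structure are illustrative, not Amoro’s actual implementation:

```python
# Hedged sketch of Amoro-style file classification (names are illustrative).
# Files smaller than target-size / fragment-ratio are "fragments": small files
# that minor optimizing will merge; larger files are "segments".

TARGET_SIZE = 128 * 1024 * 1024          # self-optimizing.target-size (default 128 MB)
FRAGMENT_RATIO = 8                       # self-optimizing.fragment-ratio (default 8)
FRAGMENT_THRESHOLD = TARGET_SIZE // FRAGMENT_RATIO   # 16 MB with the defaults


def classify(file_size_bytes: int) -> str:
    """Classify one data file by size: merge candidate vs. already well-sized."""
    return "fragment" if file_size_bytes < FRAGMENT_THRESHOLD else "segment"


if __name__ == "__main__":
    sizes = [1 * 1024 * 1024, 12 * 1024 * 1024, 64 * 1024 * 1024]
    for size in sizes:
        print(f"{size // (1024 * 1024)} MB -> {classify(size)}")
```

Under these defaults a 1 MB or 12 MB file is a fragment while a 64 MB file is a segment; the scheduler can then decide per optimizer group how much quota to spend merging the fragment set.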
Three cloud‑native lakehouse case studies are detailed: (1) a migration of a Hive‑based system to AWS using Spark SQL, S3, Alluxio, Iceberg, and Amoro; (2) an external company building a lakehouse on AWS S3 with Iceberg, Glue, EMR, and Amoro; (3) a solution using Amoro AMS as the metadata center with the Iceberg REST Catalog on AWS EKS.
Finally, the article outlines Amoro’s future roadmap, including support for additional lake formats (Paimon, Hudi), dynamic scheduling of self‑optimizing tasks, standard command‑line tools for data access, and a unified permission model integrating with Ranger, AWS, and other cloud providers.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.