Build a Cloud‑Native Lakehouse on AWS with Apache Iceberg and Amoro
This guide explains the cloud‑native lakehouse concept, outlines its advantages and challenges, compares lake‑table projects such as Iceberg, and provides a step‑by‑step AWS deployment of Apache Iceberg and Amoro—including environment setup, AMS installation, catalog configuration, optimizer launch, data ingestion with Flink, and query verification with Spark.
Introduction
The cloud‑native lakehouse combines the elasticity, low‑cost storage, and reduced operational overhead of cloud platforms with the unified analytics capabilities of a data warehouse. It enables a single storage system to handle structured, semi‑structured, and unstructured data while supporting both batch and streaming workloads.
What Is a Lakehouse?
A lakehouse merges data‑lake storage and data‑warehouse processing. Traditional warehouses focus on structured data, but emerging AI, streaming, and machine‑learning workloads demand support for diverse data formats. Data lakes provide cheap, open storage, yet lack ACID guarantees, schema evolution, and efficient updates. Projects such as Apache Iceberg, Apache Hudi, and Delta Lake address these gaps with features like:
ACID support for concurrent reads and writes.
Schema evolution without breaking constraints.
Efficient updates/deletes beyond append‑only writes.
Streaming integration to avoid separate real‑time systems.
OLAP query optimization via file‑skip capabilities.
Data time‑travel for rollback and AI training.
Although lake tables solve many problems, a production‑ready lakehouse still requires metadata management, continuous optimization, and unified access across multiple table formats.
Cloud‑Native Lakehouse
Key cloud characteristics—object storage, compute‑storage separation, elastic scaling, and low‑cost operations—align well with lakehouse requirements. However, challenges arise:
Incompatible storage APIs (POSIX‑style HDFS vs. object‑store key‑value semantics).
Metadata service lock‑in when using cloud‑provider‑specific catalogs.
Multi‑cloud or hybrid‑cloud deployments demand portable metadata and unified cataloging.
Why Apache Iceberg Fits Cloud‑Native Lakehouses
It abstracts storage, allowing direct access to object stores (AWS S3, Alibaba OSS, etc.) without Hadoop dependencies.
Its REST catalog API and support for external catalogs (Hive, AWS Glue) enable cloud‑agnostic metadata management.
Amoro: Cloud‑Native Lakehouse Management
Amoro provides a management service (AMS) and a pluggable optimizer to fill the missing pieces of a lakehouse:
Catalog Service: Implements Iceberg REST catalog, supports multiple catalogs, and can integrate with external metastore services.
Self‑Optimizing: AMS monitors write patterns and automatically triggers compaction, file merging, and delete‑file cleanup.
Elastic Optimizer: Deployable on YARN or Kubernetes, scales with workload peaks.
Resource Isolation: Optimizing groups share or isolate compute resources via table properties.
Scheduling Strategies: Balanced (fair sharing) or quota‑based weighting for different tables.
Unified UI: Web console shows tables, snapshots, transactions, and optimizer history.
Step‑by‑Step Deployment on AWS
1. Environment Preparation
Create an S3 bucket for the data lake.
Set up a VPC with at least two availability zones and public subnets.
Launch an EC2 instance (e.g., m5.large with 100 GB EBS) named amoro-ams for AMS.
Create an EKS cluster (e.g., amoro-k8s) with worker nodes (minimum two m5.large instances).
Generate an AWS access key pair and export it as environment variables:
export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>
2. Install Required Software on the EC2 Host
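Before running the install script, it can be worth confirming that the credentials from step 1 are actually set in the current shell, since the optimizer and Flink session commands later in this guide expand them. The sketch below is one way to do that; missing_env and require_env are hypothetical helper names, not part of any tool used here.

```shell
# Hypothetical helper: print the names (space-separated) of the given
# variables that are unset or empty in the current shell.
missing_env() {
  missing=""
  for name in "$@"; do
    # Indirect variable lookup that works in POSIX sh as well as bash.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing}${missing:+ }${name}"
    fi
  done
  printf '%s\n' "$missing"
}

# Returns 0 when every named variable is set, 1 otherwise.
require_env() {
  missing_list=$(missing_env "$@")
  if [ -n "$missing_list" ]; then
    echo "missing variables: $missing_list" >&2
    return 1
  fi
}

# Example usage:
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY || exit 1
```

The check only looks at whether the variables are non-empty; it does not validate the key pair against AWS.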
#!/bin/bash
# Docker
for pkg in docker.io docker-doc docker-compose podman-docker containerd runc; do sudo apt-get remove $pkg; done
apt-get update
apt-get install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
docker run hello-world
# Java
apt-get install -y openjdk-8-jdk
# Maven
cd /usr/local/share
wget https://dlcdn.apache.org/maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz
tar -zxvf apache-maven-3.8.8-bin.tar.gz
ln -s /usr/local/share/apache-maven-3.8.8/bin/mvn /usr/local/bin/mvn
# Flink
wget https://dlcdn.apache.org/flink/flink-1.16.2/flink-1.16.2-bin-scala_2.12.tgz
tar zxvf flink-1.16.2-bin-scala_2.12.tgz
# AWS CLI
apt-get install -y unzip
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install
# kubectl
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.27.1/2023-04-19/bin/linux/amd64/kubectl
chmod +x ./kubectl
mv ./kubectl /usr/local/bin/kubectl
3. Build and Deploy Amoro AMS
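Before starting the build, it may help to confirm that the tools installed in step 2 are reachable on PATH. The sketch below uses two hypothetical helpers, missing_tools and check_tools. The tool list in the usage comment is an assumption based on the commands this guide runs. It includes git, which the install script does not install but the build step uses.

```shell
# Hypothetical helper: print each command from the argument list that is not
# found on PATH, one per line; prints nothing when all are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Returns 0 when all tools are present, 1 otherwise.
check_tools() {
  absent=$(missing_tools "$@")
  if [ -n "$absent" ]; then
    # $absent is intentionally unquoted so the names print on one line.
    echo "not installed or not on PATH:" $absent >&2
    return 1
  fi
}

# Example usage:
#   check_tools docker java mvn aws kubectl git || exit 1
```

Flink is left out of the list because the install script only unpacks it under /usr/local/share and does not link it onto PATH.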
git clone https://github.com/NetEase/amoro.git
cd amoro
git checkout v0.5.0
mvn clean install -DskipTests -am -e -pl dist -Paws -Poptimizer.flink1.16
After the build, locate the .zip binary in ./dist/target, copy it to the EC2 host, unzip, and set AMORO_HOME:
unzip amoro-0.5.0-bin.zip
export AMORO_HOME=/root/amoro-0.5.0
Edit conf/config.yml (refer to the official docs) and set ams.server-expose-host to the public IP of the EC2 instance. Then start AMS:
${AMORO_HOME}/bin/ams.sh start
Access the AMS UI at http://{amoro-ams-host}:1630 (default credentials: admin / admin).
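AMS may take a little while to come up after ams.sh start returns, so a small retry loop can be used to wait until the UI port answers. The sketch below is a generic helper; retry is a hypothetical name, and the AMS_HOST variable in the usage comment is an assumption standing in for the address configured as ams.server-expose-host. Port 1630 also has to be allowed by the instance's security group for a remote check to succeed.

```shell
# Hypothetical helper: run a command up to MAX times, sleeping DELAY seconds
# between attempts; returns 0 as soon as the command succeeds.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt will follow.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example usage (AMS_HOST is a placeholder for your AMS address):
#   retry 30 2 curl -fsS -o /dev/null "http://${AMS_HOST}:1630/"
```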
4. Create Iceberg Catalog and Optimizer Group via AMS UI
In the AMS console:
Navigate to Optimizing → Optimizer Groups and add a group named default with type external.
Go to Catalogs and create a new catalog with these settings:
Catalog Type: Internal Catalog
Table Format: Iceberg
Warehouse: s3://your-bucket/default
Authentication: SIMPLE (any placeholder value).
The catalog name (e.g., aws_default) will be used later in SQL statements.
5. Launch the Optimizer Job
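The Flink commands in steps 5 and 6 pass kubernetes.jobmanager.service-account=flink, so kubectl needs to point at the EKS cluster and a service account named flink needs to exist with permission to manage pods. The guide does not show that setup. The sketch below is one plausible way to do it, assuming the default namespace and the built-in edit ClusterRole. The run and prepare_flink_on_eks helpers are hypothetical, and the region in the usage comment is a placeholder.

```shell
# Hypothetical helper: print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumptions: the "default" namespace unless a third argument is given, and
# the built-in "edit" ClusterRole as the permission set for the account.
prepare_flink_on_eks() {
  cluster=$1
  region=$2
  namespace=${3:-default}
  # Point kubectl at the EKS cluster.
  run aws eks update-kubeconfig --name "$cluster" --region "$region" || return 1
  # Service account referenced by kubernetes.jobmanager.service-account=flink.
  run kubectl create serviceaccount flink --namespace "$namespace" || return 1
  # Allow that account to create and delete TaskManager pods.
  run kubectl create clusterrolebinding flink-role-binding-flink \
    --clusterrole=edit --serviceaccount="${namespace}:flink"
}

# Example usage (prints the three commands without running them):
#   DRY_RUN=1 prepare_flink_on_eks amoro-k8s us-east-1
```

Running it first with DRY_RUN=1 gives a chance to review the commands before they touch the cluster.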
Pull the pre‑built Docker image:
docker pull arctic163/optimizer-flink1.16:0.5.0-aws
Submit the Flink optimizer to the EKS cluster:
# Flink was unpacked under /usr/local/share in step 2.
FLINK_HOME=/usr/local/share/flink-1.16.2
AMORO_THRIFT_ENDPOINT=thrift://{amoro-ams-host}:1261
OPTIMIZER_CLUSTER_ID=amoro-default-optimizer
${FLINK_HOME}/bin/flink run-application \
--target kubernetes-application \
-Dkubernetes.cluster-id=${OPTIMIZER_CLUSTER_ID} \
-Dkubernetes.jobmanager.service-account=flink \
-Dcontainerized.master.env.AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-Dcontainerized.master.env.AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-Dcontainerized.taskmanager.env.AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-Dcontainerized.taskmanager.env.AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-Dkubernetes.container.image=arctic163/optimizer-flink1.16:0.5.0-aws \
-Dkubernetes.container.image.pull-policy=Always \
-Dkubernetes.jobmanager.cpu=0.5 \
-Dkubernetes.taskmanager.cpu=0.5 \
-c com.netease.arctic.optimizer.flink.FlinkOptimizer \
local:///opt/flink/usrlib/OptimizeJob.jar \
-a ${AMORO_THRIFT_ENDPOINT} -g default -m 1024 -p 2
The optimizer registers itself in AMS; you can view it under Optimizers in the console.
6. Create a Flink Session and Ingest Data
Start a Flink session on the EKS cluster (replace <your‑s3‑bucket>):
# Flink was unpacked under /usr/local/share in step 2.
FLINK_HOME=/usr/local/share/flink-1.16.2
CLUSTER_ID=flink-iceberg-session
BUCKET=<your-s3-bucket>
${FLINK_HOME}/bin/kubernetes-session.sh \
-Dkubernetes.cluster-id=${CLUSTER_ID} \
-Dkubernetes.jobmanager.service-account=flink \
-Dcontainerized.master.env.AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-Dcontainerized.master.env.AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-Dcontainerized.master.env.ENABLE_BUILT_IN_PLUGINS="flink-s3-fs-hadoop-1.16.2.jar" \
-Dcontainerized.taskmanager.env.AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-Dcontainerized.taskmanager.env.AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-Dcontainerized.taskmanager.env.ENABLE_BUILT_IN_PLUGINS="flink-s3-fs-hadoop-1.16.2.jar" \
-Dkubernetes.jobmanager.cpu=0.5 \
-Dkubernetes.taskmanager.cpu=0.5 \
-Dexecution.checkpointing.interval=15s \
-Ds3.access-key=${AWS_ACCESS_KEY_ID} \
-Ds3.secret-key=${AWS_SECRET_ACCESS_KEY} \
-Dstate.checkpoints.dir="s3://${BUCKET}/checkpoints/" \
-Dstate.backend=filesystem \
-Dkubernetes.container.image=arctic163/flink1.16-iceberg-aws:latest
Enter the Flink JobManager container and launch the SQL client:
# The pod name suffix is generated; look up yours with kubectl get pods.
kubectl exec -it flink-iceberg-session-5f6cc679b7-smbnn -- /bin/bash
./bin/sql-client.sh
Execute the following statements (enter line‑by‑line):
SET 'execution.runtime-mode' = 'streaming';
SET 'table.dynamic-table-options.enabled' = 'true';
CREATE TABLE `source` (id INT, price DECIMAL(32,2), buyer STRING, order_time TIMESTAMP) WITH ('connector'='datagen','rows-per-second'='10','fields.id.min'='1','fields.id.max'='200');
CREATE CATALOG `amoro_iceberg` WITH ('type'='iceberg','catalog-impl'='org.apache.iceberg.rest.RESTCatalog','uri'='http://{amoro-ams-host}:1630/api/iceberg/rest','warehouse'='aws_default');
USE CATALOG `amoro_iceberg`;
CREATE DATABASE IF NOT EXISTS `sales`;
CREATE TABLE IF NOT EXISTS `sales`.`orders` (id INT, price DECIMAL(32,2), buyer STRING, order_time TIMESTAMP, PRIMARY KEY (id) NOT ENFORCED) WITH ('format-version'='2','write.upsert.enabled'='true','write.metadata.metrics.default'='full');
INSERT INTO `sales`.`orders` SELECT * FROM `default_catalog`.`default_database`.`source`;
After submission, the Flink session and optimizer pods can be inspected with kubectl get pods.
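The pod listing can be narrowed by label, which also avoids hard-coding a generated JobManager pod name. The sketch below assumes Flink's native Kubernetes integration labels pods with app=<cluster-id> and component=jobmanager or component=taskmanager; verify this on your cluster with kubectl get pods --show-labels. The flink_selector and jobmanager_pod helpers are hypothetical names.

```shell
# Build the label selector for pods of a Flink native-Kubernetes deployment.
# Assumption: pods carry app=<cluster-id> and component=<jobmanager|taskmanager>.
flink_selector() {
  cluster_id=$1
  component=${2:-}
  if [ -n "$component" ]; then
    printf 'app=%s,component=%s\n' "$cluster_id" "$component"
  else
    printf 'app=%s\n' "$cluster_id"
  fi
}

# Resolve the name of the first JobManager pod for a cluster ID.
jobmanager_pod() {
  kubectl get pods -l "$(flink_selector "$1" jobmanager)" \
    -o jsonpath='{.items[0].metadata.name}'
}

# Example usage with the cluster IDs from this guide:
#   kubectl get pods -l "$(flink_selector amoro-default-optimizer)"
#   kubectl exec -it "$(jobmanager_pod flink-iceberg-session)" -- /bin/bash
```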
7. Query the Ingested Data with Spark
In the AMS console, open the embedded Spark terminal, select the aws_default catalog, and run:
set `spark.sql.iceberg.handle-timestamp-without-timezone`=true;
select * from sales.orders order by id limit 10;
The query returns the upserted rows. Because upsert is enabled and the datagen source only emits ids from 1 to 200, the table stays at around 200 rows while the values change over time.
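The row count plateaus because an upsert keyed on id replaces the existing row instead of adding a new one. The sketch below is a toy model of that behavior in awk, not Iceberg itself; upsert_row_count is a hypothetical helper that keeps only the latest price per id and reports how many rows remain.

```shell
# Toy model of upsert semantics: each input line is "<id> <price>", and a
# later line with the same id replaces the earlier one. Prints the number
# of rows that remain after all writes are applied.
upsert_row_count() {
  awk '{ latest[$1] = $2 } END { n = 0; for (id in latest) n++; print n }'
}

# Three writes but only two distinct ids: the second write to id 1 wins.
printf '1 10.00\n2 20.00\n1 30.00\n' | upsert_row_count   # prints 2

# 1000 writes cycling through ids 1..200 still leave exactly 200 rows.
awk 'BEGIN { for (i = 0; i < 1000; i++) print (i % 200) + 1, i }' | upsert_row_count   # prints 200
```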
8. Observe Optimizer Effects
Navigate to the table’s Optimizing tab in AMS to see a history of compaction jobs, input vs. output data sizes, and the frequency (typically every 3‑5 minutes). This demonstrates how small files and delete markers are merged, improving query performance.
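One way to see the same effect outside the console is to count small objects under the table's data prefix before and after a compaction run. The sketch below parses the output of aws s3 ls --recursive, whose lines have the form date, time, size in bytes, key. count_small_files is a hypothetical helper, the 16 MB threshold is an arbitrary example, and the table path in the usage comment is a placeholder because the exact layout under the warehouse location is not stated in this guide.

```shell
# Count objects smaller than a byte threshold in `aws s3 ls --recursive`
# output read from stdin. Lines whose third field is not a number are skipped.
count_small_files() {
  threshold=$1
  awk -v limit="$threshold" \
    '$3 ~ /^[0-9]+$/ && $3 + 0 < limit + 0 { n++ } END { print n + 0 }'
}

# Example usage (bucket and table path are placeholders):
#   aws s3 ls "s3://<your-s3-bucket>/<table-path>/" --recursive \
#     | count_small_files $((16 * 1024 * 1024))
```

A count that drops after each optimizing run would be consistent with the compaction history shown in the console.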
Conclusion
The article introduced the cloud‑native lakehouse concept, highlighted its benefits and challenges, explained the roles of Apache Iceberg and Amoro, and demonstrated an end‑to‑end deployment on AWS, including storage setup, AMS installation, catalog creation, optimizer launch, streaming data ingestion with Flink, and verification with Spark. Future work will cover additional cloud‑native use cases.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
