Amoro Lakehouse Management System: Deployment Practices and AWS Integration for Apache Iceberg
This article introduces Amoro, a lakehouse management platform built on Apache Iceberg, explains why Webex adopted it to overcome Hive limitations, details its AWS GlueCatalog and S3 integration with DynamoDB lock management, and provides step‑by‑step Helm‑based deployment instructions on Kubernetes.
Amoro is a lakehouse management system built on open‑source table formats such as Apache Iceberg, offering plug‑in data self‑optimization mechanisms and management services for an out‑of‑the‑box lakehouse experience.
Author: Bai Xu, Software Engineer at Cisco Webex Data Platform, responsible for lakehouse‑integrated development and optimization.
Why choose Amoro
Webex originally used Hive for storage, but Hive's table format made data correction and backfill inefficient and imposed high maintenance overhead. Migrating to Apache Iceberg reduced operational costs and improved core business efficiency. However, Iceberg V2's row‑level updates rely on Merge‑on‑Read (MOR), and once many delete files accumulated, query latency became unacceptable.
Initial attempts to merge small files using Spark compaction procedures resulted in high resource consumption (over 40 cores and 300 GB memory per job), long execution times, low fault tolerance, and difficult maintenance when a single table failure halted the entire job.
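For reference, Spark-based compaction of this kind is typically driven by Iceberg's `rewrite_data_files` stored procedure; a minimal sketch (catalog, table name, and target size are illustrative):

```sql
-- Rewrite small data files into larger ones; heavy jobs like this
-- motivated the move to Amoro's continuous optimization.
CALL glue_catalog.system.rewrite_data_files(
  table => 'pda.orders',
  options => map('target-file-size-bytes', '134217728')  -- 128 MB target
);
```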
Amoro addresses these pain points by registering an external Flink optimizer, pulling optimization tasks from the Amoro Management Service (AMS), and enabling snapshot expiration and data expiration to reduce storage pressure.
Benefits of Amoro
Higher resource utilization: Flink optimizer reduces resource usage by about 70% compared to Spark.
Improved fault tolerance: Failed optimization tasks are automatically retried on the next scan.
Timeliness: Continuous compaction keeps Iceberg query performance within a controllable range.
Self‑management: Optimization can be toggled per table via table properties.
Visualization: WebUI displays optimization status and table metadata.
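The per‑table toggle mentioned above is driven by table properties; a hedged sketch using Amoro's `self-optimizing.*` properties through Spark SQL (the group name is illustrative):

```sql
-- Enable continuous self-optimizing for one table and route it
-- to a named optimizer group.
ALTER TABLE pda.orders SET TBLPROPERTIES (
  'self-optimizing.enabled' = 'true',
  'self-optimizing.group' = 'aws-flink-group'
);
```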
Usage in Webex
Amoro has been deployed across multiple data centers and clusters (up to seven data centers), in both Hadoop and AWS environments, managing over 1,000 Iceberg tables.
Amoro on AWS
Key challenges include integrating Iceberg with AWS services (Catalog and FileSystem) and adapting AMS. The migration switched from HiveCatalog to GlueCatalog and from HDFS to S3, leveraging S3’s fine‑grained IAM permissions and eliminating hardware maintenance costs.
GlueCatalog reduces the need for a separate Hive Metastore service and MySQL metadata storage, avoiding issues such as MySQL connection limits.
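On the engine side, switching from HiveCatalog to GlueCatalog is mostly configuration; a sketch of the Spark catalog properties using Iceberg's AWS module (catalog name and bucket are illustrative):

```properties
spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse=s3://wap-bucket/warehouse
spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```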
LockManager
Because S3 does not provide atomic writes or file locks, Iceberg uses DynamoDbLockManager to ensure metadata consistency. The commit workflow is: attempt to acquire the lock, retry on contention, write the new metadata file (e.g. v2.metadata.json), and release the lock after a successful commit.
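The acquire‑retry‑commit‑release cycle can be modeled with a toy in‑memory lock manager. This is a sketch of the semantics only, not the real `DynamoDbLockManager` API; in production, DynamoDB enforces the conditional write server‑side:

```python
import time
import uuid

class InMemoryLockManager:
    """Toy model of DynamoDB-style lease locking (illustrative, not Iceberg's API)."""

    def __init__(self, lease_ms=15000):
        self.lease_ms = lease_ms
        self.locks = {}  # entity_id -> (owner_id, version, expires_at_ms)

    def try_acquire(self, entity_id, owner_id):
        # Conditional put: succeed only if no unexpired lock entry exists.
        now_ms = time.monotonic() * 1000
        entry = self.locks.get(entity_id)
        if entry is not None and entry[2] > now_ms:
            return False  # another writer holds an unexpired lease
        self.locks[entity_id] = (owner_id, str(uuid.uuid4()), now_ms + self.lease_ms)
        return True

    def release(self, entity_id, owner_id):
        # Only the current owner may delete its lock entry.
        entry = self.locks.get(entity_id)
        if entry is not None and entry[0] == owner_id:
            del self.locks[entity_id]
            return True
        return False

def commit_with_lock(mgr, table, owner, write_metadata, retries=3, backoff_s=0.01):
    """Acquire the lock, write the new metadata file, release; retry on contention."""
    for _ in range(retries):
        if mgr.try_acquire(table, owner):
            try:
                write_metadata()
            finally:
                mgr.release(table, owner)
            return True
        time.sleep(backoff_s)
    return False
```

The key property is that two concurrent committers can never both pass the conditional write, so only one metadata file becomes the table's current version.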
The DynamoDB lock table stores entries such as (the primary key doubles as the lock entity ID):

| Primary Key (Lock Entity ID) | Lease Duration (ms) | Version | Lock Owner ID |
| --- | --- | --- | --- |
| pda.orders | 15000 | d3b9b4ec-6c02-4e7e-9570-927ba1bafa67 | s3://wap-bucket/orders/metadata/d3b9b4ec-6c02-4e7e-9570-927ba1bafa67-metadata.json |
| pda.customers | 15000 | 0f50e24d-e7da-4c8b-aa4b-1b95a50c7f38 | s3://wap-bucket/customers/metadata/0f50e24d-e7da-4c8b-aa4b-1b95a50c7f38-metadata.json |
| pda.products | 15000 | 2dab53a2-7c63-4b95-8fe1-567f73e58d6c | s3://wap-bucket/products/metadata/2dab53a2-7c63-4b95-8fe1-567f73e58d6c-metadata.json |
Using DynamoDB for lock management avoids stale locks that can block Spark jobs in Hive Metastore.
Permission Control
AWS IAM allows fine‑grained permissions on S3, Glue, and DynamoDB. Each team receives a dedicated IAM account, and Kubernetes namespaces isolate the IAM credentials, enabling table‑level access control.
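As an illustration, a per‑team policy might scope S3 access to that team's table prefixes. The bucket and prefix below are taken from the lock‑table examples and are illustrative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::wap-bucket/orders/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::wap-bucket",
      "Condition": { "StringLike": { "s3:prefix": ["orders/*"] } }
    }
  ]
}
```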
S3 Intelligent‑Tiering
Setting the Iceberg storage‑class to S3 Intelligent‑Tiering automatically moves objects between frequent, infrequent, and archive access tiers, reducing storage costs by up to 68%.
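Assuming an Iceberg release where S3FileIO exposes the `s3.write.storage-class` property, the tier can be set as a catalog property so that newly written objects land in Intelligent‑Tiering; a sketch:

```properties
# Write new data files with the Intelligent-Tiering storage class
s3.write.storage-class=INTELLIGENT_TIERING
```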
AMS AWS Adaptations
AMS was adapted to run on AWS by using a custom catalog that creates a GlueCatalog and refactoring Arctic’s FileIO to support object storage. Future versions will expose GlueCatalog as a distinct catalog type with IAM configuration.
Credential Management
Credentials are supplied as environment variables in the Kubernetes pods, where the AWS SDK's DefaultAWSCredentialsProviderChain picks them up. Example snippet:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: ams
  name: ams
spec:
  replicas: {{ .Values.replicas }}
  template:
    spec:
      containers:
        - env:
            - name: AWS_ACCESS_KEY_ID
              value: AKIXXXXXXXXXXXXXXXX
            - name: AWS_SECRET_ACCESS_KEY
              value: fjHyrM1wTJ8CLP13+GU1bCGG1RGlL1xT1lXyBb11
          image: {{ include "udp.amoro.image.fullname" . }}
```

IAM Roles for Service Accounts (IRSA) can replace static keys, providing secure, token‑based authentication.
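With IRSA, the static keys disappear: the pod's ServiceAccount is annotated with a role ARN, and the AWS SDK exchanges the projected token for temporary credentials. A sketch (account ID and role name are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ams
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/amoro-ams-role
```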
Deployment Practice
The article demonstrates deploying Amoro 0.6.0 using Helm charts on Kubernetes. The process starts with building and pushing the Docker image:

```shell
mvn clean install -DskipTests -am -e -pl dist
docker build docker/ams/ --platform amd64 -t xxx/amoro && docker push xxx/amoro
```

Helm templates define helpers, pod mounts, volumes, Deployments, Services, ServiceAccounts, Secrets, Ingress, and PodMonitors. Example helper definition:
```yaml
{{- define "udp.amoro.image.fullname" -}}
{{ .Values.image.repository }}/{{ .Values.image.component }}:{{ .Values.image.tag | default .Chart.AppVersion }}
{{- end -}}
```

Deploy with:
```shell
helm upgrade --install amoro ./ --namespace amoro
```

Additional configuration includes registering the GlueCatalog and setting warehouse, lock-impl, lock.table, and client.credentials-provider for IRSA.
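Putting the pieces together, registering the Glue‑backed catalog involves properties along these lines. The class names are from Iceberg's AWS module; the bucket, lock table name, and credentials provider choice are illustrative:

```properties
catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
io-impl=org.apache.iceberg.aws.s3.S3FileIO
warehouse=s3://wap-bucket/warehouse
lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
lock.table=amoro-iceberg-locks
client.credentials-provider=software.amazon.awssdk.auth.credentials.WebIdentityTokenFileCredentialsProvider
```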
Future Plans
Incremental SORT/ZORDER for data skipping and clustering.
Enhanced monitoring and alerting for table health and optimization latency.
A Kubernetes‑native optimizer to replace the external Flink optimizer, improving resource elasticity.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.