Big Data 25 min read

Why Real-Time Data Lake Ingestion Is Dropping ETL in the AI Era: Architecture Simplification from Kafka to Iceberg

In the AI‑driven era, enterprises need a data foundation that supports both real‑time consumption and long‑term historical analysis, and the emerging "zero‑ETL" trend moves generic ingestion capabilities from external Flink/Spark jobs into a streamlined Kafka‑to‑Iceberg pipeline, reducing complexity while preserving low latency, consistency, schema evolution, CDC semantics and open‑ecosystem compatibility.

Alibaba Cloud Native
Alibaba Cloud Native
Alibaba Cloud Native
Why Real-Time Data Lake Ingestion Is Dropping ETL in the AI Era: Architecture Simplification from Kafka to Iceberg

Introduction

In the AI‑driven era enterprises need a data foundation that supports both real‑time consumption and long‑term historical analysis. The combination of Kafka, Iceberg (an open table format) and object storage has become a common stack, but traditional pipelines that rely on external Flink or Spark streaming jobs introduce long chains, multiple system boundaries and heavy operational overhead.

Why Architecture Reduction?

Open format first : Iceberg’s open metadata standard, multi‑engine compatibility and strong schema‑evolution capabilities make it the core choice for lakehouse tables.

Zero‑ETL demand : Users increasingly want to eliminate data‑movement steps and let the messaging system persist data directly to an open table format, reducing latency and complexity.

Compute‑storage separation : Both Kafka storage and lake storage are moving toward decoupled compute, lowering cost and improving scalability.

Serverless trend : Serverless stream processing and messaging lower the operational barrier, making pay‑as‑you‑go models attractive for many workloads.

Zero‑ETL Concept

Zero‑ETL does not mean that no processing occurs; it means moving the generic ingestion capabilities—message decoding, schema mapping, offset management, transaction handling, small‑file control and failure recovery—from external jobs into the message‑to‑table path. This reduces three concrete burdens:

Extra data‑movement chain and system boundaries.

Repeated implementation of common ingestion logic across each ETL job.

Increasing platform‑maintenance cost as more external streaming clusters are added.

Kafka × Table Bucket Architecture

The end‑to‑end ingestion path consists of three key stages.

Record conversion : After a message lands in Kafka, a RecordProcessor component deserializes the key/value, applies transforms (e.g., Flatten, Debezium) and assembles a structured record that includes headers, topic, partition, offset and timestamp.

Schema‑aware evolution : Before writing, the system compares the record’s schema with the target table schema. Compatible changes such as adding optional fields, making a required field optional, or type promotion (e.g., int → long, float → double) trigger an in‑place schema refresh, avoiding manual interventions.

Iceberg write & transaction commit : The transformed record is written to object‑storage columnar files. Two write modes are supported:

Append : Directly appends using PartitionedWriter / UnpartitionedWriter.

Upsert : Generates both data files and delete files to support Insert/Update/Delete CDC semantics.

Files are rolled over when they reach a configurable size (e.g., 64 MB) to avoid small‑file explosion. A table‑level transaction atomically creates a new snapshot, providing a consistent view for downstream engines.

Key Capabilities

Low latency & strong consistency : Progress is stored in Kafka leader metadata, eliminating external KV stores and guaranteeing strong consistency.

Embedded & lightweight HA : Compute and storage are decoupled; follower replicas hold minimal metadata, enabling fast failover and elastic scaling.

Dual‑path sync : Incremental pre‑sync continuously reads, transforms and writes data, while a lightweight strong‑sync commits only the incremental changes, improving both latency and throughput.

Embedded vs. standalone deployment : Embedded mode runs within the broker process for minimal overhead; standalone mode isolates resources to avoid GC pressure on the broker.

Progress stored in leader metadata : Guarantees strong consistency without external coordination.

Multi‑catalog compatibility : Supports Iceberg REST Catalog and OSS Tables Catalog, allowing seamless integration with Spark, Trino, Flink, DuckDB and other engines.

Smart partitioning : Supports field partitioning, time‑based (year/month/day/hour), bucket, truncate and multi‑dimensional combinations.

partition_by: "region, day(timestamp)"
partition_by: "bucket(user_id, 10)"
partition_by: "truncate(email, 5)"

Full CDC / Upsert support : Native Debezium integration unwraps CDC events and maps Insert/Update/Delete to Iceberg Equality Delete semantics.

transforms: debezium_unwrap</code>
    <code>write_mode: upsert</code>
    <code>table_format: iceberg-v2

Comparison with Traditional ETL

Traditional path:

Kafka → Flink/Spark Streaming → Iceberg → Object Storage

. The zero‑ETL approach moves conversion and commit logic closer to the broker, eliminating an entire external processing layer, reducing system boundaries, and lowering total cost of ownership.

Typical Beneficial Scenarios

Real‑time log analysis : Logs collected by Filebeat/FluentBit into Kafka are partitioned by day and service name, then appended to Iceberg tables for Trino or Spark queries.

partition_by: "day(timestamp), service_name"</code>
    <code>write_mode: append</code>
    <code>target_file_size: 64MB

Database change capture : Debezium streams MySQL binlog or PostgreSQL WAL to Kafka; the pipeline unwraps CDC events and writes them in upsert mode, preserving primary‑key semantics and providing a near‑real‑time view of the business state.

transforms: debezium_unwrap</code>
    <code>write_mode: upsert

IoT multi‑source aggregation : High‑throughput device data is bucketed by device ID and day, enabling efficient pruning and long‑term storage.

partition_by: "bucket(device_id, 50), day(timestamp)"

AI multimodal training pipelines : Images, videos, point clouds and their metadata are all persisted in OSS Table Buckets, while vector indexes reside in OSS Vector Buckets. The unified storage simplifies data preparation and version back‑tracking for model training.

Conclusion

Zero‑ETL is not merely a buzzword; it is an architectural reduction that concentrates high‑frequency, generic ingestion logic into the platform itself, lowering operational complexity, reducing TCO, and delivering stable, low‑latency data pipelines that remain compatible with open‑format ecosystems.

Implementation note The Kafka × Table Bucket zero‑ETL capability described here is already available on Alibaba Cloud ApsaraMQ for Kafka × OSS Tables. Users can persist Kafka table topics as Iceberg tables in OSS Table Buckets and access them via Iceberg REST Catalog with Spark, Trino, Flink, DuckDB, etc. The solution includes multi‑layer small‑file governance, schema‑adaptive evolution, full CDC support and multi‑catalog compatibility.
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

StreamingKafkaData LakeIcebergZero ETLTable Bucket
Alibaba Cloud Native
Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.