Big Data 27 min read

How AI-Driven Real-Time Data Lakes Are Ditching ETL: A Kafka‑to‑Iceberg Architecture Simplification

In the AI era, enterprises need a data foundation that supports both low‑latency streaming and long‑term analytics, and the combination of Kafka, Iceberg and object storage is emerging as a preferred solution; by moving ingestion capabilities closer to the message layer and eliminating external ETL jobs, a "zero‑ETL" approach reduces architectural complexity, improves consistency, and streamlines schema evolution and small‑file management.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How AI-Driven Real-Time Data Lakes Are Ditching ETL: A Kafka‑to‑Iceberg Architecture Simplification

Why Architecture Reduction Is Needed in the AI Era

AI‑driven applications require data that can be consumed in real time for model training, feature engineering, and online inference while also being retained for historical analysis. Traditional pipelines that insert an external Flink or Spark streaming job between Kafka and the data lake add latency, increase system boundaries, and complicate operations.

Key Trends in Real‑Time Ingestion

Open format first : Iceberg’s open metadata and schema‑evolution capabilities make it the dominant table format for lakehouse architectures.

Zero‑ETL demand : Users increasingly want the message system to persist data directly as open‑format tables, cutting down on data‑movement steps.

Deepening storage‑compute separation : Both the messaging layer and the storage layer are moving toward a decoupled design to lower costs.

Serverless adoption : Managed, pay‑per‑use services reduce operational overhead for many workloads.

Three Kafka Ingestion Camps

Native integration (e.g., Confluent Tableflow, Redpanda Iceberg Topics) – simple, low‑latency, but risk vendor lock‑in.

Connector / ETL (e.g., Flink, Spark, Debezium) – flexible and ecosystem‑rich, but adds extra clusters and higher latency.

Ecosystem platforms (e.g., Databricks, Snowflake) – deep engine integration, but often tied to a single format or engine.

What Is Actually Reduced

An extra data‑movement chain.

Repeated implementation of generic ingestion logic (schema mapping, offset handling, transaction commit, small‑file control).

Unrelated operational complexity.

Zero‑ETL Kafka × Table Bucket Architecture

The core idea is to move the generic ingestion capabilities as close to Kafka as possible, creating a three‑layer pipeline:

Protocol layer : Directly compatible with the Kafka Producer/Consumer protocol.

Conversion layer : Handles format conversion, schema awareness, CDC unpacking, and partition routing.

Table storage layer : Provides table writes, metadata management, and background optimizations on object storage.

The end‑to‑end flow can be expressed as:

Kafka → RecordProcessor → IcebergWriter → ObjectStorage

Three Processing Stages

Record conversion : Deserializes key/value, applies transforms (e.g., Flatten, Debezium), and assembles a structured record with metadata.

Schema‑aware evolution : Detects compatible changes such as ADD_FIELD, MAKE_OPTIONAL, or type promotion ( int → long, float → double) and applies them automatically.

Iceberg write & transaction : Writes data in Append or Upsert mode, creates DeleteFiles for CDC deletes, and commits a snapshot once the target file size (e.g., 64 MB) is reached.

Key Capabilities

Low latency & strong consistency : Progress is stored in Kafka leader metadata, eliminating external KV stores and enabling fast failover.

Multi‑layer small‑file governance : In‑memory buffer merge (L1), 32 MB micro‑batches (L2), 64 MB target file size (L3), and asynchronous background compaction (L4).

Embedded HA : Light‑weight follower replicas act as hot‑standby, allowing rapid takeover during scaling or failure.

Schema‑adaptive evolution : Automatic handling of ADD_FIELD, MAKE_OPTIONAL, PROMOTE_TYPE, and nested recursive changes.

Full CDC / Upsert support : Debezium unwrap transforms map Insert/Update/Delete events to Iceberg’s Equality‑Delete mechanism.

Smart partitioning : Supports seven partition types (field, year/month/day/hour, bucket, truncate) and multi‑dimensional combos. Example configurations:

# Partition by region and day
partition_by: "region, day(timestamp)"
# High‑cardinality bucket
partition_by: "bucket(user_id, 10)"
# Email prefix truncate
partition_by: "truncate(email, 5)"

Multi‑catalog compatibility : Works with Iceberg REST Catalog and OSS Tables catalog, allowing Spark, Trino, Flink, DuckDB and other engines to query the same tables.

Comparison with Traditional Approaches

Architecture complexity : Traditional Kafka → Flink/Spark → Iceberg is a three‑stage, independently scheduled pipeline (high); Connector‑only solutions are medium; Kafka × Table Bucket is a single, in‑process chain (low).

Development cost : Writing streaming SQL or code for Flink/Spark is high; configuring a connector sink is medium; declarative configuration for zero‑ETL is low.

Schema evolution : Requires manual effort or external frameworks in traditional setups; partially supported in connector solutions; fully automated in zero‑ETL.

Small‑file handling : Handled by the application in traditional pipelines; partially supported in connectors; built‑in multi‑layer governance in zero‑ETL.

Exactly‑once guarantees : Relies on business logic in traditional pipelines; complex configuration in connectors; achieved via progress‑in‑leader metadata and atomic snapshots in zero‑ETL.

Operational cost : High for external clusters; medium for connector clusters; low for a single integrated service.

Scenarios That Benefit Most

Real‑time log analysis : Continuous logs (application, access, security) written daily or by service name can be queried directly with Trino or Spark.

Database change capture : Debezium‑produced CDC events are upserted into Iceberg tables, providing an up‑to‑date view of the source database.

IoT multi‑source aggregation : High‑throughput device streams with frequent schema changes are bucket‑partitioned and stored for both real‑time and long‑term analytics.

AI multimodal training pipelines : Images, video, point‑clouds and their metadata are persisted in OSS object buckets, vector buckets, and Table Buckets, enabling unified versioning and retrieval.

Conclusion

The "zero‑ETL" concept is not about eliminating transformation but about moving repetitive, generic ingestion logic—schema handling, CDC mapping, small‑file control—into the platform itself. By shortening the data path, reducing system boundaries, and providing built‑in governance, Kafka × Table Bucket delivers lower latency, stronger consistency, and a lower total cost of ownership while keeping the data lake open and interoperable.

Implementation Note The described capabilities are already available in Alibaba Cloud ApsaraMQ for Kafka × OSS Tables, with an open beta for users to try without adding extra ETL jobs. The solution supports Iceberg REST Catalog, multi‑catalog integration, and the full set of schema‑evolution, CDC, and small‑file governance features discussed above.
Kafka to Iceberg architecture
Kafka to Iceberg architecture
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

StreamingKafkaData LakeSchema EvolutionIcebergCDCZero ETL
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.