One-Click Real-Time Stream Ingestion: Alibaba Cloud Kafka’s Native Data Lake Integration
Alibaba Cloud Message Queue for Kafka introduces a native message‑to‑lake capability that integrates Apache Iceberg with OSS Table Bucket. It removes the need for Spark, Flink, or Kafka Connect jobs and provides exactly‑once semantics, automatic schema management, dual write modes, smart partitioning, and up to ten‑fold performance gains across a range of real‑time analytics scenarios.
Alibaba Cloud Message Queue for Kafka has launched a native message‑to‑lake capability that tightly integrates Apache Iceberg with Alibaba Cloud Object Storage Service (OSS) Table Bucket. Users can write Kafka topics to a data lake with a single click and run an end‑to‑end workflow that combines real‑time streaming with a lakehouse.
In modern cloud‑native architectures, the traditional Kafka → data‑lake pipeline suffers from “ETL explosion”: each topic often requires a dedicated Spark or Flink job, manual schema synchronization, and heavy operational overhead for small‑file merging, snapshot cleanup, and partition optimization.
The new capability solves these problems with a zero‑ETL design. A built‑in worker component embedded in the Kafka broker automatically converts incoming records to Parquet, writes them to an OSS Table Bucket, and commits metadata, all without any external compute cluster. The process guarantees exactly‑once semantics, ensuring no duplicate or lost data during transfer.
Core Capability 1: Zero‑ETL Architecture with Exactly‑Once Guarantees
Data conversion and writes happen inside the broker layer, removing the need for separate Spark/Flink jobs. The worker writes Parquet files directly to OSS Table Bucket while preserving exactly‑once delivery semantics.
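One common way for a sink to achieve exactly‑once delivery into a table format is to commit the source offset together with the data, so that records redelivered after a crash can be recognized and skipped. The article does not describe the product's internal mechanism; the Python sketch below is a toy in‑memory model for a single partition, and `TableSink` and `write_batch` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TableSink:
    """Toy stand-in for a table whose commits carry the last Kafka offset (hypothetical)."""
    rows: list = field(default_factory=list)
    committed_offset: int = -1  # highest offset included in the latest commit

    def write_batch(self, batch):
        """Commit (offset, value) pairs, skipping offsets that were already committed."""
        fresh = [(off, val) for off, val in batch if off > self.committed_offset]
        if not fresh:
            return 0
        # Data and offset are updated together, modelling one atomic commit.
        self.rows.extend(val for _, val in fresh)
        self.committed_offset = max(off for off, _ in fresh)
        return len(fresh)


sink = TableSink()
sink.write_batch([(0, "a"), (1, "b")])
# Simulate a retry after a crash: old records are redelivered along with one new record.
sink.write_batch([(0, "a"), (1, "b"), (2, "c")])
print(sink.rows)  # ['a', 'b', 'c']
```

Because the offset travels with the commit, a replay neither duplicates rows nor loses the new one.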
Core Capability 2: Automatic Schema Management
The solution integrates deeply with Kafka Schema Registry, centralizing schema definitions that were previously scattered across producer code, Flink jobs, and Iceberg tables. When a producer sends data, the system validates it against the registered schema. When the schema changes, the system picks up the new version automatically and evolves the Iceberg table without interrupting downstream workloads.
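The evolution step can be pictured as a diff between two registered schema versions that yields the table changes to apply. The following Python sketch is an assumption about how such a diff might look, not the product's logic: `plan_evolution` is a hypothetical helper, schemas are simplified to name‑to‑type mappings, and only additive changes are accepted.

```python
def plan_evolution(old_fields, new_fields):
    """Return the columns to add when moving from old_fields to new_fields.

    Both arguments map column name -> type name. Only additive changes are
    planned; removed or retyped columns raise, as a conservative assumption.
    """
    for name, old_type in old_fields.items():
        if name not in new_fields:
            raise ValueError(f"column removed: {name}")
        if new_fields[name] != old_type:
            raise ValueError(f"type changed for {name}: {old_type} -> {new_fields[name]}")
    return [(name, typ) for name, typ in new_fields.items() if name not in old_fields]


v1 = {"order_id": "long", "amount": "double"}
v2 = {"order_id": "long", "amount": "double", "coupon": "string"}
print(plan_evolution(v1, v2))  # [('coupon', 'string')]
```

Adding a nullable column is the kind of change that existing readers can tolerate, which is why the sketch limits itself to that case.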
Core Capability 3: Dual Write Modes (Append & CDC)
Append mode: suited to log or event streams. Data is appended to Iceberg tables with high‑throughput batch commits.
CDC mode: designed for change‑data‑capture scenarios. Users specify a key field and the system performs upserts using Iceberg’s incremental file mechanism, eliminating the need for custom Flink upsert logic.
Both modes support flexible transformation strategies, including raw archiving and structured flatten/Debezium formats.
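The difference between the two modes comes down to what happens when a key is seen again: append mode keeps every event, while CDC mode folds events into the latest row per key. The Python sketch below models only the CDC semantics on an in‑memory dict. The `op` codes are loosely modelled on Debezium's `c`/`u`/`d` values, and the flat `row` field and the `apply_changes` helper are simplifications assumed for illustration.

```python
def apply_changes(table, changes, key):
    """Apply change events to `table`, a dict keyed by the chosen key field.

    Each event is a dict with an "op" of "c" (create), "u" (update) or
    "d" (delete) and a "row" payload holding at least the key field.
    """
    for event in changes:
        row = event["row"]
        if event["op"] == "d":
            table.pop(row[key], None)
        else:
            table[row[key]] = row  # insert or overwrite: an upsert
    return table


changes = [
    {"op": "c", "row": {"id": 1, "status": "new"}},
    {"op": "c", "row": {"id": 2, "status": "new"}},
    {"op": "u", "row": {"id": 1, "status": "paid"}},
    {"op": "d", "row": {"id": 2}},
]
state = apply_changes({}, changes, key="id")
print(state)  # {1: {'id': 1, 'status': 'paid'}}
```

In append mode the same four events would simply produce four rows.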
Core Capability 4: Smart Partitioning & Automated Operations
Users can define multi‑column partition strategies (e.g., date‑based) for Iceberg tables; query engines automatically prune partitions, dramatically reducing full‑table scans. OSS Table Bucket handles small‑file merging, expired snapshot cleanup, and orphan file removal in the background, freeing operational teams from manual lake‑maintenance tasks.
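Partition pruning means a filter on the partition columns lets the engine skip every partition that cannot match. The Python sketch below is a toy model of that idea, assuming a two‑column strategy of model plus day; `write_partitioned` and `scan` are hypothetical helpers, not part of any Iceberg API.

```python
from collections import defaultdict
from datetime import date


def write_partitioned(records):
    """Group records into partitions keyed by (model, day)."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[(rec["model"], rec["day"])].append(rec)
    return partitions


def scan(partitions, model, day):
    """Read only the partition matching the filter; return rows and partitions touched."""
    touched = [k for k in partitions if k == (model, day)]
    rows = [r for k in touched for r in partitions[k]]
    return rows, len(touched)


records = [
    {"model": "X1", "day": date(2024, 5, 1), "speed": 62},
    {"model": "X1", "day": date(2024, 5, 2), "speed": 70},
    {"model": "Z9", "day": date(2024, 5, 1), "speed": 55},
]
parts = write_partitioned(records)
rows, touched = scan(parts, "X1", date(2024, 5, 2))
print(len(parts), touched, rows[0]["speed"])  # 3 1 70
```

Only one of the three partitions is read, which is the effect that avoids full‑table scans at scale.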
Underlying Storage
OSS Table Bucket provides native Apache Iceberg table semantics and is fully compatible with the Iceberg REST API and S3 Table API. Benchmarks show its TPS can be more than ten times that of a self‑built Iceberg deployment on a generic bucket, and it integrates with Spark, Flink, Trino, and StarRocks for direct read/write access.
Typical Scenarios
1. E‑commerce flash‑sale analytics: One‑click lake ingestion eliminates all Flink jobs, upgrades dashboard refresh from hourly to minute‑level, and delivers >10× query performance.
2. Financial core‑system CDC: MySQL binlog changes stream directly into Iceberg via CDC mode, achieving exactly‑once delivery and automatic schema evolution, and reducing data freshness latency from hours to minutes.
3. IoT sensor data archiving: Billions of vehicle sensor records are written as Parquet partitions (model + date) to OSS Table Bucket, with automated file merging and snapshot cleanup, enabling faster model training cycles.
4. AI multimodal training pipelines: Unified storage of objects, vectors, and tables allows raw sensor data, embeddings, and annotations to reside under a single permission and audit framework, accelerating data preparation severalfold.
5. Log audit & security tracing: Security logs are ingested directly into OSS Table Bucket, which cuts storage costs and uses Iceberg’s time travel to run minute‑level queries across years of petabyte‑scale logs.
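The time travel mentioned in scenario 5 reads a table as it stood at an earlier point by selecting the latest snapshot committed at or before the requested time. The Python sketch below illustrates that selection rule on a plain list; `snapshot_as_of` and the snapshot ids are hypothetical, and real engines expose this through their own query syntax.

```python
from bisect import bisect_right


def snapshot_as_of(snapshots, as_of_ms):
    """Return the id of the latest snapshot committed at or before as_of_ms.

    `snapshots` is a list of (timestamp_ms, snapshot_id) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in snapshots]
    idx = bisect_right(timestamps, as_of_ms)
    if idx == 0:
        raise LookupError("no snapshot exists at or before the requested time")
    return snapshots[idx - 1][1]


history = [(1000, "s1"), (2000, "s2"), (3000, "s3")]
print(snapshot_as_of(history, 2500))  # s2
```

Background cleanup of expired snapshots limits how far back such a lookup can reach, so retention settings matter for audit use cases.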
Overall, the native ingestion capability embodies the broader industry shift toward decoupled, open data infrastructures, turning Kafka from a pure messaging pipeline into a dual‑mode entry point that supports both real‑time consumption and direct analytical access. Alibaba Cloud plans to further deepen the integration of Kafka, OSS, and streaming engines to provide a one‑stop data platform for AI‑driven enterprises.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
