One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)
The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern file formats such as Delta Lake.
Author: Peng Zhiwei (alias Kongjing), Alibaba Cloud technical expert.
Background
Data lakes serve as centralized repositories that store structured, semi‑structured, and unstructured data from various sources such as databases, binlog incremental streams, logs, and existing data warehouses. By consolidating these diverse datasets in cost‑effective object storage like OSS, data lakes break data silos and reduce storage and usage costs.
Because of the heterogeneity of data sources, a simple and efficient method is needed to migrate these datasets into a centralized data lake. The required capabilities include:
Unified ingestion method for heterogeneous sources.
Timely ingestion to meet minute‑level latency requirements for real‑time analytics.
Support for real‑time source changes (updates, deletes, schema evolution).
Alibaba Cloud introduced the Data Lake Formation (DLF) service to provide a complete one‑stop ingestion solution.
Overall Solution
The overall ingestion architecture consists of four components: ingestion templates, ingestion engine, file format, and data lake storage.
Ingestion Templates
Templates define common ingestion patterns and currently include five types: RDS full‑load, DTS incremental, TableStore, SLS, and file‑format conversion.
Users select the appropriate template for their source, fill in source parameters, create the template, and submit it to the ingestion engine.
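To make the template workflow concrete, here is a minimal sketch of what a template definition could look like as a plain data structure. The field names and the builder function are illustrative assumptions, not the actual DLF API; only the five template types and the OSS/Delta targets come from the article.

```python
# Hypothetical template builder; field names are assumptions, not the DLF API.
def make_rds_full_load_template(instance_id, database, table, target_path):
    """Build a full-load ingestion template for one RDS table into an OSS path."""
    return {
        "template_type": "RDS_FULL_LOAD",  # one of the five template types
        "source": {
            "instance_id": instance_id,
            "database": database,
            "table": table,
        },
        "sink": {
            "format": "delta",             # Delta Lake files on OSS
            "path": target_path,
        },
    }

template = make_rds_full_load_template(
    "rm-example", "shop", "orders", "oss://my-bucket/lake/orders")
```

The same shape extends naturally to the other four template types by swapping the `template_type` and source parameters.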
Ingestion Engine
The engine leverages Alibaba Cloud EMR's self‑developed Spark Streaming SQL and Spark engines. Streaming SQL, built on Spark Structured Streaming, offers a rich SQL syntax that simplifies real‑time computation. Incremental templates are translated into Streaming SQL and run on a Spark cluster, with an extended MERGE INTO syntax to support update and delete operations. Full‑load templates are translated into standard Spark SQL.
File Formats
DLF supports Delta Lake, Parquet, JSON, and is adding Hudi. Formats like Delta Lake and Hudi provide native support for update, delete, and schema‑merge, addressing real‑time source change requirements.
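The schema-merge capability mentioned above can be illustrated with a small simulation: new columns from incoming data are appended to the table schema, while existing columns must keep a compatible type. This is a sketch of the general idea, not Delta Lake's or Hudi's actual implementation.

```python
def merge_schemas(current, incoming):
    """Simulate additive schema evolution: append new columns,
    reject type conflicts on existing columns.
    Schemas are modeled as {column_name: type_name} dicts."""
    merged = dict(current)
    for col, typ in incoming.items():
        if col in merged and merged[col] != typ:
            raise ValueError(f"type conflict on column {col}: {merged[col]} vs {typ}")
        merged.setdefault(col, typ)
    return merged

# A new "email" column appears in the source; the lake table absorbs it.
merged = merge_schemas(
    {"id": "bigint", "name": "string"},
    {"name": "string", "email": "string"})
```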
Data Lake Storage
All ingested data is stored in OSS object storage, which offers massive capacity, high reliability, and cost efficiency.
One‑Stop Ingestion Benefits
Unified, simple ingestion via template configuration.
Minute‑level latency for real‑time data ingestion.
Support for source data changes through modern file formats.
Real‑Time Ingestion
To meet growing latency demands, DLF now supports real‑time ingestion for DTS, TableStore, and SLS.
DTS Incremental Real‑Time Ingestion
DTS provides reliable data replication for various databases. DLF supports both existing subscription channels and automatic channel creation, reducing configuration effort.
The solution enables minute‑level detection of updates and deletes by extending the MERGE INTO syntax to interact with Delta Lake.
MERGE INTO delta_tbl AS target
USING (
SELECT recordType, pk, ...
FROM {{binlog_parser_subquery}}
) AS source
ON target.pk = source.pk
WHEN MATCHED AND source.recordType='UPDATE' THEN
UPDATE SET *
WHEN MATCHED AND source.recordType='DELETE' THEN
DELETE
WHEN NOT MATCHED THEN
INSERT *

Compared with traditional data warehouses that require separate incremental and full tables, the lake‑based approach simplifies the architecture and improves timeliness.
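The semantics of the three MERGE INTO branches can be sketched with an in-memory simulation: a matched UPDATE record overwrites the row, a matched DELETE removes it, and any unmatched record is inserted. This is only a model of the merge logic, not how Delta Lake executes it.

```python
def apply_binlog(table, records):
    """Replay parsed binlog records against a table keyed by primary key,
    mirroring the MERGE INTO branches:
      matched + UPDATE  -> overwrite the row
      matched + DELETE  -> remove the row
      not matched       -> insert the row"""
    for rec in records:
        pk = rec["pk"]
        if pk in table:
            if rec["recordType"] == "UPDATE":
                table[pk] = rec["row"]
            elif rec["recordType"] == "DELETE":
                del table[pk]
        else:
            table[pk] = rec["row"]  # WHEN NOT MATCHED THEN INSERT *
    return table

state = apply_binlog(
    {1: {"v": "a"}},
    [
        {"pk": 1, "recordType": "UPDATE", "row": {"v": "b"}},
        {"pk": 2, "recordType": "INSERT", "row": {"v": "c"}},
        {"pk": 1, "recordType": "DELETE", "row": None},
    ])
```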
TableStore Real‑Time Ingestion
TableStore is Alibaba Cloud's NoSQL multi‑model database, offering massive structured data storage and fast queries. Its change‑data‑capture channels expose a stream of row changes, on top of which DLF supports full, incremental, and hybrid (full + incremental) ingestion modes.
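The hybrid mode can be sketched as a two-phase process: load the full snapshot first, then replay the incremental change stream on top of it. The `PUT`/`DELETE` operation names below are illustrative assumptions about the change stream, not TableStore's actual channel record format.

```python
def hybrid_ingest(snapshot, changes):
    """Hybrid (full + incremental) ingestion sketch:
    phase 1 materializes the full snapshot keyed by primary key,
    phase 2 replays the change stream over it in order."""
    table = {row["pk"]: dict(row) for row in snapshot}   # full load
    for chg in changes:                                  # incremental replay
        if chg["op"] == "PUT":
            table[chg["pk"]] = chg["row"]
        elif chg["op"] == "DELETE":
            table.pop(chg["pk"], None)
    return table

result = hybrid_ingest(
    [{"pk": 1, "v": "a"}, {"pk": 2, "v": "b"}],
    [{"op": "PUT", "pk": 3, "row": {"pk": 3, "v": "c"}},
     {"op": "DELETE", "pk": 1}])
```

Ordering matters here: replaying changes only after the snapshot is complete is what keeps the final state consistent.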
SLS Log Real‑Time Ingestion
SLS is Alibaba Cloud's log service. Once a simple SLS ingestion template is configured (project, logstore, and so on), logs are streamed continuously into the data lake and become available for immediate analysis.
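A minimal SLS template might look as follows. Only the `project` and `logstore` parameters are named in the article; every other field here is an assumption for illustration.

```python
# Illustrative SLS ingestion template; only "project" and "logstore"
# come from the article, the remaining fields are assumptions.
sls_template = {
    "template_type": "SLS_REALTIME",
    "source": {"project": "my-project", "logstore": "app-logs"},
    "sink": {"format": "delta", "path": "oss://my-bucket/lake/app_logs"},
}
```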
Summary and Outlook
The one‑stop ingestion capability dramatically reduces the cost of bringing heterogeneous sources into a centralized OSS‑backed data lake, satisfies the timeliness requirements of sources like SLS and DTS, and supports real‑time source changes. Future work will expand supported source types, enrich template capabilities (including custom ETL), and continue performance optimizations for better latency and stability.
For more data lake discussions, join the Alibaba Data Lake technical DingTalk group.