One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)
The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern file formats such as Delta Lake.
Author: Peng Zhiwei (alias Kongjing), Alibaba Cloud technical expert.
Background
Data lakes serve as centralized repositories that store structured, semi‑structured, and unstructured data from various sources such as databases, binlog incremental streams, logs, and existing data warehouses. By consolidating these diverse datasets in cost‑effective object storage like OSS, data lakes break data silos and reduce storage and usage costs.
Because of the heterogeneity of data sources, a simple and efficient method is needed to migrate these datasets into a centralized data lake. The required capabilities include:
Unified ingestion method for heterogeneous sources.
Timely ingestion to meet minute‑level latency requirements for real‑time analytics.
Support for real‑time source changes (updates, deletes, schema evolution).
Alibaba Cloud introduced the Data Lake Formation (DLF) service to provide a complete one‑stop ingestion solution.
Overall Solution
The overall ingestion architecture consists of four components: ingestion templates, ingestion engine, file format, and data lake storage.
Ingestion Templates
Templates define common ingestion patterns and currently include five types: RDS full‑load, DTS incremental, TableStore, SLS, and file‑format conversion.
Users select the appropriate template for their source, fill in source parameters, create the template, and submit it to the ingestion engine.
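To make the template workflow concrete, here is a minimal sketch of what a template definition could look like as a plain data structure. The field names and the builder function are illustrative assumptions, not the actual DLF API; only the five template types and the OSS/Delta targets come from the article.

```python
# Hypothetical template builder; field names are assumptions, not the DLF API.
def make_rds_full_load_template(instance_id, database, table, target_path):
    """Build a full-load ingestion template for one RDS table into an OSS path."""
    return {
        "template_type": "RDS_FULL_LOAD",  # one of the five template types
        "source": {
            "instance_id": instance_id,
            "database": database,
            "table": table,
        },
        "sink": {
            "format": "delta",             # Delta Lake files on OSS
            "path": target_path,
        },
    }

template = make_rds_full_load_template(
    "rm-example", "shop", "orders", "oss://my-bucket/lake/orders")
```

The same shape extends naturally to the other four template types by swapping the `template_type` and source parameters.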
Ingestion Engine
The engine leverages Alibaba Cloud EMR's self‑developed Spark Streaming SQL and Spark engines. Streaming SQL, built on Spark Structured Streaming, offers a rich SQL syntax that simplifies real‑time computation. Incremental templates are translated into Streaming SQL and run on a Spark cluster, with an extended MERGE INTO syntax to support update and delete operations. Full‑load templates are translated into standard Spark SQL.
File Formats
DLF supports Delta Lake, Parquet, JSON, and is adding Hudi. Formats like Delta Lake and Hudi provide native support for update, delete, and schema‑merge, addressing real‑time source change requirements.
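The schema-merge capability mentioned above can be illustrated with a small simulation: new columns from incoming data are appended to the table schema, while existing columns must keep a compatible type. This is a sketch of the general idea, not Delta Lake's or Hudi's actual implementation.

```python
def merge_schemas(current, incoming):
    """Simulate additive schema evolution: append new columns,
    reject type conflicts on existing columns.
    Schemas are modeled as {column_name: type_name} dicts."""
    merged = dict(current)
    for col, typ in incoming.items():
        if col in merged and merged[col] != typ:
            raise ValueError(f"type conflict on column {col}: {merged[col]} vs {typ}")
        merged.setdefault(col, typ)
    return merged

# A new "email" column appears in the source; the lake table absorbs it.
merged = merge_schemas(
    {"id": "bigint", "name": "string"},
    {"name": "string", "email": "string"})
```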
Data Lake Storage
All ingested data is stored in OSS object storage, which offers massive capacity, high reliability, and cost efficiency.
One‑Stop Ingestion Benefits
Unified, simple ingestion via template configuration.
Minute‑level latency for real‑time data ingestion.
Support for source data changes through modern file formats.
Real‑Time Ingestion
To meet growing latency demands, DLF now supports real‑time ingestion for DTS, TableStore, and SLS.
DTS Incremental Real‑Time Ingestion
DTS provides reliable data replication for various databases. DLF supports both existing subscription channels and automatic channel creation, reducing configuration effort.
The solution enables minute‑level detection of updates and deletes by extending the MERGE INTO syntax to interact with Delta Lake.
MERGE INTO delta_tbl AS target
USING (
SELECT recordType, pk, ...
FROM {{binlog_parser_subquery}}
) AS source
ON target.pk = source.pk
WHEN MATCHED AND source.recordType='UPDATE' THEN
UPDATE SET *
WHEN MATCHED AND source.recordType='DELETE' THEN
DELETE
WHEN NOT MATCHED THEN
INSERT *

Compared with traditional data warehouses that require separate incremental and full tables, the lake‑based approach simplifies the architecture and improves timeliness.
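The semantics of the three MERGE INTO branches can be sketched with an in-memory simulation: a matched UPDATE record overwrites the row, a matched DELETE removes it, and any unmatched record is inserted. This is only a model of the merge logic, not how Delta Lake executes it.

```python
def apply_binlog(table, records):
    """Replay parsed binlog records against a table keyed by primary key,
    mirroring the MERGE INTO branches:
      matched + UPDATE  -> overwrite the row
      matched + DELETE  -> remove the row
      not matched       -> insert the row"""
    for rec in records:
        pk = rec["pk"]
        if pk in table:
            if rec["recordType"] == "UPDATE":
                table[pk] = rec["row"]
            elif rec["recordType"] == "DELETE":
                del table[pk]
        else:
            table[pk] = rec["row"]  # WHEN NOT MATCHED THEN INSERT *
    return table

state = apply_binlog(
    {1: {"v": "a"}},
    [
        {"pk": 1, "recordType": "UPDATE", "row": {"v": "b"}},
        {"pk": 2, "recordType": "INSERT", "row": {"v": "c"}},
        {"pk": 1, "recordType": "DELETE", "row": None},
    ])
```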
TableStore Real‑Time Ingestion
TableStore is Alibaba Cloud's NoSQL multi‑model database, offering massive structured data storage and fast queries. Its change‑data‑capture channels expose a stream of row changes, on top of which DLF supports full, incremental, and hybrid (full + incremental) ingestion modes.
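The hybrid mode can be sketched as a two-phase process: load the full snapshot first, then replay the incremental change stream on top of it. The `PUT`/`DELETE` operation names below are illustrative assumptions about the change stream, not TableStore's actual channel record format.

```python
def hybrid_ingest(snapshot, changes):
    """Hybrid (full + incremental) ingestion sketch:
    phase 1 materializes the full snapshot keyed by primary key,
    phase 2 replays the change stream over it in order."""
    table = {row["pk"]: dict(row) for row in snapshot}   # full load
    for chg in changes:                                  # incremental replay
        if chg["op"] == "PUT":
            table[chg["pk"]] = chg["row"]
        elif chg["op"] == "DELETE":
            table.pop(chg["pk"], None)
    return table

result = hybrid_ingest(
    [{"pk": 1, "v": "a"}, {"pk": 2, "v": "b"}],
    [{"op": "PUT", "pk": 3, "row": {"pk": 3, "v": "c"}},
     {"op": "DELETE", "pk": 1}])
```

Ordering matters here: replaying changes only after the snapshot is complete is what keeps the final state consistent.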
SLS Log Real‑Time Ingestion
SLS is Alibaba Cloud's log service. Once a simple SLS ingestion template is configured (project, logstore, and so on), logs are streamed continuously into the data lake and become available for immediate analysis.
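A minimal SLS template might look as follows. Only the `project` and `logstore` parameters are named in the article; every other field here is an assumption for illustration.

```python
# Illustrative SLS ingestion template; only "project" and "logstore"
# come from the article, the remaining fields are assumptions.
sls_template = {
    "template_type": "SLS_REALTIME",
    "source": {"project": "my-project", "logstore": "app-logs"},
    "sink": {"format": "delta", "path": "oss://my-bucket/lake/app_logs"},
}
```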
Summary and Outlook
The one‑stop ingestion capability dramatically reduces the cost of bringing heterogeneous sources into a centralized OSS‑backed data lake, satisfies the timeliness requirements of sources like SLS and DTS, and supports real‑time source changes. Future work will expand supported source types, enrich template capabilities (including custom ETL), and continue performance optimizations for better latency and stability.
For more data lake discussions, join the Alibaba Data Lake technical DingTalk group.