How Alibaba Cloud DataWorks Leverages Flink CDC for Scalable Data Lake Integration

Alibaba Cloud DataWorks' Data Integration platform, built on Flink CDC, offers a serverless solution for real-time and batch data lake ingestion. This article covers its architecture, elastic scaling, a productized use case, and the future roadmap, including AI-driven diagnostics and expanded source support.

Alibaba Cloud Big Data AI Platform

Jan 23, 2025

Abstract

This article compiles a talk given by Chen Jitong of the Alibaba Cloud DataWorks Data Integration team at Flink Forward Asia 2024 on how Flink CDC is applied to data lake integration in DataWorks. It is organized into four parts: an introduction to DataWorks Data Integration, the architecture and principles of the lake-ingestion solution, a productized case study, and future plans.

Alibaba Cloud DataWorks Data Integration Introduction

DataWorks Data Integration has a long history on Alibaba Cloud. Launched in 2014, it evolved through multiple DataX versions and has provided real-time synchronization services since 2020. In 2023 the platform adopted a new engine based on Flink CDC and DataX, and in 2024 it added lake ingestion, elastic scaling, and Serverless synchronization services.

The platform positions itself as the core hub for data cloud migration, aiming to deliver a reliable, secure, low‑cost, and elastically scalable data synchronization platform across heterogeneous sources.

DataWorks supports over 50 offline source types and more than ten real‑time source types, handling complex network environments (IDC, VPC, etc.) and providing end‑to‑end solutions such as batch migration, incremental sync, and one‑click real‑time sharding. It also offers traffic control, dirty‑data handling, resource monitoring, and multi‑channel alert notifications.

Daily synchronization volume reaches roughly 10 PB and roughly 10 trillion records. The platform serves more than 130 business units (BUs) across 21 public cloud regions, as well as 180+ dedicated cloud customers.

DataWorks Lake‑Ingestion Solution Architecture and Principles

The platform architecture consists of four layers: Access, Control, Engine, and Resource.

Access Layer: Users configure tasks via Open API, Web UI, or JSON Spec, including task creation, start/stop, and source definition.

Control Layer: Handles job checks, configuration, task lifecycle, metric monitoring, and alerting.

Engine Layer: Includes a Catalog Server for source metadata and a Flink CDC‑based stream‑batch engine (re‑engineered from DataX).

Resource Layer: Powered by Alibaba Serverless Infrastructure.

Key Architectural Features

1. Provides a one‑stop solution for structural migration and full‑incremental sync, supporting both DML and DDL events to ensure data integrity. Rich T‑node capabilities enable string replacement, data masking, JSON parsing, filtering, and logical deletion.

2. Supports real‑time multi‑table writes, PK‑shuffle distribution to avoid hotspots, and elastic scaling to optimize performance and cost.
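The PK-shuffle idea in item 2 can be illustrated with a minimal sketch. It assumes a stable hash over the table name and primary key; the function name `route_by_primary_key` and the use of CRC32 are illustrative choices, not the engine's actual implementation (Flink itself assigns keys through its own key-group hashing).

```python
import zlib
from collections import Counter


def route_by_primary_key(table: str, pk: tuple, parallelism: int) -> int:
    """Pick a writer subtask from a stable hash of (table, primary key).

    Hypothetical helper: CRC32 is only a stand-in for the engine's hash.
    """
    key = f"{table}|{'|'.join(map(str, pk))}".encode("utf-8")
    return zlib.crc32(key) % parallelism


# Rows of one busy table are spread over all writers instead of one,
# while a given key always lands on the same writer, so changes to the
# same row keep their order.
load = Counter(route_by_primary_key("orders", (i,), 4) for i in range(10_000))
same_writer = {route_by_primary_key("orders", (42,), 4) for _ in range(3)}
```

Routing by key rather than by table is what prevents a single hot table from saturating one writer; the trade-off is that every writer must be able to write to every target table.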

Engine Architecture Based on Flink CDC

The engine supports multiple source types (MySQL, PostgreSQL, Kafka, Loghub) through Flink CDC Source, covering both full and incremental ingestion.

Event parsing converts source events into Insert, Update, Delete, and Alter operations.

Data is hash‑distributed by primary key to avoid hotspot tables.

TableMapping maps events to target tables, offering T‑node functions such as string replacement, masking, JSON parsing, filtering, and logical deletion.

On the sink side, the engine supports lake formats (Paimon, Hudi, Iceberg), replays DML and DDL on the target, and can optionally sync metadata to DLF while storing data in OSS or OSS-HDFS.
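To make the parse, map, and transform stages more concrete, here is a small sketch in plain Python. It is not the engine's code: `ChangeEvent`, `apply_pipeline`, the rule format, and the column names (`phone`, `attrs`, `status`, `is_deleted`) are all assumptions made for illustration.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ChangeEvent:
    op: str                      # "insert", "update", "delete" or "alter"
    table: str
    row: dict = field(default_factory=dict)


Transform = Callable[[ChangeEvent], Optional[ChangeEvent]]


def drop_test_rows(event):
    # Filtering: returning None removes the event from the stream.
    return None if event.row.get("status") == "test" else event


def mask_phone(event):
    # Masking: keep the last four digits, star out the rest.
    if "phone" in event.row:
        event.row["phone"] = re.sub(r"\d(?=\d{4})", "*", event.row["phone"])
    return event


def expand_json(column):
    # JSON parsing: lift the keys of a JSON string column into the row.
    def transform(event):
        raw = event.row.pop(column, None)
        if raw is not None:
            event.row.update(json.loads(raw))
        return event
    return transform


def logical_delete(event):
    # Logical deletion: turn a delete into an update that sets a flag.
    if event.op == "delete":
        return ChangeEvent("update", event.table, {**event.row, "is_deleted": True})
    return event


def map_table(rules, table):
    # Table mapping: first matching regex wins, else keep the source name.
    for pattern, target in rules:
        if re.fullmatch(pattern, table):
            return target
    return table


def apply_pipeline(event, rules, transforms):
    event = ChangeEvent(event.op, map_table(rules, event.table), dict(event.row))
    for transform in transforms:
        event = transform(event)
        if event is None:
            return None
    return event


rules = [(r"orders_\d+", "ods_orders")]   # merge sharded tables into one target
transforms = [drop_test_rows, mask_phone, expand_json("attrs"), logical_delete]
```

With these rules, an insert on `orders_03` is written to `ods_orders` with a masked phone number, and a delete on any shard becomes an update that sets `is_deleted`.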

Full‑Incremental Lake Ingestion Process

The workflow includes three steps:

Structure Migration: Extract source schema, map to target schema, generate and execute DDL.

Full Sync: Migrate historical data from source to target.

Incremental Sync: Align checkpoints, then capture and sync real‑time changes and schema evolution.
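The three steps can be sketched with in-memory stand-ins. This is a simplification under stated assumptions, not the product's algorithm: the type map is hypothetical, the target is a dict keyed by primary key, and checkpoint alignment is modeled by recording a log offset before the snapshot and replaying everything after it with idempotent upserts and deletes.

```python
# Hypothetical source-to-lake type mapping, for illustration only.
TYPE_MAP = {"int": "INT", "bigint": "BIGINT", "varchar": "STRING", "datetime": "TIMESTAMP"}


def build_ddl(table, columns, primary_key):
    """Step 1: derive a target DDL statement from the source schema."""
    cols = ", ".join(f"{name} {TYPE_MAP[src_type]}" for name, src_type in columns)
    return (f"CREATE TABLE IF NOT EXISTS {table} "
            f"({cols}, PRIMARY KEY ({primary_key}) NOT ENFORCED)")


def full_sync(snapshot_rows, target, pk):
    """Step 2: copy the historical snapshot into the target."""
    for row in snapshot_rows:
        target[row[pk]] = dict(row)


def incremental_sync(change_log, start_offset, target, pk):
    """Step 3: replay changes after the recorded offset.

    Upserts and deletes are idempotent, so changes the snapshot already
    contains can be replayed again without corrupting the final state.
    """
    last = start_offset
    for offset, op, row in change_log:
        if offset <= start_offset:
            continue
        if op == "delete":
            target.pop(row[pk], None)
        else:
            target[row[pk]] = dict(row)
        last = offset
    return last


change_log = [
    (1, "insert", {"id": 1, "v": "a"}),
    (2, "insert", {"id": 2, "v": "b"}),
    (3, "update", {"id": 1, "v": "c"}),
    (4, "delete", {"id": 2}),
    (5, "insert", {"id": 3, "v": "d"}),
]
start_offset = 2                                          # recorded before the snapshot
snapshot = [{"id": 1, "v": "c"}, {"id": 2, "v": "b"}]     # taken at log position 3

ddl = build_ddl("ods_orders", [("id", "bigint"), ("v", "varchar")], "id")
target = {}
full_sync(snapshot, target, "id")
last_offset = incremental_sync(change_log, start_offset, target, "id")
```

Here the update at offset 3 is applied twice (once through the snapshot, once through the log), yet the target still ends up identical to the source.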

Challenges include high resource consumption during full sync, especially during peak business periods.

Elastic Scaling Architecture with AutoCopilot

AutoCopilot enables dynamic resource scaling based on user-defined parameters. The DataWorks control system interacts with the resource scheduler and passes the parameters to Flink VPP, which coordinates with Flink VVR to adjust task resources. When the adjustment completes, a message is sent back so that the recorded resource allocation is updated. Both scheduled and intelligent tuning are supported.
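As a rough illustration of scaling driven by user-defined parameters, the sketch below decides the next allocation from the current synchronization lag. The talk does not describe AutoCopilot's actual rules, so `ScalingPolicy`, its fields, the doubling and halving steps, and the use of compute units (CU) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # Hypothetical user-defined parameters.
    min_cu: int              # lower bound on allocated compute units
    max_cu: int              # upper bound on allocated compute units
    scale_up_lag_s: float    # grow when lag exceeds this many seconds
    scale_down_lag_s: float  # shrink when lag falls below this many seconds


def next_allocation(current_cu: int, lag_s: float, policy: ScalingPolicy) -> int:
    """Return the allocation to request for the next interval."""
    if lag_s > policy.scale_up_lag_s:
        wanted = current_cu * 2          # falling behind: double resources
    elif lag_s < policy.scale_down_lag_s:
        wanted = current_cu // 2         # comfortably ahead: halve resources
    else:
        wanted = current_cu              # within the band: leave unchanged
    return max(policy.min_cu, min(policy.max_cu, wanted))


policy = ScalingPolicy(min_cu=2, max_cu=16, scale_up_lag_s=60, scale_down_lag_s=5)
```

A policy of this shape lets a task run large during the full sync phase and shrink once it settles into incremental sync, which is where the cost saving would come from.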

Productized Case Study

A customer built a lake‑ingestion pipeline that continuously syncs MySQL data to a Paimon table, enabling downstream processing and analytics. Leveraging the engine’s performance and elastic scaling, the customer reduced costs by approximately 50%.

Future Planning

Expand Cloud User Scenarios: Add support for more source types such as Oracle and Hive.

AI‑Driven Task Diagnosis: Use large language models to provide self‑service operation capabilities.

Data Quality Verification: Offer periodic or real‑time source‑target data comparison to ensure consistency and completeness.
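A source-target comparison of the kind described in the last item could look roughly like the following. The planned feature is not specified in the talk, so this is only one possible approach: rows are fingerprinted with SHA-256 and compared by primary key, and the names `row_digest` and `compare_tables` are hypothetical.

```python
import hashlib


def row_digest(row: dict) -> str:
    """Fingerprint a row independently of column order."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_tables(source_rows, target_rows, pk):
    """Report keys that are missing, unexpected, or different in the target."""
    src = {row[pk]: row_digest(row) for row in source_rows}
    tgt = {row[pk]: row_digest(row) for row in target_rows}
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


report = compare_tables(
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}],
    [{"v": "a", "id": 1}, {"id": 2, "v": "x"}, {"id": 4, "v": "d"}],
    "id",
)
```

For large tables a real check would more likely compare per-partition counts and aggregated checksums first and only drill down to rows where those disagree.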
