Big Data 14 min read

How EMR‑StarRocks & Flink CDC Simplify Real‑Time Data Warehousing

This article explains how Alibaba Cloud EMR‑StarRocks integrates with Flink CDC, outlines common real‑time ingestion pain points, and introduces the CTAS/CDAS and Connector‑V2 features that streamline table creation, schema evolution, and resource‑efficient streaming for large‑scale analytics.

Alibaba Cloud Big Data AI Platform

Nov 25, 2022

How EMR‑StarRocks & Flink CDC Simplify Real‑Time Data Warehousing

EMR‑StarRocks Overview

Alibaba Cloud EMR launched the StarRocks service, a next‑generation MPP data warehouse that offers MySQL compatibility, distributed architecture, horizontal partitioning with multi‑replica storage, elastic scaling up to 10 PB, vectorized engine, CBO, and elastic fault tolerance.

Flink‑CDC Concept

Change Data Capture (CDC) captures data changes for synchronization, distribution, and collection. Flink‑CDC streams MySQL binlog changes to downstream data lakes or warehouses. The article uses MySQL CDC as an example to discuss real‑time ingestion into StarRocks and related challenges.

User Pain Points in Real‑Time Ingestion

1. Large SQL development effort : Creating StarRocks tables for dozens of MySQL tables with many columns is labor‑intensive.

2. Complex field type mapping : Mapping MySQL types to Flink and StarRocks types is error‑prone, especially for many columns.

3. Schema change handling : Frequent schema changes require manual task stops, schema updates on MySQL, StarRocks, and Flink, and restart with savepoints.

4. Resource‑heavy CDC tasks : High memory and CPU consumption for many tables and high‑velocity incremental data.

CTAS & CDAS Solutions

EMR‑StarRocks and Flink introduced CTAS (Create Table As) and CDAS (Create Database As) to address the first three pain points. A single SQL statement can create StarRocks tables, launch Flink‑CDC jobs, and handle schema evolution automatically.

CTAS Syntax

CREATE TABLE IF NOT EXISTS runoob_tbl1 WITH (</code><code>'starrocks.create.table.properties'='engine=olap primary key(runoob_id) distributed by hash(runoob_id) buckets 8',</code><code>'database-name'='test_cdc',</code><code>'jdbc-url'='jdbc:mysql://172.16.xx.xx:9030',</code><code>'load-url'='172.16.xx.xx:8030',</code><code>'table-name'='runoob_tbl_sr',</code><code>'username'='test',</code><code>'password'='123456',</code><code>'sink.buffer-flush.interval-ms'='5000'</code><code>) AS TABLE mysql.test_cdc.runoob_tbl /*+ OPTIONS ( 'connector'='mysql-cdc', 'hostname'='rm-xxxx.mysql.rds.aliyuncs.com', 'port'='3306', 'username'='test', 'password'='123456', 'database-name'='test_cdc', 'table-name'='runoob_tbl' )*/;

The starrocks.create.table.properties field carries MySQL‑to‑StarRocks specific configurations such as key type, partitioning, and bucket number.

CTAS Simple Mode

CREATE TABLE IF NOT EXISTS runoob_tbl1 WITH (</code><code>'starrocks.create.table.properties'='buckets 8',</code><code>'starrocks.create.table.mode'='simple',</code><code>'database-name'='test_cdc',</code><code>'jdbc-url'='jdbc:mysql://172.16.xx.xx:9030',</code><code>'load-url'='172.16.xx.xx:8030',</code><code>'table-name'='runoob_tbl_sr',</code><code>'username'='test',</code><code>'password'='123456',</code><code>'sink.buffer-flush.interval-ms'='5000'</code><code>) AS TABLE mysql.test_cdc.runoob_tbl /*+ OPTIONS ( ... )*/;

Simple mode defaults to a primary‑key model using the MySQL primary key and hash partitioning, allowing users to create tables by specifying only the table name.

CTAS Principle

After executing CTAS, Flink automatically creates a StarRocks target table mirroring the MySQL source schema, establishes Sink/Source mappings, and launches a CDC job that syncs full and incremental data while monitoring and applying schema changes in real time.

CDAS Overview

CDAS extends CTAS to synchronize an entire MySQL database, generating a Flink job that creates multiple StarRocks tables and corresponding CDC tasks. Users can select specific tables with the INCLUDING TABLE clause.

CREATE DATABASE IF NOT EXISTS sr_db WITH (</code><code>'starrocks.create.table.properties'='buckets 8',</code><code>'starrocks.create.table.mode'='simple',</code><code>'database-name'='test_cdc',</code><code>'jdbc-url'='jdbc:mysql://172.16.xx.xx:9030',</code><code>'load-url'='172.16.xx.xx:8030',</code><code>'username'='test',</code><code>'password'='123456',</code><code>'sink.buffer-flush.interval-ms'='5000'</code><code>) AS TABLE mysql.test_cdc.runoob_tbl INCLUDING TABLE 'tbl1','tbl2','tbl3' /*+ OPTIONS ( ... )*/;

Key Features of CTAS & CDAS

Multiple CDC tasks can share a single Flink job, reducing memory and CPU usage.

Source merging consolidates all table sources into one, lowering MySQL read pressure.

Supported schema change types: add column, delete column (by writing null), and rename column (add new column and write null for old).

Connector‑V2 Introduction

Connector‑V2 addresses the fourth pain point by using a two‑phase commit to reduce memory consumption. Phase 1 writes data in small batches (invisible), Phase 2 commits all batches atomically, leveraging StarRocks’ Begin, Prepare, and Commit interfaces.

Practical Use at 汇量

In 汇量’s advertising analysis, CDAS synchronizes dozens of small dimension tables in real time, while large fact tables are refreshed offline. Combining real‑time small tables with offline large tables enables low‑latency, cost‑effective analytics with exactly‑once guarantees via Flink checkpoints.

Overall, CTAS/CDAS and Connector‑V2 dramatically simplify development, reduce operational overhead, and improve resource efficiency for real‑time data pipelines on EMR‑StarRocks.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data StarRocks Flink CDC CDAS Connector V2 CTAS Real-time Data Ingestion

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.