How EMR‑StarRocks & Flink CDC Simplify Real‑Time Data Warehousing
This article explains how Alibaba Cloud EMR‑StarRocks integrates with Flink CDC, outlines common real‑time ingestion pain points, and introduces the CTAS/CDAS and Connector‑V2 features that streamline table creation, schema evolution, and resource‑efficient streaming for large‑scale analytics.
EMR‑StarRocks Overview
Alibaba Cloud EMR launched the StarRocks service, a next‑generation MPP data warehouse that offers MySQL compatibility, distributed architecture, horizontal partitioning with multi‑replica storage, elastic scaling up to 10 PB, vectorized engine, CBO, and elastic fault tolerance.
Flink‑CDC Concept
Change Data Capture (CDC) captures data changes for synchronization, distribution, and collection. Flink‑CDC streams MySQL binlog changes to downstream data lakes or warehouses. The article uses MySQL CDC as an example to discuss real‑time ingestion into StarRocks and related challenges.
User Pain Points in Real‑Time Ingestion
1. Large SQL development effort : Creating StarRocks tables for dozens of MySQL tables with many columns is labor‑intensive.
2. Complex field type mapping : Mapping MySQL types to Flink and StarRocks types is error‑prone, especially for many columns.
3. Schema change handling : Frequent schema changes require manual task stops, schema updates on MySQL, StarRocks, and Flink, and restart with savepoints.
4. Resource‑heavy CDC tasks : High memory and CPU consumption for many tables and high‑velocity incremental data.
CTAS & CDAS Solutions
EMR‑StarRocks and Flink introduced CTAS (Create Table As) and CDAS (Create Database As) to address the first three pain points. A single SQL statement can create StarRocks tables, launch Flink‑CDC jobs, and handle schema evolution automatically.
CTAS Syntax
CREATE TABLE IF NOT EXISTS runoob_tbl1 WITH (</code><code>'starrocks.create.table.properties'='engine=olap primary key(runoob_id) distributed by hash(runoob_id) buckets 8',</code><code>'database-name'='test_cdc',</code><code>'jdbc-url'='jdbc:mysql://172.16.xx.xx:9030',</code><code>'load-url'='172.16.xx.xx:8030',</code><code>'table-name'='runoob_tbl_sr',</code><code>'username'='test',</code><code>'password'='123456',</code><code>'sink.buffer-flush.interval-ms'='5000'</code><code>) AS TABLE mysql.test_cdc.runoob_tbl /*+ OPTIONS ( 'connector'='mysql-cdc', 'hostname'='rm-xxxx.mysql.rds.aliyuncs.com', 'port'='3306', 'username'='test', 'password'='123456', 'database-name'='test_cdc', 'table-name'='runoob_tbl' )*/;The starrocks.create.table.properties field carries MySQL‑to‑StarRocks specific configurations such as key type, partitioning, and bucket number.
CTAS Simple Mode
CREATE TABLE IF NOT EXISTS runoob_tbl1 WITH (</code><code>'starrocks.create.table.properties'='buckets 8',</code><code>'starrocks.create.table.mode'='simple',</code><code>'database-name'='test_cdc',</code><code>'jdbc-url'='jdbc:mysql://172.16.xx.xx:9030',</code><code>'load-url'='172.16.xx.xx:8030',</code><code>'table-name'='runoob_tbl_sr',</code><code>'username'='test',</code><code>'password'='123456',</code><code>'sink.buffer-flush.interval-ms'='5000'</code><code>) AS TABLE mysql.test_cdc.runoob_tbl /*+ OPTIONS ( ... )*/;Simple mode defaults to a primary‑key model using the MySQL primary key and hash partitioning, allowing users to create tables by specifying only the table name.
CTAS Principle
After executing CTAS, Flink automatically creates a StarRocks target table mirroring the MySQL source schema, establishes Sink/Source mappings, and launches a CDC job that syncs full and incremental data while monitoring and applying schema changes in real time.
CDAS Overview
CDAS extends CTAS to synchronize an entire MySQL database, generating a Flink job that creates multiple StarRocks tables and corresponding CDC tasks. Users can select specific tables with the INCLUDING TABLE clause.
CREATE DATABASE IF NOT EXISTS sr_db WITH (</code><code>'starrocks.create.table.properties'='buckets 8',</code><code>'starrocks.create.table.mode'='simple',</code><code>'database-name'='test_cdc',</code><code>'jdbc-url'='jdbc:mysql://172.16.xx.xx:9030',</code><code>'load-url'='172.16.xx.xx:8030',</code><code>'username'='test',</code><code>'password'='123456',</code><code>'sink.buffer-flush.interval-ms'='5000'</code><code>) AS TABLE mysql.test_cdc.runoob_tbl INCLUDING TABLE 'tbl1','tbl2','tbl3' /*+ OPTIONS ( ... )*/;Key Features of CTAS & CDAS
Multiple CDC tasks can share a single Flink job, reducing memory and CPU usage.
Source merging consolidates all table sources into one, lowering MySQL read pressure.
Supported schema change types: add column, delete column (by writing null), and rename column (add new column and write null for old).
Connector‑V2 Introduction
Connector‑V2 addresses the fourth pain point by using a two‑phase commit to reduce memory consumption. Phase 1 writes data in small batches (invisible), Phase 2 commits all batches atomically, leveraging StarRocks’ Begin, Prepare, and Commit interfaces.
Practical Use at 汇量
In 汇量’s advertising analysis, CDAS synchronizes dozens of small dimension tables in real time, while large fact tables are refreshed offline. Combining real‑time small tables with offline large tables enables low‑latency, cost‑effective analytics with exactly‑once guarantees via Flink checkpoints.
Overall, CTAS/CDAS and Connector‑V2 dramatically simplify development, reduce operational overhead, and improve resource efficiency for real‑time data pipelines on EMR‑StarRocks.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
