
Build Real-Time Data Lake Analytics with Flink, Paimon, and EMR Serverless Spark

This guide demonstrates how to use Alibaba Cloud's EMR Serverless Spark and Flink Serverless services together with Apache Paimon to ingest streaming data, perform interactive queries, and schedule offline compaction jobs, creating a unified real‑time and batch data lake solution.

Alibaba Cloud Big Data AI Platform

Serverless Spark Job Scheduling

Serverless Spark also supports job scheduling. After you publish a developed task, you can create a workflow, orchestrate its tasks, and configure a scheduling policy so that jobs run periodically, for example compaction (Compact) of Paimon tables.

On the task development page, write and publish the Paimon Compact SQL.

CALL paimon.sys.compact(
  table => 'test_paimon_db.test_append_tbl',
  partitions => 'dt="2024-06-04",hh="12"',
  order_strategy => 'zorder',
  order_by => 'category'
);
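A scheduled run usually targets a different partition each time, so the statement above would need its dt and hh values filled in per run. The following is a minimal Python sketch of that idea, assuming the job compacts the hour before its scheduled run time. The helper names build_compact_sql and previous_hour_partition are hypothetical and not part of Paimon or EMR Serverless Spark; the table, column, and option values are the ones from the statement above.

```python
from datetime import datetime, timedelta


def build_compact_sql(table, dt, hh, order_by, order_strategy="zorder"):
    """Render the CALL statement shown above for one dt/hh partition."""
    partitions = f'dt="{dt}",hh="{hh}"'
    return (
        "CALL paimon.sys.compact(\n"
        f"  table => '{table}',\n"
        f"  partitions => '{partitions}',\n"
        f"  order_strategy => '{order_strategy}',\n"
        f"  order_by => '{order_by}'\n"
        ")"
    )


def previous_hour_partition(run_time):
    """Return (dt, hh) for the hour before the scheduled run time (assumed policy)."""
    target = run_time - timedelta(hours=1)
    return target.strftime("%Y-%m-%d"), target.strftime("%H")


# Example: a run scheduled at 13:05 on 2024-06-04 targets partition hh=12.
dt, hh = previous_hour_partition(datetime(2024, 6, 4, 13, 5))
sql = build_compact_sql("test_paimon_db.test_append_tbl", dt, hh, "category")
print(sql)
```

In a PySpark task the resulting string could be passed to spark.sql(sql); whether the scheduler exposes the run time as a variable depends on the workflow configuration and is not shown here.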

On the workflow page, create a workflow, add a node, and bind it to the published task.

Each workflow node can specify its own engine version and Spark configuration, such as:

spark.sql.extensions                org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
spark.sql.catalog.paimon            org.apache.paimon.spark.SparkCatalog
spark.sql.catalog.paimon.metastore  dlf
spark.sql.catalog.paimon.warehouse  oss://test/warehouse
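The configuration above is a list of whitespace-separated key/value pairs. As a rough sketch of how the same settings could be reused outside the console, the Python below parses that text into a dictionary and applies it to any builder object that offers a config(key, value) method, which is the shape of PySpark's SparkSession.builder. The names CONF_TEXT, parse_spark_conf, and apply_conf are hypothetical helpers introduced for illustration.

```python
# The four settings from the workflow node, copied verbatim.
CONF_TEXT = """
spark.sql.extensions                org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
spark.sql.catalog.paimon            org.apache.paimon.spark.SparkCatalog
spark.sql.catalog.paimon.metastore  dlf
spark.sql.catalog.paimon.warehouse  oss://test/warehouse
"""


def parse_spark_conf(text):
    """Parse 'key  value' lines into a dict, skipping blank and comment lines."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split(None, 1)  # split on the first run of whitespace
        conf[key] = value.strip()
    return conf


def apply_conf(builder, conf):
    """Apply each setting through builder.config(key, value) and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


conf = parse_spark_conf(CONF_TEXT)
for key, value in conf.items():
    print(f"{key} = {value}")
```

With PySpark installed, apply_conf(SparkSession.builder, conf).getOrCreate() would be one way to start a session with these settings. This assumes the Paimon and DLF dependencies are already on the classpath, which the serverless service normally takes care of.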

Manually run the workflow.

After successful execution, verify the Compact effect by querying the Paimon $files system table.

SELECT file_path, record_count, file_size_in_bytes FROM `paimon`.`test_paimon_db`.`test_append_tbl$files` WHERE partition='[2024-06-04, 12]';
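One way to compare what this query returns before and after compaction is to reduce the rows to a few totals. The sketch below assumes each row is a (file_path, record_count, file_size_in_bytes) tuple, matching the selected columns. The function name summarize_files is hypothetical, and the sample rows are made up for illustration; they are not the article's results.

```python
def summarize_files(rows):
    """Summarize (file_path, record_count, file_size_in_bytes) rows."""
    rows = list(rows)
    count = len(rows)
    total_records = sum(r[1] for r in rows)
    total_bytes = sum(r[2] for r in rows)
    return {
        "file_count": count,
        "record_count": total_records,
        "total_bytes": total_bytes,
        "avg_file_bytes": total_bytes / count if count else 0.0,
    }


# Made-up rows for illustration only; not results from the article.
before = [
    ("f-0.parquet", 100, 1000),
    ("f-1.parquet", 150, 1400),
    ("f-2.parquet", 50, 600),
]
after = [("c-0.parquet", 300, 2500)]

print("before:", summarize_files(before))
print("after: ", summarize_files(after))
```

For a successful compaction one would expect the file count to drop while the record count stays the same. Total size may change because sorted data can compress differently.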

Result before Compact:

Result after Compact:
