Big Data 8 min read

How to Deploy a PySpark Streaming Job on EMR Serverless Spark

This guide walks you through creating a Kafka‑enabled EMR Serverless Spark cluster, configuring network connections and security groups, uploading JARs and Python resources, and finally launching and monitoring a PySpark streaming application.

Alibaba Cloud Big Data AI Platform

Jul 19, 2024

How to Deploy a PySpark Streaming Job on EMR Serverless Spark

Prerequisites

Workspace has been created (see link).

Operation Process

Step 1: Create a real‑time data stream cluster and generate messages

On the EMR on ECS page, create a cluster that includes Kafka service.

Switch to the directory cd /var/log/emr/taihao_exporter.

Create a Kafka topic named taihaometrics with 10 partitions and replication factor 2:

kafka-topics.sh --partitions 10 --replication-factor 2 --bootstrap-server core-1-1:9092 --topic taihaometrics --create

Send messages to the topic:

tail -f metrics.log | kafka-console-producer.sh --broker-list core-1-1:9092 --topic taihaometrics

Step 2: Add a network connection

Enter the Network Connection page via EMR Serverless → Spark → target workspace → Network Connection.

Click “Add Network Connection” and configure:

Connection name (e.g., connection_to_emr_kafka).

VPC: select the same VPC as the EMR cluster.

Switch: select the same switch as the EMR cluster.

When the status shows “Success”, the connection is added.

Step 3: Add security group rules for the EMR cluster

Obtain the subnet of the cluster’s node switch from the Node Management page.

In the cluster’s Security Group page, add a manual rule:

Port range: 9092.

Authorized CIDR: the subnet obtained in the previous step (do not use 0.0.0.0/0).

Step 4: Upload JAR packages to OSS

Upload all JAR files from kafka.zip to OSS (see simple upload guide).

Step 5: Upload resource files

In EMR Serverless Spark → Resource Upload, click “Upload File”.

Select pyspark_ss_demo.py for upload.

Step 6: Create and start the streaming job

In EMR Serverless Spark, go to “Job Development”.

Click “New”, choose Application → PySpark, name the job, and confirm.

Configure the following parameters and save:

Main Python resource: the uploaded pyspark_ss_demo.py.

Engine version: Spark version (see engine version guide).

Run parameters: internal IP of the EMR core node.

Spark configuration (example):

spark.jars oss://<yourBucket>/kafka-lib/commons-pool2-2.11.1.jar,oss://<yourBucket>/kafka-lib/kafka-clients-2.8.1.jar,oss://<yourBucket>/kafka-lib/spark-sql-kafka-0-10_2.12-3.3.1.jar,oss://<yourBucket>/kafka-lib/spark-token-provider-kafka-0-10_2.12-3.3.1.jar
spark.emr.serverless.network.service.name connection_to_emr_kafka

Publish the job and confirm.

Start the streaming job, then go to “Operations”, click “Start”.

Step 7: View logs

Open the “Log Exploration” tab.

In the Driver logs, click stdOut.log to see execution details.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Streaming PySpark EMR Serverless

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.