How to Deploy a PySpark Streaming Job on EMR Serverless Spark
This guide walks you through creating an EMR on ECS cluster that includes Kafka, configuring a network connection and security group rules, uploading JARs and Python resources, and finally launching and monitoring a PySpark streaming job on EMR Serverless Spark.
Prerequisites
A workspace has been created (see link).
Operation Process
Step 1: Create a real‑time data stream cluster and generate messages
On the EMR on ECS page, create a cluster that includes the Kafka service.
Log in to the EMR cluster’s Master node.
Switch to the /var/log/emr/taihao_exporter directory:
cd /var/log/emr/taihao_exporter
Create a Kafka topic named taihaometrics with 10 partitions and replication factor 2:
kafka-topics.sh --partitions 10 --replication-factor 2 --bootstrap-server core-1-1:9092 --topic taihaometrics --create
Send messages to the topic:
tail -f metrics.log | kafka-console-producer.sh --broker-list core-1-1:9092 --topic taihaometrics
Step 2: Add a network connection
Enter the Network Connection page via EMR Serverless → Spark → target workspace → Network Connection.
Click “Add Network Connection” and configure:
Connection name (e.g., connection_to_emr_kafka).
VPC: select the same VPC as the EMR cluster.
vSwitch: select the same vSwitch as the EMR cluster.
When the status shows “Success”, the connection is added.
Step 3: Add security group rules for the EMR cluster
On the Node Management page, find the vSwitch used by the cluster's nodes and note its subnet (CIDR block).
In the cluster’s Security Group page, add a manual rule:
Port range: 9092.
Authorized CIDR: the subnet obtained in the previous step (do not use 0.0.0.0/0).
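Before adding the rule, it can help to sanity-check the CIDR you plan to authorize. The following is a minimal sketch using Python's standard ipaddress module. The function names and the sample addresses are illustrative assumptions, not values from this guide; substitute the subnet you obtained from the Node Management page.

```python
import ipaddress


def validate_authorized_cidr(cidr):
    """Parse a CIDR block and reject the open-to-everyone range."""
    # strict=True raises ValueError if host bits are set (e.g. 192.168.0.5/24).
    network = ipaddress.ip_network(cidr, strict=True)
    if network.prefixlen == 0:
        raise ValueError(
            "0.0.0.0/0 would open port 9092 to every address; "
            "use the subnet of the cluster's vSwitch instead"
        )
    return network


def is_allowed(source_ip, cidr):
    """Return True if source_ip falls inside the authorized CIDR block."""
    return ipaddress.ip_address(source_ip) in validate_authorized_cidr(cidr)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(is_allowed("192.168.0.15", "192.168.0.0/24"))
    print(is_allowed("10.0.0.1", "192.168.0.0/24"))
```

A check like this only confirms that the address arithmetic is right; the rule itself still has to be added on the Security Group page as described above.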
Step 4: Upload JAR packages to OSS
Upload all JAR files from kafka.zip to OSS (see simple upload guide).
Step 5: Upload resource files
In EMR Serverless Spark → Resource Upload, click “Upload File”.
Select pyspark_ss_demo.py for upload.
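The contents of pyspark_ss_demo.py are not shown in this guide. As a rough sketch of what such a script might look like, the example below reads the taihaometrics topic with Spark Structured Streaming and prints each message to the console. It assumes the broker's internal IP is passed as the first run parameter and that Kafka listens on port 9092; the helper name kafka_options, the application name, and the console sink are assumptions made for illustration, not the original script.

```python
import sys


def kafka_options(broker_ip, topic="taihaometrics", port=9092):
    """Build the options for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": f"{broker_ip}:{port}",
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def main(argv):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark_ss_demo_sketch").getOrCreate()

    # argv[1] is assumed to be the internal IP of the EMR core node.
    stream = (
        spark.readStream.format("kafka")
        .options(**kafka_options(argv[1]))
        .load()
    )

    # Kafka delivers the payload as bytes; cast it to a readable string.
    lines = stream.selectExpr("CAST(value AS STRING) AS line")

    query = (
        lines.writeStream.format("console")
        .outputMode("append")
        .option("truncate", "false")
        .start()
    )
    query.awaitTermination()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

With a console sink, the records appear in the driver's standard output, which is where the log step at the end of this guide looks.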
Step 6: Create and start the streaming job
In EMR Serverless Spark, go to “Job Development”.
Click “New”, choose Application → PySpark, name the job, and confirm.
Configure the following parameters and save:
Main Python resource: the uploaded pyspark_ss_demo.py.
Engine version: Spark version (see engine version guide).
Run parameters: internal IP of the EMR core node.
Spark configuration (example):
spark.jars oss://<yourBucket>/kafka-lib/commons-pool2-2.11.1.jar,oss://<yourBucket>/kafka-lib/kafka-clients-2.8.1.jar,oss://<yourBucket>/kafka-lib/spark-sql-kafka-0-10_2.12-3.3.1.jar,oss://<yourBucket>/kafka-lib/spark-token-provider-kafka-0-10_2.12-3.3.1.jar
spark.emr.serverless.network.service.name connection_to_emr_kafka
Publish the job and confirm.
After publishing, go to “Operations” and click “Start” to start the streaming job.
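The long spark.jars value in the Spark configuration for this step is easy to mistype. As an optional convenience, the sketch below assembles the two configuration lines from a bucket name and the network connection name. The JAR file names and the kafka-lib prefix are taken from the example in this guide; the function names and the my-bucket placeholder are assumptions for illustration.

```python
# JAR file names as listed in the example Spark configuration.
KAFKA_JARS = [
    "commons-pool2-2.11.1.jar",
    "kafka-clients-2.8.1.jar",
    "spark-sql-kafka-0-10_2.12-3.3.1.jar",
    "spark-token-provider-kafka-0-10_2.12-3.3.1.jar",
]


def build_spark_conf(bucket, connection_name, prefix="kafka-lib"):
    """Return the Spark configuration entries as a dict."""
    jars = ",".join(f"oss://{bucket}/{prefix}/{jar}" for jar in KAFKA_JARS)
    return {
        "spark.jars": jars,
        "spark.emr.serverless.network.service.name": connection_name,
    }


def render(conf):
    """Render one 'key value' pair per line, as in the example configuration."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # "my-bucket" is a placeholder; use your own OSS bucket name.
    print(render(build_spark_conf("my-bucket", "connection_to_emr_kafka")))
```

The printed text can be pasted into the Spark configuration field. The connection name must match the one created in Step 2.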
Step 7: View logs
Open the “Log Exploration” tab.
In the Driver logs, click stdOut.log to see execution details.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
