How to Read and Write StarRocks Data with EMR Serverless Spark
This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.
Overview
EMR Serverless Spark is a high‑performance lakehouse product compatible with open‑source Spark, offering end‑to‑end development, debugging, publishing, scheduling, and operations services that simplify big‑data workflows.
StarRocks provides an official Spark Connector for seamless data exchange between Spark and StarRocks. This article demonstrates how to read from and write to StarRocks using EMR Serverless Spark.
Prerequisites
Created a Serverless Spark workspace (see the "Create Workspace" link).
Created an EMR Serverless StarRocks instance (see the "Create Instance" link).
Step 1: Obtain and Upload Spark Connector JAR
Download the appropriate Spark Connector JAR from Maven Central or the StarRocks documentation, then upload it to Alibaba Cloud OSS.
The JAR naming pattern is
starrocks-spark-connector-${spark_version}_${scala_version}-${connector_version}.jar. For example, with engine version esr-4.1.0 (Spark 3.5.2, Scala 2.12) and connector version 1.1.2, the JAR is starrocks-spark-connector-3.5_2.12-1.1.2.jar.
Step 2: Add Network Connection
Retrieve the VPC and switch information from the StarRocks instance details page, then add a network connection in the EMR Serverless Spark workspace.
Navigate to the Spark workspace, open the Network Connection page, and click Add Network Connection .
Enter a connection name and select the VPC and switch obtained from StarRocks.
Step 3: Create Database and Table in StarRocks
Connect to the StarRocks instance and run the following SQL statements:
<span>CREATE DATABASE `testdb`;</span>
<span>CREATE TABLE `testdb`.`score_board` (</span>
<span> `id` int(11) NOT NULL COMMENT "",</span>
<span> `name` varchar(65533) NULL DEFAULT "" COMMENT "",</span>
<span> `score` int(11) NOT NULL DEFAULT "0" COMMENT ""</span>
<span>) ENGINE=OLAP PRIMARY KEY(`id`) COMMENT "OLAP" DISTRIBUTED BY HASH(`id`);</span>Method 1: Use SQL Session or Notebook Session
SQL Session
Create an SQL session, select the appropriate engine version, and add the connector JAR path in the Spark configuration:
spark.user.defined.jars oss://<bucketname>/path/connector.jarThen create a StarRocks table and insert data:
CREATE TABLE score_board USING starrocks OPTIONS (
"starrocks.table.identifier" = "testdb.score_board",
"starrocks.fe.http.url" = "<fe_host>:<fe_http_port>",
"starrocks.fe.jdbc.url" = "jdbc:mysql://<fe_host>:<fe_query_port>",
"starrocks.user" = "<user>",
"starrocks.password" = "<password>"
);
INSERT INTO `score_board` VALUES (1, "starrocks", 100), (2, "spark", 100);Create a temporary view to query the data:
CREATE TEMPORARY VIEW test_view USING starrocks OPTIONS (
"starrocks.table.identifier" = "testdb.score_board",
"starrocks.fe.http.url" = "<fe_host>:<fe_http_port>",
"starrocks.fe.jdbc.url" = "jdbc:mysql://<fe_host>:<fe_query_port>",
"starrocks.user" = "<user>",
"starrocks.password" = "<password>"
);
SELECT * FROM test_view;Notebook Session
Create a Notebook session, configure the same connector JAR, and run Python code to create the table and view:
# Replace with your StarRocks configuration
fe_host = "<fe_host>"
fe_http_port = "<fe_http_port>"
fe_query_port = "<fe_query_port>"
user = "<user>"
password = "<password>"
# Create table
create_table_sql = f"""
CREATE TABLE score_board USING starrocks OPTIONS (
"starrocks.table.identifier" = "testdb.score_board",
"starrocks.fe.http.url" = "{fe_host}:{fe_http_port}",
"starrocks.fe.jdbc.url" = "jdbc:mysql://{fe_host}:{fe_query_port}",
"starrocks.user" = "{user}",
"starrocks.password" = "{password}"
);
"""
spark.sql(create_table_sql)
# Insert data
insert_data_sql = """
INSERT INTO `score_board` VALUES (1, "starrocks", 100), (2, "spark", 100)
"""
spark.sql(insert_data_sql)
# Create view and query
create_view_sql = f"""
CREATE TEMPORARY VIEW test_view USING starrocks OPTIONS (
"starrocks.table.identifier" = "testdb.score_board",
"starrocks.fe.http.url" = "{fe_host}:{fe_http_port}",
"starrocks.fe.jdbc.url" = "jdbc:mysql://{fe_host}:{fe_query_port}",
"starrocks.user" = "{user}",
"starrocks.password" = "{password}"
);
"""
spark.sql(create_view_sql)
result_df = spark.sql("SELECT * FROM test_view")
result_df.show()Method 2: Use Spark Batch Task
Create a Spark batch SQL task, configure the connector JAR path, set engine version, network connection, and Spark parameters, then run the task.
spark.user.defined.jars oss://<bucketname>/path/connector.jarUpload the SQL file spark_sql_starrocks.sql containing the same StarRocks table creation and query statements, then execute the task and view logs via the Run Record and Log Explorer pages.
Conclusion
By following these steps, you can efficiently integrate EMR Serverless Spark with StarRocks for both batch and interactive workloads, leveraging the Spark Connector to perform data ingestion, transformation, and query operations within a fully managed serverless environment.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
