Big Data 14 min read

How to Read and Write StarRocks Data with EMR Serverless Spark

This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.

Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
How to Read and Write StarRocks Data with EMR Serverless Spark

Overview

EMR Serverless Spark is a high‑performance lakehouse product compatible with open‑source Spark, offering end‑to‑end development, debugging, publishing, scheduling, and operations services that simplify big‑data workflows.

StarRocks provides an official Spark Connector for seamless data exchange between Spark and StarRocks. This article demonstrates how to read from and write to StarRocks using EMR Serverless Spark.

Prerequisites

Created a Serverless Spark workspace (see the "Create Workspace" link).

Created an EMR Serverless StarRocks instance (see the "Create Instance" link).

Step 1: Obtain and Upload Spark Connector JAR

Download the appropriate Spark Connector JAR from Maven Central or the StarRocks documentation, then upload it to Alibaba Cloud OSS.

The JAR naming pattern is

starrocks-spark-connector-${spark_version}_${scala_version}-${connector_version}.jar

. For example, with engine version esr-4.1.0 (Spark 3.5.2, Scala 2.12) and connector version 1.1.2, the JAR is starrocks-spark-connector-3.5_2.12-1.1.2.jar.

Step 2: Add Network Connection

Retrieve the VPC and switch information from the StarRocks instance details page, then add a network connection in the EMR Serverless Spark workspace.

Navigate to the Spark workspace, open the Network Connection page, and click Add Network Connection .

Enter a connection name and select the VPC and switch obtained from StarRocks.

Step 3: Create Database and Table in StarRocks

Connect to the StarRocks instance and run the following SQL statements:

<span>CREATE DATABASE `testdb`;</span>
<span>CREATE TABLE `testdb`.`score_board` (</span>
<span>    `id` int(11) NOT NULL COMMENT "",</span>
<span>    `name` varchar(65533) NULL DEFAULT "" COMMENT "",</span>
<span>    `score` int(11) NOT NULL DEFAULT "0" COMMENT ""</span>
<span>) ENGINE=OLAP PRIMARY KEY(`id`) COMMENT "OLAP" DISTRIBUTED BY HASH(`id`);</span>

Method 1: Use SQL Session or Notebook Session

SQL Session

Create an SQL session, select the appropriate engine version, and add the connector JAR path in the Spark configuration:

spark.user.defined.jars oss://<bucketname>/path/connector.jar

Then create a StarRocks table and insert data:

CREATE TABLE score_board USING starrocks OPTIONS (
  "starrocks.table.identifier" = "testdb.score_board",
  "starrocks.fe.http.url" = "<fe_host>:<fe_http_port>",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://<fe_host>:<fe_query_port>",
  "starrocks.user" = "<user>",
  "starrocks.password" = "<password>"
);
INSERT INTO `score_board` VALUES (1, "starrocks", 100), (2, "spark", 100);

Create a temporary view to query the data:

CREATE TEMPORARY VIEW test_view USING starrocks OPTIONS (
  "starrocks.table.identifier" = "testdb.score_board",
  "starrocks.fe.http.url" = "<fe_host>:<fe_http_port>",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://<fe_host>:<fe_query_port>",
  "starrocks.user" = "<user>",
  "starrocks.password" = "<password>"
);
SELECT * FROM test_view;

Notebook Session

Create a Notebook session, configure the same connector JAR, and run Python code to create the table and view:

# Replace with your StarRocks configuration
fe_host = "<fe_host>"
fe_http_port = "<fe_http_port>"
fe_query_port = "<fe_query_port>"
user = "<user>"
password = "<password>"

# Create table
create_table_sql = f"""
CREATE TABLE score_board USING starrocks OPTIONS (
  "starrocks.table.identifier" = "testdb.score_board",
  "starrocks.fe.http.url" = "{fe_host}:{fe_http_port}",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://{fe_host}:{fe_query_port}",
  "starrocks.user" = "{user}",
  "starrocks.password" = "{password}"
);
"""
spark.sql(create_table_sql)

# Insert data
insert_data_sql = """
INSERT INTO `score_board` VALUES (1, "starrocks", 100), (2, "spark", 100)
"""
spark.sql(insert_data_sql)

# Create view and query
create_view_sql = f"""
CREATE TEMPORARY VIEW test_view USING starrocks OPTIONS (
  "starrocks.table.identifier" = "testdb.score_board",
  "starrocks.fe.http.url" = "{fe_host}:{fe_http_port}",
  "starrocks.fe.jdbc.url" = "jdbc:mysql://{fe_host}:{fe_query_port}",
  "starrocks.user" = "{user}",
  "starrocks.password" = "{password}"
);
"""
spark.sql(create_view_sql)
result_df = spark.sql("SELECT * FROM test_view")
result_df.show()

Method 2: Use Spark Batch Task

Create a Spark batch SQL task, configure the connector JAR path, set engine version, network connection, and Spark parameters, then run the task.

spark.user.defined.jars oss://<bucketname>/path/connector.jar

Upload the SQL file spark_sql_starrocks.sql containing the same StarRocks table creation and query statements, then execute the task and view logs via the Run Record and Log Explorer pages.

Conclusion

By following these steps, you can efficiently integrate EMR Serverless Spark with StarRocks for both batch and interactive workloads, leveraging the Spark Connector to perform data ingestion, transformation, and query operations within a fully managed serverless environment.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataStarRocksData IntegrationSparkemr serverless
Alibaba Cloud Big Data AI Platform
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.