How to Connect EMR Serverless Spark with Apache Doris for Seamless Data Processing
This guide explains how to integrate EMR Serverless Spark with the high‑performance Apache Doris analytical database, covering prerequisites, connector download, OSS upload, network configuration, table creation, and both SQL‑session and Notebook examples for reading and writing Doris tables.
Background
EMR Serverless Spark is a high‑performance Lakehouse product for Data+AI that provides a one‑stop data platform, including development, debugging, scheduling, and operations, and is 100% compatible with the open‑source Spark ecosystem.
Apache Doris
Apache Doris is a high‑performance, real‑time analytical database suitable for reporting, ad‑hoc queries, and data‑lake federation acceleration.
Prerequisites
A Serverless Spark workspace has been created.
A Doris cluster has been created (this article uses an EMR on ECS Doris cluster as an example).
Usage Limits
The EMR Serverless Spark engine version must be esr-2.5.0 or later (2.x line), esr-3.1.0 or later (3.x line), or esr-4.1.0 or later (4.x line).
Operation Process
Step 1: Obtain Doris Spark Connector JAR and upload to OSS
Download the appropriate Doris Spark Connector JAR from the official GitHub repository. The JAR naming pattern is:
spark-doris-connector-spark-${spark_version}-${connector_version}.jar
For example, with engine version esr-3.1.0 (Spark 3.4.3, Scala 2.12), download spark-doris-connector-spark-3.4-24.0.0.jar and upload it to an OSS bucket.
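The naming pattern can be turned into a small helper for scripted uploads. This is a minimal sketch, not part of the connector: the function names `connector_jar_name` and `oss_jar_uri` are hypothetical, and the rule that only the major.minor part of the Spark version appears in the file name is an assumption inferred from the esr-3.1.0 example (Spark 3.4.3 maps to `spark-3.4`).

```python
def connector_jar_name(spark_version: str, connector_version: str) -> str:
    """Build the connector JAR file name from the documented naming pattern."""
    # Assumption: the file name carries only major.minor (3.4.3 -> 3.4).
    parts = spark_version.split(".")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"expected a version like '3.4.3', got {spark_version!r}")
    major_minor = ".".join(parts[:2])
    return f"spark-doris-connector-spark-{major_minor}-{connector_version}.jar"


def oss_jar_uri(bucket: str, prefix: str, jar_name: str) -> str:
    """Compose the oss:// URI to pass to Spark once the JAR is uploaded."""
    # Drop empty segments so an empty prefix does not produce a double slash.
    segments = [bucket.strip("/"), prefix.strip("/"), jar_name]
    return "oss://" + "/".join(s for s in segments if s)


if __name__ == "__main__":
    jar = connector_jar_name("3.4.3", "24.0.0")
    print(jar)
    # "my-bucket" and "jars" are placeholders, not values from this guide.
    print(oss_jar_uri("my-bucket", "jars", jar))
```

The resulting URI is the value that the `spark.user.defined.jars` setting expects in Step 4.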
Step 2: Create network connection
Configure a VPC network connection so that EMR Serverless Spark can reach the Doris service. Open the required ports (e.g., HTTP 8031, RPC 9061, WebServer 8041) in the security group.
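Before moving on, it can help to confirm that the ports are actually reachable from a session that uses the network connection. The following is a rough sketch using only the Python standard library; `port_open` and `check_ports` are hypothetical helper names, and the port numbers are the examples quoted above, which may differ on your cluster.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


def check_ports(host: str, ports: dict, timeout: float = 3.0) -> dict:
    """Map each named port to its reachability result."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}


# Example ports from this step; adjust them to your Doris deployment.
DORIS_PORTS = {"HTTP": 8031, "RPC": 9061, "WebServer": 8041}

if __name__ == "__main__":
    # Replace the placeholder with the Doris node's private address.
    print(check_ports("<doris_address>", DORIS_PORTS))
```

A `False` entry usually points to a missing security group rule or a network connection bound to a different VPC, although a TCP check alone cannot tell those two apart.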
Step 3: Create database and table in the EMR Doris cluster
mysql -h 127.0.0.1 -P 9031 -u root
CREATE DATABASE IF NOT EXISTS testdb;
USE testdb;
CREATE TABLE test (id INT, name STRING) PROPERTIES("replication_num" = "1");
INSERT INTO test VALUES (1,'a'), (2,'b'), (3,'c');
SELECT * FROM test;
A successful query returns the three inserted rows.
Step 4: EMR Serverless Spark reads Doris tables
SQL session: Create a SQL session, select the matching engine version, choose the network connection created in Step 2, and add the following Spark configuration to load the connector:
spark.user.defined.jars oss://<bucketname>/path/connector.jar
Then create a temporary view:
CREATE TEMPORARY VIEW test USING doris OPTIONS(
"table.identifier" = "testdb.test",
"fenodes" = "<doris_address>:<http_port>",
"user" = "<user>",
"password" = "<password>"
);
SELECT * FROM test;
The query result confirms a successful read.
Notebook session: Create a Notebook session with the same engine version and network connection, then run:
dorisSparkDF = spark.read.format("doris") \
    .option("doris.table.identifier", "testdb.test") \
    .option("doris.fenodes", "<doris_address>:<http_port>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()
dorisSparkDF.show(3)
Successful output confirms the read operation.
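Notebook code that talks to Doris repeats the same four connector options each time, so they can be collected once and reused. This is a sketch under assumptions: `doris_options`, `read_doris`, and `append_to_doris` are hypothetical helper names, not connector APIs. They rely only on the option keys shown above and on PySpark's standard `options(**kwargs)` method on the reader and writer.

```python
def doris_options(table: str, fenodes: str, user: str, password: str) -> dict:
    """Collect the connector options used by the Notebook examples."""
    # The examples give fenodes in "<doris_address>:<http_port>" form.
    if ":" not in fenodes:
        raise ValueError("fenodes should look like '<doris_address>:<http_port>'")
    return {
        "doris.table.identifier": table,
        "doris.fenodes": fenodes,
        "user": user,
        "password": password,
    }


def read_doris(spark, options: dict):
    """Load a Doris table as a DataFrame using the shared options."""
    return spark.read.format("doris").options(**options).load()


def append_to_doris(df, options: dict) -> None:
    """Append a DataFrame to the Doris table named in the options."""
    df.write.mode("append").format("doris").options(**options).save()
```

In a Notebook session, `read_doris(spark, doris_options("testdb.test", "<doris_address>:<http_port>", "<user>", "<password>")).show(3)` would then be equivalent to the read shown above.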
Step 5: EMR Serverless Spark writes to Doris tables
SQL session:
CREATE TEMPORARY VIEW test_write USING doris OPTIONS(
"table.identifier" = "testdb.test",
"fenodes" = "<doris_address>:<http_port>",
"user" = "<user>",
"password" = "<password>"
);
INSERT INTO test_write VALUES (4,'d'), (5,'e');
SELECT * FROM test_write;
The result includes the new rows.
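The temporary view statement has the same shape for reads and writes, so it can also be generated and run from a Notebook session through `spark.sql`. The helper below is only an illustrative sketch: `doris_view_sql` is a hypothetical name, and rejecting values that contain double quotes is a simplification chosen here to avoid guessing at escaping rules.

```python
def doris_view_sql(view: str, table: str, fenodes: str, user: str, password: str) -> str:
    """Render the CREATE TEMPORARY VIEW statement used in the SQL session."""
    options = {
        "table.identifier": table,
        "fenodes": fenodes,
        "user": user,
        "password": password,
    }
    for key, value in options.items():
        # Simplification: refuse values that would break the quoted literal.
        if '"' in value:
            raise ValueError(f"option {key!r} must not contain a double quote")
    body = ",\n".join(f'  "{key}" = "{value}"' for key, value in options.items())
    return f"CREATE TEMPORARY VIEW {view} USING doris OPTIONS(\n{body}\n)"


if __name__ == "__main__":
    print(doris_view_sql("test_write", "testdb.test",
                         "<doris_address>:<http_port>", "<user>", "<password>"))
```

In a Notebook session, the generated statement could be passed to `spark.sql(...)`, after which `INSERT INTO test_write ...` behaves as in the SQL session above.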
Notebook session:
data = [(7,'f'), (8,'g')]
mockDataDF = spark.createDataFrame(data, ["id","name"])
mockDataDF.write.mode("append").format("doris") \
    .option("doris.table.identifier", "testdb.test") \
    .option("doris.fenodes", "<doris_address>:<http_port>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .save()
dorisSparkDF = spark.read.format("doris") \
    .option("doris.table.identifier", "testdb.test") \
    .option("doris.fenodes", "<doris_address>:<http_port>") \
    .option("user", "<user>") \
    .option("password", "<password>") \
    .load()
dorisSparkDF.show(10)
Successful output confirms data was written.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
