
How to Connect EMR Serverless Spark with Apache Doris for Seamless Data Processing

This guide explains how to integrate EMR Serverless Spark with the high‑performance Apache Doris analytical database, covering prerequisites, connector download, OSS upload, network configuration, table creation, and both SQL‑session and Notebook examples for reading and writing Doris tables.

Alibaba Cloud Big Data AI Platform

Background

EMR Serverless Spark is a high-performance Lakehouse product for Data+AI. It provides a one-stop data platform covering development, debugging, scheduling, and operations, and is 100% compatible with the open-source Spark ecosystem.

Apache Doris

Apache Doris is a high‑performance, real‑time analytical database suitable for reporting, ad‑hoc queries, and data‑lake federation acceleration.

Prerequisites

An EMR Serverless Spark workspace has been created.

A Doris cluster has been created (this article uses an EMR on ECS Doris cluster as an example).

Usage Limits

The EMR Serverless Spark engine version must be esr-2.5.0 or later in the esr-2.x line, esr-3.1.0 or later in the esr-3.x line, or esr-4.1.0 or later in the esr-4.x line.

Operation Process

Step 1: Obtain Doris Spark Connector JAR and upload to OSS

Download the appropriate Doris Spark Connector JAR from the official GitHub repository. The JAR naming pattern is:

spark-doris-connector-spark-${spark_version}-${connector_version}.jar

For example, with engine version esr-3.1.0 (Spark 3.4.3, Scala 2.12), download spark-doris-connector-spark-3.4-24.0.0.jar and upload it to an OSS bucket.
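The naming pattern can be applied mechanically. The sketch below is a hypothetical helper (not part of the connector or of EMR) that derives the file name from a full Spark version and a connector version. It assumes that only the major.minor part of the Spark version appears in the name, as in the esr-3.1.0 example above; check the actual file names on the release page before relying on it.

```python
def connector_jar_name(spark_version: str, connector_version: str) -> str:
    """Build a connector JAR file name following the pattern
    spark-doris-connector-spark-${spark_version}-${connector_version}.jar.

    Assumption: the name carries only the major.minor part of the Spark
    version (e.g. 3.4.3 -> 3.4), as in the example from this step.
    """
    parts = spark_version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"unexpected Spark version: {spark_version!r}")
    major_minor = ".".join(parts[:2])
    return f"spark-doris-connector-spark-{major_minor}-{connector_version}.jar"


# Engine esr-3.1.0 ships Spark 3.4.3; connector version 24.0.0 as in the text.
print(connector_jar_name("3.4.3", "24.0.0"))
# spark-doris-connector-spark-3.4-24.0.0.jar
```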

Step 2: Create network connection

Configure a VPC network connection so that EMR Serverless Spark can reach the Doris service. Open the required ports (e.g., HTTP 8031, RPC 9061, WebServer 8041) in the security group.
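Before moving on, it can help to confirm that the ports are actually reachable. The following is a minimal sketch using only the Python standard library; `check_ports` and `DORIS_PORTS` are hypothetical names introduced here, and the port numbers are the example values from this step. It only tests TCP reachability from wherever it runs (for instance a Notebook session that uses the network connection), not whether Doris itself is healthy.

```python
import socket


def check_ports(host: str, ports: dict, timeout: float = 3.0) -> dict:
    """Return {name: True/False} depending on whether a TCP connection
    to host:port can be opened within the timeout."""
    results = {}
    for name, port in ports.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            # Covers refused connections, timeouts, and unresolvable hosts.
            results[name] = False
    return results


# Example ports from this step; replace them with your cluster's values.
DORIS_PORTS = {"http": 8031, "rpc": 9061, "webserver": 8041}

# Usage ("<doris_address>" is a placeholder for the Doris node address):
# print(check_ports("<doris_address>", DORIS_PORTS))
```

A `False` entry usually points at a missing security group rule or at a network connection bound to the wrong VPC or vSwitch.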

Step 3: Create database and table in the EMR Doris cluster

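# Connect to Doris with the MySQL client (query port 9031 in this example)
mysql -h 127.0.0.1 -P 9031 -u root

-- Then, at the mysql prompt, create a test database and table
CREATE DATABASE IF NOT EXISTS testdb;
USE testdb;
CREATE TABLE test (id INT, name STRING) PROPERTIES("replication_num" = "1");
-- Insert sample rows and verify them
INSERT INTO test VALUES (1,'a'), (2,'b'), (3,'c');
SELECT * FROM test;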

Successful query results are shown below:

Query result

Step 4: EMR Serverless Spark reads Doris tables

SQL session: Create a SQL session, select the matching engine version, choose the network connection created in Step 2, and add the following Spark configuration to load the connector (point the path at the JAR uploaded in Step 1):

spark.user.defined.jars oss://<bucketname>/path/connector.jar

Then create a temporary view:

CREATE TEMPORARY VIEW test USING doris OPTIONS(
  "table.identifier" = "testdb.test",
  "fenodes" = "<doris_address>:<http_port>",
  "user" = "<user>",
  "password" = "<password>"
);
SELECT * FROM test;

If the query returns the rows inserted in Step 3, the read succeeded.

Notebook session: Create a Notebook session with the same engine version, network connection, and connector configuration, then run:

# Read the Doris table into a DataFrame through the connector
dorisSparkDF = spark.read.format("doris") \
  .option("doris.table.identifier", "testdb.test") \
  .option("doris.fenodes", "<doris_address>:<http_port>") \
  .option("user", "<user>") \
  .option("password", "<password>") \
  .load()

# Show the first three rows
dorisSparkDF.show(3)

Successful output confirms the read operation.

Notebook read result
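The same four options recur in every read and write call, so it can be convenient to build them once. The sketch below is one way to do that; `doris_options` is a hypothetical helper introduced here, and 8031 is the example HTTP port from Step 2. The option keys are the ones used in the Notebook example above.

```python
def doris_options(table: str, fe_host: str, http_port: int,
                  user: str, password: str) -> dict:
    """Collect the connector options used in this article into one dict."""
    # The identifier is expected in the form <database>.<table>.
    if table.count(".") != 1 or not all(table.split(".")):
        raise ValueError(f"expected '<database>.<table>', got {table!r}")
    return {
        "doris.table.identifier": table,
        "doris.fenodes": f"{fe_host}:{http_port}",
        "user": user,
        "password": password,
    }


# Placeholders as in the article; 8031 is the example HTTP port from Step 2.
opts = doris_options("testdb.test", "<doris_address>", 8031, "<user>", "<password>")

# In a Notebook session, where `spark` is already defined:
# dorisSparkDF = spark.read.format("doris").options(**opts).load()
# dorisSparkDF.show(3)
```

Passing the dict through `options(**opts)` is equivalent to chaining the individual `.option(...)` calls.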

Step 5: EMR Serverless Spark writes to Doris tables

SQL session:

CREATE TEMPORARY VIEW test_write USING doris OPTIONS(
  "table.identifier" = "testdb.test",
  "fenodes" = "<doris_address>:<http_port>",
  "user" = "<user>",
  "password" = "<password>"
);

INSERT INTO test_write VALUES (4,'d'), (5,'e');
SELECT * FROM test_write;

The result should include the newly inserted rows.

SQL write result

Notebook session:

# Build a small DataFrame and append it to the Doris table
data = [(7,'f'), (8,'g')]
mockDataDF = spark.createDataFrame(data, ["id","name"])
mockDataDF.write.mode("append").format("doris") \
  .option("doris.table.identifier", "testdb.test") \
  .option("doris.fenodes", "<doris_address>:<http_port>") \
  .option("user", "<user>") \
  .option("password", "<password>") \
  .save()

# Read the table back to verify the write
dorisSparkDF = spark.read.format("doris") \
  .option("doris.table.identifier", "testdb.test") \
  .option("doris.fenodes", "<doris_address>:<http_port>") \
  .option("user", "<user>") \
  .option("password", "<password>") \
  .load()

dorisSparkDF.show(10)

Successful output confirms data was written.

Notebook write result
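When `createDataFrame` receives only column names, PySpark infers Python integers as `bigint`, while the Doris table declares `id` as INT. The write above works as shown, but an explicit schema keeps the DataFrame types aligned with the table. The sketch below is optional: `check_rows` and `append_rows` are hypothetical helpers, and the schema string simply mirrors the `testdb.test` definition from Step 3.

```python
# Column types mirror the testdb.test table created in Step 3.
SCHEMA_DDL = "id INT, name STRING"
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_rows(rows):
    """Raise ValueError unless every row is (32-bit int, str)."""
    for row in rows:
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {row!r}")
        row_id, name = row
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(row_id, bool) or not isinstance(row_id, int):
            raise ValueError(f"id must be an int: {row!r}")
        if not INT32_MIN <= row_id <= INT32_MAX:
            raise ValueError(f"id does not fit in INT: {row!r}")
        if not isinstance(name, str):
            raise ValueError(f"name must be a str: {row!r}")
    return rows


def append_rows(spark, rows, options):
    """Append rows to Doris with an explicit schema instead of inference."""
    df = spark.createDataFrame(check_rows(rows), schema=SCHEMA_DDL)
    df.write.format("doris").options(**options).mode("append").save()
    return df


# In a Notebook session, where `spark` is already defined:
# append_rows(spark, [(7, "f"), (8, "g")], {
#     "doris.table.identifier": "testdb.test",
#     "doris.fenodes": "<doris_address>:<http_port>",
#     "user": "<user>",
#     "password": "<password>",
# })
```

Validating rows on the driver first means a malformed record fails fast, before any data is sent to Doris.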