
EMR Studio: Architecture and Features for Simplifying Big Data Development

EMR Studio is a one‑stop, open‑source‑compatible big data development platform that integrates Zeppelin, Jupyter, Airflow and a custom Cluster Manager to streamline job creation, scheduling, monitoring, and cluster switching, thereby addressing common usability challenges in Spark, Flink, Hive, and Presto workflows.


In November 2021, EMR Studio—a next‑generation open‑source big data development platform—was launched for public beta. It seamlessly connects to EMR clusters (both EMR on ECS and EMR on ACK), offering interactive development, job scheduling, and monitoring services that cover ETL, interactive analytics, machine learning, and real‑time computing scenarios.

The big data ecosystem includes powerful frameworks such as Spark, Flink, Hive, and Presto, but they often suffer from high configuration complexity, slow job submission, incomplete error logs, environment inconsistencies across clusters, and cumbersome Python environment management for PySpark.

EMR Studio was created to solve these pain points, providing a unified, open‑source‑based platform that eliminates the need for manual configuration and improves engineer productivity.

The platform’s architecture is 100% compatible with the open‑source big data ecosystem, allowing users to migrate workloads to the cloud without modifying code. EMR Studio comprises four components grouped into three categories:

Zeppelin and Jupyter – interactive notebooks for job development and submission.

Airflow – production‑grade scheduler.

Cluster Manager – management of Hadoop compute clusters.

Zeppelin and Jupyter normally require setting SPARK_HOME and HIVE_CONF_DIR to run jobs; EMR Studio abstracts these settings so users can submit Spark, Flink, Hive, Presto, ClickHouse, and other jobs directly from the notebooks. It also provides a simple three‑step process to create isolated Python environments for PySpark: create and package a Conda environment in Zeppelin, upload the package to OSS, and configure the PySpark job to use the uploaded environment.
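As an illustration of the third step, the sketch below builds the two Spark configuration entries that point a PySpark job at a packaged Conda environment uploaded to OSS. The bucket path, archive name, and helper function are hypothetical; `spark.yarn.dist.archives` and `spark.pyspark.python` are standard Spark properties, but the exact settings EMR Studio applies may differ.

```python
def pyspark_conda_conf(oss_archive: str, env_name: str) -> dict:
    """Build Spark conf entries that point a PySpark job at a packaged
    Conda environment stored on OSS (illustrative paths only)."""
    return {
        # Ship the archive to the executors; YARN unpacks it into
        # a directory named after the fragment following '#'.
        "spark.yarn.dist.archives": f"{oss_archive}#{env_name}",
        # Use the Python interpreter from the unpacked environment.
        "spark.pyspark.python": f"./{env_name}/bin/python",
    }

# Hypothetical bucket and archive produced by packaging a Conda env in Zeppelin
conf = pyspark_conda_conf("oss://my-bucket/envs/pyspark_env.tar.gz", "pyspark_env")
```

The same two settings could equally be passed as `--conf` flags to `spark-submit` or set in a Spark interpreter configuration; the point is that the driver and every executor resolve `./pyspark_env/bin/python` from the unpacked archive, giving the job an isolated Python environment.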

Airflow, the most popular open‑source scheduler, is integrated and enhanced within EMR Studio. Improvements include storing logs and DAG files on OSS, online DAG editing, support for scheduling Zeppelin and Jupyter notebooks, seamless EMR cluster switching, and built‑in Alibaba Cloud monitoring alerts.

Below is an example DAG that uses the ZeppelinOperator to chain three notebook tasks (raw data generation, Spark ETL, and Spark query) into a scheduled workflow:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.zeppelin_operator import ZeppelinOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

with DAG('zeppelin_etl_note_dag',
         max_active_runs=5,
         schedule_interval='0 0 * * *',  # run once a day at midnight
         default_args=default_args) as dag:
    # "{{ ds }}" is an Airflow template variable that renders to the
    # execution date; it is passed to each notebook as the 'dt' parameter.
    execution_date = "{{ ds }}"
    raw_data_task = ZeppelinOperator(
        task_id='raw_data_task',
        conn_id='zeppelin_default',
        note_id='2FZWJTTPS',  # notebook that generates the raw data
        params={'dt': execution_date}
    )
    spark_etl_task = ZeppelinOperator(
        task_id='spark_etl_task',
        conn_id='zeppelin_default',
        note_id='2FX3GJW67',  # notebook that runs the Spark ETL
        params={'dt': execution_date}
    )
    spark_query_task = ZeppelinOperator(
        task_id='spark_query_task',
        conn_id='zeppelin_default',
        note_id='2FZ8H4JPV',  # notebook that queries the ETL output
        params={'dt': execution_date}
    )
    # Chain the notebooks: raw data -> ETL -> query
    raw_data_task >> spark_etl_task >> spark_query_task

The custom Cluster Manager component handles configuration and switching of multiple Hadoop clusters. Users can select a target cluster from a dropdown in the notebook UI, enabling seamless transition from development to production environments.

In summary, EMR Studio offers an open‑source‑friendly, lightweight big data development platform that addresses usability challenges across major frameworks, thereby boosting data engineers' efficiency.

Tags: Data Engineering, Big Data, Apache Spark, Jupyter, Airflow, EMR Studio, Zeppelin
Written by Big Data Technology Architecture, exploring open-source big data and AI technologies.