
EMR Studio: Architecture and Features for Simplifying Big Data Development

EMR Studio is a one‑stop, open‑source‑compatible big data development platform that integrates Zeppelin, Jupyter, Airflow and a custom Cluster Manager to streamline job creation, scheduling, monitoring, and cluster switching, thereby addressing common usability challenges in Spark, Flink, Hive, and Presto workflows.


In November 2021, EMR Studio—a next‑generation open‑source big data development platform—was launched for public beta. It seamlessly connects to EMR clusters (both EMR on ECS and EMR on ACK), offering interactive development, job scheduling, and monitoring services that cover ETL, interactive analytics, machine learning, and real‑time computing scenarios.

The big data ecosystem includes powerful frameworks such as Spark, Flink, Hive, and Presto, but they often suffer from high configuration complexity, slow job submission, incomplete error logs, environment inconsistencies across clusters, and cumbersome Python environment management for PySpark.

EMR Studio was created to solve these pain points, providing a unified, open‑source‑based platform that eliminates the need for manual configuration and improves engineer productivity.

The platform’s architecture is 100% compatible with the open‑source big data ecosystem, allowing users to migrate workloads to the cloud without modifying code. EMR Studio comprises four components grouped into three categories:

Zeppelin and Jupyter – interactive notebooks for job development and submission.

Airflow – production‑grade scheduler.

Cluster Manager – management of Hadoop compute clusters.

Zeppelin and Jupyter normally require setting SPARK_HOME and HIVE_CONF_DIR to run jobs; EMR Studio abstracts these settings so users can submit Spark, Flink, Hive, Presto, ClickHouse, and other jobs directly from the notebooks. It also provides a simple three‑step process to create isolated Python environments for PySpark: create and package a Conda environment in Zeppelin, upload the package to OSS, and configure the PySpark job to use the uploaded environment.
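As an illustration of the third step, the sketch below builds the two Spark configuration entries that point a PySpark job at a packaged Conda environment uploaded to OSS. The bucket path, archive name, and helper function are hypothetical; `spark.yarn.dist.archives` and `spark.pyspark.python` are standard Spark properties, but the exact settings EMR Studio applies may differ.

```python
def pyspark_conda_conf(oss_archive: str, env_name: str) -> dict:
    """Build Spark conf entries that point a PySpark job at a packaged
    Conda environment stored on OSS (illustrative paths only)."""
    return {
        # Ship the archive to the executors; YARN unpacks it into
        # a directory named after the fragment following '#'.
        "spark.yarn.dist.archives": f"{oss_archive}#{env_name}",
        # Use the Python interpreter from the unpacked environment.
        "spark.pyspark.python": f"./{env_name}/bin/python",
    }

# Hypothetical bucket and archive produced by packaging a Conda env in Zeppelin
conf = pyspark_conda_conf("oss://my-bucket/envs/pyspark_env.tar.gz", "pyspark_env")
```

The same two settings could equally be passed as `--conf` flags to `spark-submit` or set in a Spark interpreter configuration; the point is that the driver and every executor resolve `./pyspark_env/bin/python` from the unpacked archive, giving the job an isolated Python environment.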

Airflow, the most popular open‑source scheduler, is integrated and enhanced within EMR Studio. Improvements include storing logs and DAG files on OSS, online DAG editing, support for scheduling Zeppelin and Jupyter notebooks, seamless EMR cluster switching, and built‑in Alibaba Cloud monitoring alerts.

Below is an example DAG that uses the ZeppelinOperator to chain three notebook tasks (raw data generation, Spark ETL, and Spark query) into a scheduled workflow:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.zeppelin_operator import ZeppelinOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

with DAG('zeppelin_etl_note_dag',
         max_active_runs=5,
         schedule_interval='0 0 * * *',  # run once a day at midnight
         default_args=default_args) as dag:
    # "{{ ds }}" is an Airflow template variable that renders to the
    # execution date; it is passed to each notebook as the 'dt' parameter.
    execution_date = "{{ ds }}"
    raw_data_task = ZeppelinOperator(
        task_id='raw_data_task',
        conn_id='zeppelin_default',
        note_id='2FZWJTTPS',  # notebook that generates the raw data
        params={'dt': execution_date}
    )
    spark_etl_task = ZeppelinOperator(
        task_id='spark_etl_task',
        conn_id='zeppelin_default',
        note_id='2FX3GJW67',  # notebook that runs the Spark ETL
        params={'dt': execution_date}
    )
    spark_query_task = ZeppelinOperator(
        task_id='spark_query_task',
        conn_id='zeppelin_default',
        note_id='2FZ8H4JPV',  # notebook that queries the ETL output
        params={'dt': execution_date}
    )
    # Chain the notebooks: raw data -> ETL -> query
    raw_data_task >> spark_etl_task >> spark_query_task

The custom Cluster Manager component handles configuration and switching of multiple Hadoop clusters. Users can select a target cluster from a dropdown in the notebook UI, enabling seamless transition from development to production environments.

In summary, EMR Studio offers an open‑source‑friendly, lightweight big data development platform that addresses usability challenges across major frameworks, thereby boosting data engineers' efficiency.

Tags: Data Engineering, Big Data, Apache Spark, Jupyter, Airflow, EMR Studio, Zeppelin
Written by Big Data Technology Architecture, exploring open-source big data and AI technologies.