How EMR Serverless Spark Powers the Next‑Gen Lakehouse Era

This article traces the evolution of data platforms, explains the rise of lakehouse architecture, and details how Alibaba Cloud's EMR Serverless Spark delivers one‑stop development, high performance, and full ecosystem compatibility, illustrated with real‑world case studies from Midea and Eagle Network.

Alibaba Cloud Big Data AI Platform

Oct 31, 2024

Data Platform Evolution

Database technology emerged in the 1960s and 1970s, and single-node databases dominated until the early 2000s, when the rapid growth of internet applications exposed their scalability limits. Google's papers on the Google File System (2003) and MapReduce (2004) ushered in the big-data era. Hadoop, HBase, and Hive followed, and later Spark, Flink, and Presto became the mainstream processing engines.

The launch of the iPhone in 2007 kicked off the mobile internet and, with it, far more diverse workloads. Around 2017, the growth of multimedia content drove the need for more flexible data frameworks, giving rise to data lakes. In 2022, large language models accelerated AI-generated content, which expanded the volume of unstructured and multimodal data and created new challenges for data processing.

Lakehouse Architecture

Traditional data warehouses are highly structured but come with high storage costs and limited flexibility. Data lakes provide low-cost, flexible storage but lack strong transaction support. Running a lake and a warehouse side by side combines the strengths of both, yet data has to be kept in two places, which introduces consistency problems and redundancy overhead.

The emerging lakehouse architecture integrates data lake storage with warehouse capabilities. It consists of three layers:

Storage layer: ensures data transactionality, consistency, and efficient storage.

Management layer: provides unified metadata management for structured, semi‑structured, and unstructured data, addressing governance and security.

Analytics layer: built on multiple compute engines, with Apache Spark as the flagship project.

Databricks offers a managed Spark-based lakehouse globally, but the Chinese domestic market has lacked a comparable product. Alibaba Cloud introduced EMR Serverless Spark to fill this gap.

EMR Serverless Spark Features

One‑Stop Data Development

EMR Serverless Spark supports job development, debugging, publishing, and scheduling for ETL, interactive analytics, and Python‑based data science. It includes version management, built‑in workflow scheduling, and comprehensive resource monitoring.

Built‑in SQL Editor

The platform provides an SQL editor for interactive and ETL queries, supporting multiple resource queues, session management, and metadata views for table operations.

Notebook Interactive Environment

Beyond SQL, a Notebook environment enables Python development with custom library installation (e.g., Pandas) for AI and data science workloads.

Workflow Scheduling

Developed jobs can be orchestrated via a visual workflow engine, supporting drag‑and‑drop topology design.
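A workflow topology of this kind is a directed acyclic graph: each job runs only after the jobs it depends on have finished. The sketch below is not the product's scheduler. It uses Python's standard `graphlib` module and hypothetical job names to show how a dependency graph resolves into waves of jobs that can run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical job names. Each key maps to the set of jobs that must finish first.
workflow = {
    "ingest_raw": set(),
    "clean_orders": {"ingest_raw"},
    "clean_customers": {"ingest_raw"},
    "join_sales": {"clean_orders", "clean_customers"},
    "publish_report": {"join_sales"},
}


def execution_waves(dag):
    """Group jobs into waves; jobs in the same wave have no pending dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the topology contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(workflow)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the two cleaning jobs land in the same wave because both depend only on `ingest_raw`, which is the kind of parallelism a visual topology makes easy to see.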

Metrics Dashboard

Real‑time dashboards display Spark task metrics such as CPU, memory, JVM, driver, executor I/O, and shuffle, aiding performance tuning and fault diagnosis.

Resource Observation

Resources can be partitioned by department or business line, with dynamic quota adjustments for queues.
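As a rough mental model of queue quotas (not the service's actual API), a queue can be pictured as a counter with an adjustable ceiling. The class name, method names, and the abstract "units" below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourceQueue:
    """Toy model of a per-department queue with an adjustable quota."""
    name: str
    quota: int       # maximum units the queue may hold at once (hypothetical unit)
    in_use: int = 0  # units currently held by running jobs

    def try_acquire(self, units: int) -> bool:
        """Reserve units for a job; refuse if the quota would be exceeded."""
        if units <= 0:
            raise ValueError("units must be positive")
        if self.in_use + units > self.quota:
            return False
        self.in_use += units
        return True

    def release(self, units: int) -> None:
        """Return units when a job finishes."""
        self.in_use = max(0, self.in_use - units)

    def resize(self, new_quota: int) -> None:
        """Adjust the quota; jobs that are already running keep what they hold."""
        if new_quota < 0:
            raise ValueError("quota cannot be negative")
        self.quota = new_quota


etl = ResourceQueue(name="etl", quota=100)
first = etl.try_acquire(80)    # fits within the quota
second = etl.try_acquire(40)   # would exceed 100, so it is refused
etl.resize(150)                # quota raised while jobs are running
third = etl.try_acquire(40)    # now fits: 80 + 40 <= 150
```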

Monitoring & Diagnosis

One-click job diagnostics automatically detect issues such as data skew or excessive garbage collection and provide optimization recommendations.
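The article does not describe how the diagnostics work internally. One common heuristic for spotting data skew, shown here purely as an illustration, is to compare each task's input size with the median for its stage. The function name, the default ratio of 5, and the sample numbers are assumptions.

```python
from statistics import median


def find_skewed_tasks(task_bytes, ratio=5.0):
    """Return ids of tasks whose input exceeds `ratio` times the stage median.

    task_bytes maps a task id to the amount of data that task read.
    """
    if not task_bytes:
        return []
    typical = median(task_bytes.values())
    return sorted(
        task_id for task_id, size in task_bytes.items() if size > ratio * typical
    )


# Made-up input sizes (in MB) for the five tasks of one stage.
stage_input_mb = {0: 120, 1: 110, 2: 130, 3: 2400, 4: 125}
flagged = find_skewed_tasks(stage_input_mb)
print(flagged)
```

The median is used rather than the mean because a single oversized task inflates the mean and can hide the very skew being looked for.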

Fusion Engine Performance

The proprietary Fusion Engine delivers extreme performance: a native C++ vectorized SQL engine leverages SIMD for CPU‑intensive workloads, while a Remote Shuffle Service (based on the open‑source Celeborn project) accelerates I/O‑intensive tasks with multi‑tenant isolation.
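To make the remote shuffle idea concrete, here is a toy, in-memory sketch. It is not Celeborn's protocol or API. Map tasks partition their output by key and push it to a separate service, and each reduce task then reads one partition from that service instead of fetching pieces from every mapper. All class and function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyShuffleService:
    """Stands in for a remote service: stores pushed records grouped by partition."""

    def __init__(self):
        self._blocks = defaultdict(list)

    def push(self, partition_id, records):
        self._blocks[partition_id].extend(records)

    def read(self, partition_id):
        return list(self._blocks.get(partition_id, []))


def run_map_task(records, service, num_partitions):
    """Buffer (key, value) pairs per partition, then push each buffer once."""
    buffers = defaultdict(list)
    for key, value in records:
        buffers[partition_for(key, num_partitions)].append((key, value))
    for partition_id, batch in buffers.items():
        service.push(partition_id, batch)


def run_reduce_task(partition_id, service):
    """Sum values per key for a single partition."""
    totals = defaultdict(int)
    for key, value in service.read(partition_id):
        totals[key] += value
    return dict(totals)


service = ToyShuffleService()
run_map_task([("a", 1), ("b", 2), ("a", 3)], service, num_partitions=2)
run_map_task([("c", 5), ("b", 1)], service, num_partitions=2)

merged = {}
for pid in range(2):
    merged.update(run_reduce_task(pid, service))  # each key lives in one partition
print(merged)
```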

The TPC-DS benchmark results cited for the engine show roughly a 5× speedup over Apache Spark on a 10 TB dataset and roughly a 44% improvement over Databricks on a 100 TB dataset, along with a three-fold cost advantage.

Full Ecosystem Compatibility

EMR Serverless Spark integrates with Alibaba Cloud DLF 2.0 metadata, OSS storage, and Hive Metastore. It supports Livy, Thrift Server, and JDBC access, OpenAPI-based operators for Airflow and DolphinScheduler, and spark-submit.
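As an example of what Livy compatibility means for a client, the sketch below builds a batch submission in the shape defined by Apache Livy's REST API (`POST /batches` with a JSON body containing `file`, `args`, `conf`, and `name`). The endpoint URL and the OSS path are placeholders, the request is prepared but never sent, and authentication is omitted because the article does not describe how the service's gateway handles it.

```python
import json
from urllib import request


def build_livy_batch(file, args=(), conf=None, name=None):
    """Build the JSON body for a Livy batch submission."""
    payload = {"file": file, "args": list(args)}
    if conf:
        payload["conf"] = dict(conf)
    if name:
        payload["name"] = name
    return payload


def make_batch_request(endpoint, payload):
    """Prepare (but do not send) a POST request to the Livy /batches resource."""
    return request.Request(
        url=endpoint.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and job location; both are assumptions for illustration.
payload = build_livy_batch(
    file="oss://example-bucket/jobs/daily_etl.py",
    args=["--date", "2024-10-31"],
    conf={"spark.executor.instances": "4"},
    name="daily-etl",
)
req = make_batch_request("https://livy.example.invalid", payload)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would submit the job to a real Livy-compatible endpoint once credentials are attached.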

Customer Cases

Midea Group

Midea built a lakehouse entirely on EMR Serverless Spark. Spark Streaming ingests industrial device data into Hudi-based storage. Spark jobs then handle compaction and ETL, Notebooks with custom Python libraries run AI analytics, and the results are exported to StarRocks for BI.
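The article gives no configuration details for this pipeline, so the following is only a sketch of how a streaming write into Hudi is typically set up from PySpark. The option keys are standard Hudi datasource settings. The table name, field names, and the choice of a merge-on-read table (which pairs naturally with the compaction step mentioned above) are assumptions.

```python
def hudi_upsert_options(table, record_key, precombine_field, partition_field):
    """Return Hudi write options for upserting into a merge-on-read table."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
        # Assumption: merge-on-read, so small log files are compacted later.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    }


def start_hudi_sink(stream_df, options, table_path, checkpoint_path):
    """Start writing a streaming DataFrame to a Hudi table.

    stream_df is expected to be a PySpark streaming DataFrame, for example one
    produced by spark.readStream. Returns the resulting StreamingQuery.
    """
    return (
        stream_df.writeStream
        .format("hudi")
        .options(**options)
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )


# Hypothetical table and column names for device telemetry.
opts = hudi_upsert_options(
    table="device_events",
    record_key="event_id",
    precombine_field="event_ts",
    partition_field="dt",
)
```

Calling `start_hudi_sink` requires a Spark session with the Hudi bundle on its classpath. Only the options helper runs standalone.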

Eagle Network (Shanghai)

Eagle Network adopted an EMR Serverless Spark‑based solution, using Flink CDC for data ingestion into Paimon, processing with Spark, and orchestrating workflows via Airflow and DolphinScheduler operators. They also leverage StarRocks for OLAP and Superset for visualization, with support for overseas regions.

Demo

A demo showcases building a lakehouse for an automotive sales scenario, covering data loading, ETL, visualization, and predictive analysis using EMR Serverless Spark, DLF 2.0, and Paimon.
