
How Kubernetes Powers Cloud‑Native Big Data with EMR on ACK

This article explains the shift of big data and machine‑learning workloads toward storage‑compute separation and cloud‑native architectures, outlines the technical challenges of running Spark on Kubernetes, and details the EMR on ACK solution with its architecture, performance gains, and real‑world adoption.

Alibaba Cloud Native

Jan 9, 2023


Overview of Cloud‑Native Big Data on Kubernetes

Big‑data and machine‑learning workloads are moving toward a storage‑compute separation model and adopting cloud‑native platforms. Spark, for example, can run on traditional Hadoop schedulers in on‑premises environments, but on public clouds it needs to take advantage of elastic resources, centralized operations, and object‑storage services. This shift has driven many Spark‑on‑Kubernetes deployments.

Technical Challenges of Cloud‑Native Big Data

Building an HDFS‑compatible file system on Alibaba Cloud Object Storage (OSS) that matches HDFS performance while lowering cost.

Separating shuffle data from compute nodes and supporting heterogeneous Alibaba Cloud Container Service (ACK) node types.

Enabling Spark dynamic resource allocation in a cloud‑native context (e.g., SPARK‑25299, the upstream Spark issue on persisting shuffle data in remote storage).
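To make the dynamic‑allocation challenge concrete, here is a minimal sketch that assembles the relevant Spark properties and renders them as spark-submit arguments. The property names are standard upstream Spark settings; the helper names (dynamic_allocation_conf, to_submit_args) and the executor limits are assumptions chosen for illustration. Shuffle tracking is the upstream mechanism for clusters without an external shuffle service, and is not necessarily what EMR on ACK uses.

```python
# Sketch only: helper names and values are hypothetical.
# The property keys themselves are standard Spark configuration keys.

def dynamic_allocation_conf(min_executors, max_executors):
    """Build Spark properties for dynamic allocation on Kubernetes."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Without an external shuffle service, Spark keeps executors
        # alive while they still hold shuffle data needed by active jobs.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of spark-submit arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(dynamic_allocation_conf(2, 50))))
```

When shuffle data lives in a remote service instead of on executors, idle executors can be released without losing intermediate results, which is why the shuffle question and the dynamic‑allocation question are usually solved together.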

Kubernetes‑Based Scheduling Optimizations

Once workloads run on Kubernetes, the focus shifts to three things: eliminating scheduler performance bottlenecks to reach YARN‑level throughput, implementing multi‑level queue management, and using peak‑valley scheduling to shift workloads to off‑peak periods for higher cluster utilization.
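Multi‑level queue management can be pictured as a tree of quotas in which a job is admitted only if every level from its leaf queue up to the root has headroom. The following is a minimal sketch of that idea, not the ACK scheduler's implementation; the Queue class, the try_admit function, and all quota numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    """A node in a hypothetical queue hierarchy with a CPU-core quota."""
    name: str
    cpu_quota: int                     # cores usable by this queue and its children
    parent: Optional["Queue"] = None
    cpu_used: int = 0

    def chain_to_root(self):
        """Yield this queue, then each ancestor up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


def try_admit(queue, cpu_request):
    """Admit the request only if every level in the chain has headroom."""
    chain = list(queue.chain_to_root())
    if any(q.cpu_used + cpu_request > q.cpu_quota for q in chain):
        return False
    for q in chain:
        q.cpu_used += cpu_request
    return True


if __name__ == "__main__":
    root = Queue("root", cpu_quota=100)
    analytics = Queue("analytics", cpu_quota=60, parent=root)
    etl = Queue("etl", cpu_quota=40, parent=analytics)
    adhoc = Queue("adhoc", cpu_quota=30, parent=analytics)

    print(try_admit(etl, 40))    # fits at every level
    print(try_admit(adhoc, 30))  # rejected: parent "analytics" would exceed 60
    print(try_admit(adhoc, 20))  # fits: "analytics" reaches exactly 60
```

Checking the whole chain before mutating any counter keeps a rejected request from leaving partial usage behind on some levels.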

EMR 2.0 on ACK Architecture

In December, Alibaba Cloud released EMR 2.0, which can be deployed directly on the ACK platform. This decouples big‑data job execution from underlying cluster management, allowing users to concentrate on data processing logic. Open‑source engines such as Spark, Presto, and Flink run on ACK with full compatibility and performance that exceeds upstream versions.

Key Architectural Features

Lightweight control plane that integrates with existing data platforms.

Job submission from data‑development or scheduling clusters to multiple execution back‑ends.

Off‑peak (peak‑valley) scheduling based on business load patterns.

Cloud‑native data‑lake architecture leveraging ACK’s elastic scaling.

ACK manages heterogeneous node types, providing flexible resource mixes.
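As a rough illustration of off‑peak (peak‑valley) scheduling, the sketch below holds deferrable jobs until a configured off‑peak window opens, while latency‑sensitive jobs run immediately. The window boundaries and the function names are assumptions for this example; a real deployment would derive the window from observed business load.

```python
from datetime import datetime, time

# Hypothetical off-peak window; it crosses midnight (22:00 to 06:00).
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)


def in_off_peak(now):
    """Return True if `now` falls inside the off-peak window."""
    t = now.time()
    if OFF_PEAK_START <= OFF_PEAK_END:
        # Window contained within a single day.
        return OFF_PEAK_START <= t < OFF_PEAK_END
    # Window wraps past midnight.
    return t >= OFF_PEAK_START or t < OFF_PEAK_END


def should_run_now(deferrable, now):
    """Non-deferrable jobs always run; deferrable jobs wait for off-peak."""
    return (not deferrable) or in_off_peak(now)


if __name__ == "__main__":
    afternoon = datetime(2023, 1, 9, 14, 0)
    night = datetime(2023, 1, 9, 23, 30)
    print(should_run_now(deferrable=True, now=afternoon))
    print(should_run_now(deferrable=True, now=night))
    print(should_run_now(deferrable=False, now=afternoon))
```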

Performance‑Focused Advantages

Remote Shuffle Service: Provides storage‑compute separation for intermediate shuffle data, allowing compute nodes to run without local or cloud disks.

Spark Dynamic Resource Allocation: Fully supports SPARK‑25299 for on‑the‑fly executor scaling.

JindoFS Acceleration: Optimizes OSS access; Block mode delivers a >15% performance gain on a 1 TB TPC‑DS benchmark.

Scheduler Framework V2: Improves scheduling throughput by >3× compared with the community scheduler and adds multi‑level queue management.

Engine Enhancements: EMR Spark achieves 3× higher throughput than the open‑source version on a 10 TB TPC‑DS benchmark; Hudi and Delta Lake receive functional and performance upgrades.

Comprehensive Off‑Peak Scheduling: Enables coordinated batch and streaming jobs to share the same ACK cluster, increasing overall machine utilization.
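The idea behind a remote shuffle service can be shown with a toy model: map tasks push partitioned records to a shared service instead of writing local shuffle files, and reduce tasks fetch their partition from that service. This is an in‑memory sketch of the concept only; ToyShuffleService and the task functions are hypothetical and do not reflect the EMR Remote Shuffle Service API.

```python
import zlib
from collections import defaultdict


class ToyShuffleService:
    """In-memory stand-in for a remote shuffle service (illustrative only)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._data = defaultdict(list)

    def push(self, key, value):
        # CRC32 gives a stable partition id across processes,
        # unlike Python's per-process randomized str hash.
        pid = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        self._data[pid].append((key, value))

    def fetch(self, partition_id):
        return list(self._data[partition_id])


def map_task(lines, service):
    """Emit (word, 1) pairs to the service; nothing is kept on the mapper."""
    for line in lines:
        for word in line.split():
            service.push(word, 1)


def reduce_task(partition_id, service):
    """Aggregate counts for every key that landed in one partition."""
    counts = defaultdict(int)
    for key, value in service.fetch(partition_id):
        counts[key] += value
    return dict(counts)


if __name__ == "__main__":
    service = ToyShuffleService(num_partitions=2)
    map_task(["spark on ack", "spark on oss"], service)
    map_task(["ack"], service)
    totals = {}
    for pid in range(service.num_partitions):
        totals.update(reduce_task(pid, service))
    print(totals)
```

Because the mappers hold no shuffle state after pushing, the nodes that ran them could be released or replaced before the reducers start, which is the property that lets compute nodes run without local disks.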

Real‑World Deployment Example

The advertising technology provider Mobvista (汇量科技) has operated EMR for four years. After upgrading to EMR 2.0, the company observed multiple‑fold improvements in data synchronization and query latency for its material platform and heat‑engine services, along with higher system stability and the elimination of previous CPU, memory, and I/O bottlenecks.

Reference

For detailed documentation, see https://help.aliyun.com/document_detail/280450.html
