
How Alibaba Cloud EMR Evolved from Open‑Source Compatibility to Enterprise‑Grade Performance

This article outlines Alibaba Cloud EMR's three‑stage evolution—compatibility, contribution, and beyond open source—detailing its early Hadoop adoption, Flink and Spark innovations, cloud‑native optimizations, and enterprise‑grade features such as Remote Shuffle Service, performance benchmarks, and integrated diagnostics.

Alibaba Cloud Big Data AI Platform

Abstract: This article is compiled from the presentation given by Alibaba Cloud senior technical expert Wu Wei at the Alibaba Cloud EMR 2.0 online launch. It covers three stages: compatibility with open source, contribution to open source, and going beyond open source.

Compatibility with Open Source

Contribution to Open Source

Beyond Open Source

Compatibility with Open Source

Open source has gained momentum in recent years, especially in big data, where open‑source technologies drive the field's evolution and have become de facto industry standards. Alibaba Cloud EMR integrates mainstream engines such as Spark, Flink, and StarRocks on a shared compute and data‑lake foundation, while staying compatible with the rest of the Alibaba Cloud ecosystem.

Alibaba began investing in open‑source big data over a decade ago; today its open‑source platform is a core component of Alibaba's data‑technology system.

In 2008‑2009, Alibaba's e‑commerce business grew explosively, prompting the adoption of Apache Hadoop. The first cluster had 200 nodes; it scaled to 1,000 nodes within a year and later exceeded 10,000 nodes across data centers, which shows how critical open‑source big data became to Alibaba's core business.

Contribution to Open Source

After 2014, Alibaba accumulated extensive open‑source experience and launched the EMR product on Alibaba Cloud in 2016, meeting growing demand for cloud‑based big data.

In 2015, the need for real‑time recommendation led Alibaba to adopt Apache Flink. Flink went into production in 2016, and Alibaba later open‑sourced its internal Blink branch along with other projects such as Flink CDC and Flink Table Store.

Since its public‑cloud launch in 2016, EMR has served thousands of enterprises, upgraded from classic Hadoop to a data‑lake architecture with compute‑storage separation, and partnered with global open‑source vendors like Elasticsearch, Cloudera, and Databricks.

With the rise of cloud‑native architectures, EMR introduced a Remote Shuffle Service (RSS) that supports all major engines (Spark, Hive, Flink). Open‑sourced in 2021, the service attracted contributions from companies such as Xiaomi, Shopee, and NetEase. In October 2022 it was donated to the Apache Incubator as Apache Celeborn, marking Alibaba's first Apache incubator project.

Beyond Open Source

Beyond compatibility and contribution, EMR Spark delivers enterprise‑grade capabilities. It has set industry records in benchmarks such as CloudSort (100 TB sort at $1.44 per TB) and TPC‑DS, becoming the first public‑cloud product certified by TPC.

Performance optimizations include a native code‑gen engine, SIMDJSON‑accelerated JSON parsing, a new Join Reorder algorithm, Bloom‑filter‑based partition pruning, and deep integration with JindoFS for storage‑compute separation.
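To make Bloom‑filter‑based partition pruning concrete, here is a minimal Python sketch. It is not EMR Spark's implementation: the `BloomFilter` class, the `prune_partitions` helper, and the filter sizes are all hypothetical and chosen only for illustration. The idea is to build a compact filter from the join keys on the small (dimension) side and use it to skip fact‑table partitions that cannot possibly match.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def prune_partitions(fact_partitions, dim_keys):
    """Keep only fact partitions whose partition key might match a join key.

    fact_partitions: dict mapping partition key -> rows (hypothetical layout).
    dim_keys: join keys that survived filtering on the dimension side.
    """
    bloom = BloomFilter()
    for key in dim_keys:
        bloom.add(key)
    return {
        part: rows
        for part, rows in fact_partitions.items()
        if bloom.might_contain(part)
    }
```

Because a Bloom filter never yields false negatives, pruning this way cannot drop a matching partition; a false positive only costs an unnecessary scan, so the join result is unchanged.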

Shuffle performance is enhanced by Push‑Based Shuffle using Celeborn’s Remote Shuffle Service and columnar Shuffle with compression, reducing I/O and network traffic dramatically for massive jobs.
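The I/O saving from push‑based shuffle can be illustrated with a toy model. The sketch below is a simplification under stated assumptions, not Celeborn's actual protocol: `partition_of`, `pull_based_shuffle`, and `push_based_shuffle` are hypothetical names, and "fetches" simply counts how many separate blocks the reducers have to read.

```python
from collections import defaultdict


def partition_of(key: str, num_reducers: int) -> int:
    # Deterministic partitioner (Python's built-in hash() is salted per process).
    return sum(key.encode()) % num_reducers


def pull_based_shuffle(map_outputs, num_reducers):
    """Each mapper keeps its own blocks; reducers fetch one block per mapper."""
    blocks = {}  # (mapper_id, reducer_id) -> list of records
    for mapper_id, records in enumerate(map_outputs):
        for key, value in records:
            slot = (mapper_id, partition_of(key, num_reducers))
            blocks.setdefault(slot, []).append((key, value))
    reducer_inputs = defaultdict(list)
    fetches = 0
    for (_, reducer_id), block in blocks.items():
        reducer_inputs[reducer_id].extend(block)
        fetches += 1  # one small read per (mapper, reducer) block
    return dict(reducer_inputs), fetches


def push_based_shuffle(map_outputs, num_reducers):
    """Mappers push records to a remote service that merges them per partition."""
    merged = defaultdict(list)  # reducer_id -> merged partition data
    for records in map_outputs:
        for key, value in records:
            merged[partition_of(key, num_reducers)].append((key, value))
    fetches = len(merged)  # one sequential read per non-empty partition
    return dict(merged), fetches
```

In this model, pull‑based shuffle needs up to M × R small reads for M mappers and R reducers, while the push‑based variant needs at most R larger sequential reads. Both deliver the same records to each reducer.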

EMR Doctor provides enterprise diagnostics: it asynchronously collects Spark job metrics and stores them in a metadata warehouse for offline analysis, then reports health scores and optimization suggestions. History‑based optimization (HBO) can improve TPC‑DS query performance by 28 %.
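The general idea behind history‑based optimization can be sketched as follows. This is an assumption‑laden illustration, not how EMR implements HBO: `RuntimeHistory`, `choose_join`, the plan "signature" strings, and the 10 MiB threshold (which mirrors the default of Spark's `spark.sql.autoBroadcastJoinThreshold`) are chosen only for the example. Sizes observed in earlier runs replace the planner's static estimate when picking a join strategy.

```python
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # illustrative 10 MiB cutoff


class RuntimeHistory:
    """Stores the observed output size of plan fragments from earlier runs."""

    def __init__(self):
        self._observed = {}  # plan signature -> list of observed sizes (bytes)

    def record(self, signature: str, size_bytes: int) -> None:
        self._observed.setdefault(signature, []).append(size_bytes)

    def estimate(self, signature: str, static_estimate: int) -> int:
        sizes = self._observed.get(signature)
        if not sizes:
            return static_estimate  # no history yet: use the planner's guess
        return max(sizes)  # conservative: plan for the largest size seen


def choose_join(history: RuntimeHistory, signature: str, static_estimate: int) -> str:
    """Pick a join strategy from the best available size estimate."""
    size = history.estimate(signature, static_estimate)
    if size <= BROADCAST_THRESHOLD_BYTES:
        return "broadcast_hash_join"
    return "sort_merge_join"
```

On a first run with a pessimistic static estimate, the sketch picks a sort‑merge join; once a run records that the filtered side is actually small, later runs of the same fragment switch to a broadcast join.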

EMR Spark runs on Alibaba Cloud Container Service for Kubernetes (ACK) and Elastic Container Instance (ECI). It supports both native ACK deployment and independent RSS clusters, which enables dynamic resource scaling similar to Hadoop YARN.

Integration with Alibaba Cloud's fully managed HDFS (OSS‑HDFS) and Data Lake Formation (DLF) adds native readers for Parquet/ORC, small‑file merging, and lifecycle management. Support for Delta Lake and Hudi includes slowly changing dimension (SCD) handling, checkpointing, time travel, and a managed Hudi Metastore, with early adoption of Hudi CDC.
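Small‑file merging is essentially a packing problem: group files that are below some size threshold into rewrite tasks that each produce one file close to a target size. The sketch below uses a first‑fit‑decreasing heuristic. It is a hypothetical illustration, not the algorithm EMR or DLF uses; `plan_compaction` and its parameters are invented for the example, and it assumes the small‑file threshold does not exceed the target size.

```python
def plan_compaction(file_sizes, target_bytes, small_threshold_bytes):
    """Group small files into merge tasks whose total stays within target_bytes.

    file_sizes: dict mapping file name -> size in bytes.
    Returns a list of groups; each group is a list of file names to merge.
    """
    # Only files below the threshold are candidates; largest first, ties by name.
    small = sorted(
        ((name, size) for name, size in file_sizes.items()
         if size < small_threshold_bytes),
        key=lambda item: (-item[1], item[0]),
    )
    groups = []  # each entry: {"files": [...], "total": int}
    for name, size in small:
        for group in groups:
            if group["total"] + size <= target_bytes:
                group["files"].append(name)
                group["total"] += size
                break
        else:
            # No existing group has room, so start a new one.
            groups.append({"files": [name], "total": size})
    # Rewriting a single file on its own gains nothing, so skip such groups.
    return [g["files"] for g in groups if len(g["files"]) > 1]
```

For example, with a 100‑byte target and threshold, files of 60, 50, 40, and 10 bytes are packed into two merge tasks (60 + 40 and 50 + 10), while a 500‑byte file is left alone.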

Overall, EMR Spark builds on open‑source foundations to deliver a mature, high‑performance, cost‑effective, and cloud‑native big‑data solution.
