
Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Alibaba Cloud Big Data AI Platform

Jun 10, 2025

Background

The "One System" automotive parts supply‑chain platform supplies high‑quality engine, transmission, and chassis components. It combines physical resources with internet technology to give retailers efficient downstream channels and reliable products, supporting their digital transformation.

Facing growing demand for real‑time analytics, AI capabilities, and massive volumes of semi‑structured data, the on‑premises big‑data platform hit limits in storage, elasticity, cost, and operational complexity. Serverless data‑compute architectures in the cloud made a next‑generation, cloud‑native data platform feasible.

Why Choose Alibaba Cloud EMR Serverless Spark

EMR Serverless Spark is a high‑performance Lakehouse product compatible with open‑source Spark. It offers end‑to‑end services for development, debugging, publishing, scheduling, and operations, simplifying big‑data workflows by removing the need to manage clusters and supporting both batch and streaming workloads.

Rich features: permission management, resource quotas, task isolation, full Spark API compatibility.

Flexible billing: pay only for actual CPU, memory, and execution time.

High engine performance: built‑in Spark Native Engine delivers up to 3× speed over open‑source versions.

Robust service guarantees: dynamic resource allocation, no need to handle cluster provisioning, scaling, or fault recovery.
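The "pay only for actual CPU, memory, and execution time" model can be illustrated with a little arithmetic. The sketch below is not Alibaba Cloud's pricing formula: the unit prices, the `JobRun` type, and the function names are all hypothetical, chosen only to show how usage‑based cost compares with keeping the same capacity provisioned around the clock.

```python
from dataclasses import dataclass

# Hypothetical unit prices, chosen only to make the arithmetic visible.
PRICE_PER_CORE_HOUR = 0.05  # currency units per CPU core per hour (assumed)
PRICE_PER_GB_HOUR = 0.01    # currency units per GB of memory per hour (assumed)


@dataclass
class JobRun:
    cores: int        # CPU cores allocated to the job
    memory_gb: float  # memory allocated to the job, in GB
    seconds: float    # wall-clock execution time


def job_cost(run: JobRun) -> float:
    """Cost of one run: resources held multiplied by the time they were held."""
    hours = run.seconds / 3600
    cpu_cost = run.cores * hours * PRICE_PER_CORE_HOUR
    memory_cost = run.memory_gb * hours * PRICE_PER_GB_HOUR
    return cpu_cost + memory_cost


def always_on_cost(cores: int, memory_gb: float, hours: float) -> float:
    """Cost of keeping the same capacity provisioned for a fixed window."""
    return job_cost(JobRun(cores, memory_gb, hours * 3600))


# A 30-minute nightly batch job versus the same capacity held for 24 hours.
nightly = JobRun(cores=16, memory_gb=64, seconds=1800)
print(round(job_cost(nightly), 4))
print(round(always_on_cost(16, 64, 24), 4))
```

With these assumed prices, the gap between the two figures is the idle capacity that a usage‑based model avoids paying for.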

Technical Solution Design

The platform uses EMR Serverless Spark to integrate data and AI, combined with EMR Serverless StarRocks to build a Lakehouse. Key components include:

Upstream data is ingested via DataWorks, written in Apache Paimon format with automatic compaction, and synchronized to DLF for real‑time metadata.

Serverless Spark constructs a classic multi‑layer data warehouse: ODS (real‑time ingestion), DWD (detail layer), DWS (light aggregation), and ADS (high‑quality metrics) to serve business systems.

BI uses DataWorks‑orchestrated StarRocks tasks and asynchronous materialized views to accelerate lake queries, supporting dashboards and reports.

ML/AI leverages DataWorks‑scheduled Spark jobs to compute and aggregate metrics, pushing results to an AI knowledge base for downstream analytics.

Overall, the architecture combines Serverless Spark with an open‑source lake format (Paimon), various ML/AI toolkits, and the unified DLF lake‑warehouse management platform to deliver fast data processing and AI enablement.
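The ODS, DWD, DWS, and ADS layering can be made concrete with a toy example. The sketch below uses plain Python rather than Spark SQL so that it runs anywhere. The order records, field names, and the average‑order‑value metric are hypothetical and not taken from the platform. Only the flow from raw data to cleaned detail, light aggregation, and business metric reflects the layering described above.

```python
from collections import defaultdict

# ODS: raw order events as ingested (records and field names are hypothetical).
ods_orders = [
    {"order_id": "A1", "category": "chassis", "qty": "2", "unit_price": "40.0"},
    {"order_id": "A2", "category": "engine", "qty": "5", "unit_price": "8.5"},
    {"order_id": "A3", "category": "transmission", "qty": None, "unit_price": "120.0"},
    {"order_id": "A4", "category": "engine", "qty": "10", "unit_price": "3.0"},
]


def build_dwd(ods):
    """DWD: drop incomplete rows and cast fields to proper types."""
    rows = []
    for r in ods:
        if r["qty"] is None or r["unit_price"] is None:
            continue
        qty, price = int(r["qty"]), float(r["unit_price"])
        rows.append({
            "order_id": r["order_id"],
            "category": r["category"],
            "qty": qty,
            "amount": qty * price,
        })
    return rows


def build_dws(dwd):
    """DWS: light aggregation per product category."""
    agg = defaultdict(lambda: {"orders": 0, "qty": 0, "amount": 0.0})
    for r in dwd:
        bucket = agg[r["category"]]
        bucket["orders"] += 1
        bucket["qty"] += r["qty"]
        bucket["amount"] += r["amount"]
    return dict(agg)


def build_ads(dws):
    """ADS: a business-facing metric, here average order value per category."""
    return {cat: round(a["amount"] / a["orders"], 2) for cat, a in dws.items()}


dwd = build_dwd(ods_orders)
dws = build_dws(dwd)
ads = build_ads(dws)
print(ads)
```

In a real deployment each function would correspond to a scheduled Spark job that reads one layer's tables and writes the next.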

Data Platform Evolution

The migration proceeded through five stages:

Evaluation: Defined current state and goals, selected a mature unified platform (Alibaba Cloud EMR) supporting both data processing/analysis and data science.

Adaptation: Mapped existing Hadoop tasks, dependencies, and data flows, then adapted jobs to the EMR Serverless environment, ensuring Spark SQL, UDFs, and libraries were compatible.

Migration: Incrementally switched tasks, created new DataWorks workflows, replaced legacy scripts/JARs, and moved data to OSS/OSS‑HDFS for compute‑storage separation.

Optimization: Leveraged the Fusion engine for performance gains, used StarRocks visual SQL analysis, and fine‑tuned resource allocation for cost‑effectiveness.

Governance: Unified platform management via DataWorks for scheduling, monitoring, and governance, combining EMR Serverless Spark and StarRocks to simplify the full data‑processing lifecycle.
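The Adaptation and Migration stages depend on knowing which jobs feed which. One way to turn a dependency map into an incremental cut‑over plan is a topological sort, sketched below with Python's standard `graphlib` module. The job names and the `migration_waves` helper are hypothetical. The article does not describe how the team actually ordered its migration.

```python
from graphlib import TopologicalSorter

# Hypothetical legacy jobs, each mapped to the set of jobs it depends on.
dependencies = {
    "ods_orders": set(),
    "ods_inventory": set(),
    "dwd_orders": {"ods_orders"},
    "dwd_inventory": {"ods_inventory"},
    "dws_sales_by_part": {"dwd_orders", "dwd_inventory"},
    "ads_sales_dashboard": {"dws_sales_by_part"},
}


def migration_waves(deps):
    """Group jobs into waves; every job depends only on jobs in earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the mapping contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(migration_waves(dependencies), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Migrating wave by wave means upstream tables already exist on the new platform before any job that reads them is switched over.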

Serverless Spark Product Advantages

Cloud‑Native Ultra‑Fast Compute Engine: Built‑in Spark Native Engine offers up to 3× performance over open‑source; integrated Celeborn Remote Shuffle Service handles PB‑scale shuffle data, reducing total compute cost by up to 30%.

Elastic Resource Management: Second‑level elasticity, fine‑grained allocation as low as 1 CPU core, task‑ or queue‑level metering for maximum utilization.

Data & AI: Fully compatible PySpark/Python environment, supporting Python ML libraries and Spark MLlib, with managed third‑party dependencies.

Ecosystem Compatibility: Supports DLF and Hive Metastore; is compatible with the Paimon, Iceberg, Hudi, and Delta Lake formats; and integrates with Airflow, DolphinScheduler, Kerberos/LDAP, Ranger, DataWorks, and dbt.
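For readers coming from self‑managed clusters, the closest open‑source analogue to the elastic resource management described above is Spark's dynamic allocation. The sketch below builds those standard open‑source Spark properties and renders them as `spark-submit` style `--conf` arguments. The helper names are hypothetical, and whether EMR Serverless Spark expects these same keys or manages elasticity internally is an assumption, not something the article states.

```python
def dynamic_allocation_conf(min_executors, max_executors,
                            executor_cores=1, executor_memory="4g"):
    """Build open-source Spark properties for elastic executor scaling."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": executor_memory,
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = dynamic_allocation_conf(min_executors=0, max_executors=50)
print(" ".join(to_submit_args(conf)))
```

Setting the minimum to zero mirrors the article's point about saving cost during quiet periods, while the maximum caps spend at peak.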

Benefits After Migration

Technical: Adopted Apache Paimon as lake storage, integrated Spark and Flink as compute engines, achieving mature real‑time monitoring and analysis capabilities.

Development Efficiency: Developing in Spark SQL sessions and scheduling production runs through DataWorks accelerated R&D and ensured timely data delivery for critical business processes.

Operations: Multi‑version management, automatic scaling, and fault recovery reduced manual intervention and operational overhead.

Business: Data response time improved from hours to minutes, and elastic scaling matched workload peaks while saving costs during periods of low usage.

Conclusion and Outlook

By building a new big‑data platform on Alibaba Cloud EMR Serverless Spark, the company gained up to three‑fold performance compared with open‑source Spark, achieved compute‑storage separation, and substantially improved data team efficiency. The transition from traditional Hadoop to Serverless Spark represents a qualitative leap in enterprise data capability, laying a cloud‑native foundation for deeper AI integration and future digital growth.
