Building a Scalable IoT Data Platform with Alibaba EMR Serverless Spark

Midea Building Technology shares how its IoT data platform uses Alibaba Cloud EMR Serverless Spark, a Hudi Lakehouse, and Serverless StarRocks to achieve real-time ingestion, massive-scale processing, AI-driven analytics, and significant performance and cost improvements for building-system management.

Alibaba Cloud Big Data AI Platform

Background

Midea Building Technology, a division of Midea Group, manages a wide range of HVAC and building-automation products deployed in over 200 countries. Rapidly growing device data, its semi-structured nature, and the limited analytical capabilities of the legacy system created a strong need for a unified, elastic, and lightweight IoT data platform that supports large-scale processing, AI, and precise decision-making for energy saving, equipment management, and operations and maintenance.

Architecture Overview

The platform is built on Alibaba Cloud EMR Serverless Spark and adopts a Lakehouse architecture that combines Apache Hudi for lake storage, DLF for metadata synchronization, and Serverless StarRocks for fast analytics. The core components are illustrated in the diagram below.

Lakehouse architecture diagram

Data Ingestion and Lakehouse

Sensor data is first sent to Kafka in the cloud. Serverless Spark Structured Streaming jobs consume it and write it to Apache Hudi tables in real time. The data then flows through three layers:

Bronze: Raw data, appended or upserted into a single source-of-truth Hudi table.

Silver: Cleaned and transformed data, including complex time-series calculations packaged as Pandas UDFs.

Gold: Schema-enforced, high-quality data used for ad-hoc queries and data-science workloads.
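The bronze and silver steps can be sketched roughly as follows. This is a minimal illustration rather than Midea's actual job: the table name `bronze_device_events`, the columns `event_id`, `device_id`, `event_ts`, `dt`, `temperature`, and the three-sample moving average are all assumptions chosen for the example, and the grouped-map `applyInPandas` API is just one way to package a pandas-based calculation.

```python
def hudi_write_options(table, record_key, precombine, partition):
    """Hudi writer options for upserting into a merge-on-read table."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    }


def start_bronze_stream(spark, brokers, topic, table_path, checkpoint_path):
    """Consume a Kafka topic and upsert the raw records into a bronze Hudi table."""
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", topic)
        .option("startingOffsets", "latest")
        .load()
    )
    events = raw.selectExpr(
        # Kafka coordinates give a stable key, so replayed batches upsert cleanly.
        "concat_ws('-', topic, CAST(`partition` AS STRING), CAST(`offset` AS STRING)) AS event_id",
        "CAST(key AS STRING) AS device_id",
        "CAST(value AS STRING) AS payload",
        "timestamp AS event_ts",
        "date_format(timestamp, 'yyyy-MM-dd') AS dt",
    )
    opts = hudi_write_options("bronze_device_events", "event_id", "event_ts", "dt")
    return (
        events.writeStream.format("hudi")
        .options(**opts)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start(table_path)
    )


def rolling_mean(values, window):
    """Trailing moving average; early points average the samples seen so far."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def smooth_device(pdf):
    """Grouped-map function: receives one device's rows as a pandas DataFrame."""
    pdf = pdf.sort_values("event_ts")
    return pdf.assign(temp_avg=rolling_mean(pdf["temperature"].tolist(), window=3))


def build_silver(parsed_df):
    """Clean parsed bronze rows (payload parsing omitted) and add the smoothed column."""
    schema = "device_id string, event_ts timestamp, temperature double, temp_avg double"
    return (
        parsed_df.dropna(subset=["device_id", "temperature"])
        .select("device_id", "event_ts", "temperature")
        .groupBy("device_id")
        .applyInPandas(smooth_device, schema=schema)
    )
```

Keeping the time-series logic in a plain function such as `rolling_mean` means it can be unit-tested without a Spark session, while `smooth_device` stays a thin adapter that Spark calls once per device.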

Compaction and Z-ordering run on a daily schedule to merge small files and optimize data layout. The team reports more than a ten-fold query speedup and lower storage costs as a result.
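As a rough sketch of what such a maintenance job might configure, the helpers below build Hudi writer options for compaction and for clustering with a Z-order layout. The helper names, the sort columns, and the thresholds are assumptions for illustration; the article does not say which settings or scheduling mechanism Midea uses, and option availability depends on the Hudi version.

```python
def compaction_options(max_delta_commits=5):
    """Options that compact merge-on-read log files into base files."""
    return {
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": str(max_delta_commits),
    }


def zorder_clustering_options(sort_columns):
    """Options that rewrite small files and sort them along a Z-order curve."""
    if not sort_columns:
        raise ValueError("Z-ordering needs at least one sort column")
    return {
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "1",
        "hoodie.layout.optimize.strategy": "z-order",
        "hoodie.clustering.plan.strategy.sort.columns": ",".join(sort_columns),
    }


# Hypothetical usage inside a daily batch job that rewrites a table:
# df.write.format("hudi").options(**base_opts) \
#     .options(**zorder_clustering_options(["device_id", "event_ts"])) \
#     .mode("append").save(table_path)
```

Z-ordering pays off most when queries filter on several of the sort columns at once, so the columns would normally be picked from the most common query predicates.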

AI and Analytics

PySpark jobs on Serverless Spark, together with PyArrow UDFs, aggregate trillions of IoT records across millions of dimensions, enabling Data+AI use cases such as energy-consumption optimization and fault prediction. Processed metrics are loaded into StarRocks for dashboards and reporting. Jupyter Notebook integration lets data scientists develop and schedule PySpark jobs, and an OSS + MLflow + Serverless Spark stack supports MLOps workflows.
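The aggregate-then-publish step might look like the sketch below. It uses Spark's built-in aggregates rather than the PyArrow UDFs the article mentions, and it assumes the StarRocks Spark connector is on the classpath. The column names (`device_id`, `event_ts`, `energy_kwh`), the helper names, and the default frontend ports 8030 and 9030 are assumptions; a Serverless StarRocks instance may expose different endpoints.

```python
def starrocks_sink_options(fe_host, database, table, user, password):
    """Options for the StarRocks Spark connector (default FE ports assumed)."""
    return {
        "starrocks.fe.http.url": f"{fe_host}:8030",
        "starrocks.fe.jdbc.url": f"jdbc:mysql://{fe_host}:9030",
        "starrocks.table.identifier": f"{database}.{table}",
        "starrocks.user": user,
        "starrocks.password": password,
    }


def daily_energy_by_device(silver_df):
    """Roll silver-layer readings up to one row per device per day."""
    # Imported lazily so the options helper above can be used without Spark.
    from pyspark.sql import functions as F

    return silver_df.groupBy(
        "device_id", F.to_date("event_ts").alias("day")
    ).agg(
        F.sum("energy_kwh").alias("energy_kwh"),
        F.count("*").alias("samples"),
    )


def publish_metrics(metrics_df, options):
    """Append the aggregated metrics to a StarRocks table for dashboards."""
    metrics_df.write.format("starrocks").options(**options).mode("append").save()
```

A job would chain these as `publish_metrics(daily_energy_by_device(silver_df), starrocks_sink_options(...))`, with credentials supplied from a secret store rather than hard-coded.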

Why EMR Serverless Spark

Key pain points addressed include:

Eliminating costly, time‑consuming POC cluster provisioning.

Providing the performance needed for trillion‑scale IoT streams.

Supporting batch, streaming, interactive, and machine‑learning workloads within a unified Spark ecosystem.

Offering elastic compute that shortens data latency for monthly reports.

Delivering robust Data+AI capabilities.

Compared with the previous architecture, EMR Serverless Spark improves performance by more than 50% and reduces overall costs by roughly 30%.

Performance & Cost Benefits

The serverless model removes operational overhead, while the built-in Fusion engine, vectorized execution, and remote shuffle service (RSS) are reported to deliver more than three times the performance of open-source Spark. Separating compute from storage lowers costs further.

Conclusion

Midea Building Technology successfully built an IoT data processing platform on Alibaba Cloud EMR Serverless Spark, achieving high elasticity, strong AI support, and significant productivity gains. Future plans include deeper collaboration with Alibaba Cloud to deliver more industry‑specific IoT solutions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
