Using DeltaLake for Industrial Data Platforms: Distributed Stream Processing, Batch‑Stream Fusion, and Transactional Support

This article shares practical experience building an industrial data middle-platform with DeltaLake. It covers heterogeneous distributed stream handling, batch-stream unified analytics, and support for transactions and algorithms, with the goal of improving data timeliness, reliability, and operational efficiency in manufacturing environments.

Big Data Technology Architecture

Nov 24, 2020

Author Introduction

Zhan Huaimin (aka Xin Du) is a big-data engineer in Alibaba Cloud Digital Industry R&D who focuses on building data middle-platforms for industrial digital transformation using big data and AI.

Preface

Since the release of Cloudwise Industrial Brain 3.0 in 2020, the Industrial Brain has continued to evolve. This article presents best practices for using DeltaLake to build an industrial data middle-platform, covering:

Processing heterogeneous distributed stream messages

Batch‑stream unified data analysis

Support for transactions and algorithms

Processing Heterogeneous Distributed Stream Messages

Industrial enterprises often have data sources scattered across the world, and a group-level user expects the data middle-platform to aggregate them. DeltaLake on Spark was chosen over Flink, Flume, and similar tools because this stack can subscribe to multiple Kafka topics by regular expression, supports HDFS natively, merges small files, and simplifies real-time writes to HDFS.

Key advantages:

SubscribePattern (the subscribePattern option of Spark's Kafka source) enables regex-based consumption of many Kafka topics simultaneously.

Native HDFS support eliminates the need for a separate Flink HDFS sink or Flume cluster.

DeltaLake's ACID guarantees keep writes consistent while rolling-write thresholds are applied. Combined with Optimize, which compacts small files, and Vacuum, which removes files the table no longer references, this keeps the file count under control and improves HDFS performance.
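The ingestion path described above can be sketched roughly as follows. This is a minimal illustration, not the Industrial Brain implementation: the function names, topic names, column aliases, and the one-minute trigger are assumptions, and `spark` is an existing SparkSession with the Kafka and Delta packages available. The `OPTIMIZE` statement is only available in Delta distributions that ship it (for example EMR or newer open-source releases), so treat it as environment-dependent.

```python
import re


def matching_topics(pattern, topics):
    """Preview which topic names a subscribePattern regex would select.

    Kafka matches the pattern (a Java regex) against the whole topic name;
    Python's re.fullmatch behaves the same way for simple patterns.
    """
    return [t for t in topics if re.fullmatch(pattern, t)]


def maintenance_statements(table_path, retain_hours=168):
    """SQL for compacting small files and purging unreferenced ones.

    retain_hours=168 (7 days) is an illustrative retention window.
    """
    target = f"delta.`{table_path}`"
    return [
        f"OPTIMIZE {target}",
        f"VACUUM {target} RETAIN {retain_hours} HOURS",
    ]


def start_ingest(spark, servers, pattern, table_path, checkpoint_path):
    """Stream every Kafka topic matching `pattern` into one Delta table."""
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribePattern", pattern)  # regex over topic names
        .option("startingOffsets", "latest")
        .load()
    )
    # Keep the source topic so rows from different sites stay distinguishable.
    events = raw.selectExpr(
        "topic", "CAST(value AS STRING) AS payload", "timestamp AS kafka_ts"
    )
    return (
        events.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")  # hypothetical micro-batch interval
        .start(table_path)
    )
```

A scheduled job could then run each string returned by `maintenance_statements(...)` through `spark.sql(...)` to compact and clean the table while the stream keeps writing.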

Batch-Stream Unified Data Analysis

In manufacturing, sensor time-series data is streamed from Kafka, processed, and written to OLAP stores (e.g., Alibaba Cloud ADB, TSDB, HBase) for low-latency queries. Because sensor data may be inaccurate when it first arrives, a "rolling overlay" approach is used: a real-time incremental zone receives data as it arrives, and a periodic correction zone later overlays it with corrected values (orange and blue, respectively, in the original diagram). The traditional Lambda architecture needs separate stream and batch engines for this, which leads to duplicated code and maintenance overhead.

By adopting DeltaLake with Spark’s native stream‑batch integration, the same Spark job can handle both real‑time ingestion and periodic correction, leveraging DeltaLake’s ACID, Optimize, and Time‑Travel features. Alibaba Cloud EMR further wraps SparkSQL/Streaming with a SQL layer similar to Flink SQL, lowering the development barrier.
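One way to picture the "same job for both paths" idea is a single transformation function shared by a streaming append and a periodic batch overwrite. The sketch below assumes a hypothetical schema (`sensor_id`, `event_time`, `value`) and a table partitioned by a derived `event_date` column; the function names are illustrative, and both DataFrames are assumed to be created elsewhere in the job.

```python
from datetime import date


def clean_readings(df):
    """Transformation shared by the streaming and the batch path."""
    return df.where("value IS NOT NULL").selectExpr(
        "sensor_id",
        "event_time",
        "CAST(value AS DOUBLE) AS value",
        "CAST(event_time AS DATE) AS event_date",
    )


def correction_predicate(start, end):
    """Predicate selecting the half-open date window [start, end)."""
    return (
        f"event_date >= '{start.isoformat()}' "
        f"AND event_date < '{end.isoformat()}'"
    )


def stream_increment(readings_stream, table_path, checkpoint_path):
    """Real-time zone: append cleaned readings as they arrive."""
    return (
        clean_readings(readings_stream)
        .writeStream.format("delta")
        .outputMode("append")
        .partitionBy("event_date")
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )


def correct_window(corrected_batch, table_path, start, end):
    """Correction zone: atomically replace one date window with fixed data."""
    predicate = correction_predicate(start, end)
    (
        clean_readings(corrected_batch)
        .where(predicate)  # written rows must satisfy replaceWhere
        .write.format("delta")
        .mode("overwrite")
        .partitionBy("event_date")
        .option("replaceWhere", predicate)
        .save(table_path)
    )
```

Because the overwrite is a single Delta transaction, readers see either the old window or the corrected one, never a mix, and the streaming append can continue on newer dates.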

Transaction Handling and Algorithm Support

Industrial scheduling and planning require transactional guarantees and frequent merging of data from multiple systems (ERP, WMS, MES). Previously, change data capture (CDC) or polling fed this data into a relational database (RDB), and the scheduling engine then queried the RDB for its computations, which caused scalability and latency problems.

The new architecture replaces the RDB with HDFS + Spark + DeltaLake:

HDFS + Spark stores large‑scale data, solving storage bottlenecks.

Spark Streaming + DeltaLake handles real‑time ingestion, using DeltaLake’s MERGE and Optimize to maintain data freshness and performance.

Scheduling algorithms are packaged as Spark jobs, executed on the compute platform, benefiting from Spark ML and Python support for iterative planning.

DeltaLake’s Time‑Travel enables version management and rollback, aiding model debugging and evaluation.
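The MERGE and Time-Travel points above might look like the following in a Spark job. This is a sketch under assumptions: the key columns, paths, and function names are hypothetical, `changes_stream` is a streaming DataFrame of change records from the upstream systems, and the `delta` Python package is installed on the cluster.

```python
def merge_condition(keys, target="t", source="s"):
    """Build the join condition for a MERGE on the given key columns."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert_batch(spark, table_path, keys):
    """Return a foreachBatch handler that upserts each micro-batch."""
    # Imported lazily so merge_condition can be used without Delta installed.
    from delta.tables import DeltaTable

    condition = merge_condition(keys)

    def _upsert(batch_df, batch_id):
        # MERGE requires at most one source row per target row. This keeps
        # an arbitrary row per key; a real job would pick the newest change
        # using a timestamp or sequence column.
        deduped = batch_df.dropDuplicates(keys)
        (
            DeltaTable.forPath(spark, table_path)
            .alias("t")
            .merge(deduped.alias("s"), condition)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    return _upsert


def start_upsert_stream(spark, changes_stream, table_path, checkpoint_path, keys):
    """Continuously merge change records into the Delta table."""
    return (
        changes_stream.writeStream
        .foreachBatch(upsert_batch(spark, table_path, keys))
        .option("checkpointLocation", checkpoint_path)
        .start()
    )


def read_version(spark, table_path, version):
    """Time travel: load the table as it was at a given commit version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(table_path)
    )
```

A scheduling run could record the table version it read, so that a later evaluation can call `read_version(...)` with that number and reproduce the exact input the algorithm saw.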

Summary

1) DeltaLake's ACID capabilities greatly benefit algorithmic applications that are sensitive to both latency and accuracy.

2) Combining DeltaLake Optimize + Vacuum with streaming ingestion improves compatibility with massive Kafka feeds and reduces operational costs.

3) Alibaba Cloud EMR's Streaming SQL lowers development and maintenance effort for large-scale data-platform projects.

DeltaLake is still experimental in the Industrial Brain, but its use in streaming ingestion, scheduling engines, and batch-stream fusion is being standardized and will soon extend to other manufacturing sectors with AI assistance.

For more information, see https://www.aliyun.com/solution/industry/home
