Insights from the Real-Time Big Data Meetup: Spark Structured Streaming Overview
The meetup on September 8, co‑hosted by InfoQ and Huawei Cloud, featured Databricks engineer Tathagata Das explaining Spark Structured Streaming’s concepts, fault‑tolerance, performance, event‑time handling, and real‑world use cases such as Apple’s security platform, highlighting its scalability and integration with various data sources.
To help developers better understand three open-source big-data technologies and their practical scenarios, InfoQ and Huawei Cloud jointly held a real-time big-data meetup on September 8, bringing together expert speakers from Databricks, Huawei, and Meituan Dianping.
Tathagata Das (TD), a Databricks engineer and core developer of Spark Structured Streaming, opened with an introduction to Structured Streaming's basic concepts, covered its handling of storage, automatic stream processing, fault tolerance, performance, and event-time processing, and presented several real-world use cases.
TD first laid out the problems and concepts of stream processing, noting that complexity on three fronts makes robust pipelines hard to build.
The data is complex: it arrives in various formats (JSON, Avro, binary) and may be dirty, late, or out of order.
The workloads are complex: event-time processing has to be combined with interactive queries and machine learning.
The systems are complex: different storage systems and formats (SQL, NoSQL, Parquet, etc.) must be supported, and failures must be tolerated.
Because it runs on the Spark SQL engine, Spark Structured Streaming naturally inherits Spark’s performance, scalability, and fault‑tolerance, along with a rich, unified high‑level API that simplifies handling complex data and workflows, and benefits from Spark’s extensive ecosystem.
A stream is modeled as an unbounded table to which new data is continually appended. A query over it can be broken into steps, such as reading JSON from Kafka, parsing it, and storing it in a structured Parquet table, with end-to-end fault tolerance guaranteed. Its features include:
Support for multiple sources, e.g., files, Kafka, and Kinesis.
Ability to join() or union() different data sources.
The source step returns a DataFrame that represents an unbounded table.
Choice of SQL (BI analysis), DataFrame (data science), or Dataset (data engineering) APIs with nearly identical semantics and performance.
Casting of Kafka JSON records to strings and parsing them into nested columns with optimized built-in functions such as from_json(), with support for custom functions such as lambdas and flatMap.
The sink step can write to external storage such as Parquet or Kafka; foreach handles arbitrary output processing, and transactional sinks allow exactly-once semantics.
Fixed-interval micro-batch processing for high throughput, low-latency continuous processing (Spark 2.3), and checkpointing to track progress.
Structured data from the Kafka source is processed within seconds and is immediately available for querying.
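The read, parse, and write steps listed above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the code shown in the talk: the function name build_etl_query, its parameters, the one-minute trigger, and the schema fields (device, signal, time) are all hypothetical, and running it requires a Spark installation with the Kafka connector package and a reachable broker.

```python
def build_etl_query(spark, brokers, topic, out_path, checkpoint_path):
    """Start a streaming query: Kafka JSON -> nested columns -> Parquet table."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (
        DoubleType, StringType, StructField, StructType, TimestampType,
    )

    # Hypothetical schema of the JSON payload; adjust to the real records.
    schema = StructType([
        StructField("device", StringType()),
        StructField("signal", DoubleType()),
        StructField("time", TimestampType()),
    ])

    # Source: returns a DataFrame that represents an unbounded table.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", brokers)
        .option("subscribe", topic)
        .load()
    )

    # Transform: cast the binary Kafka value to a string, parse the JSON
    # into a nested column, then flatten it into top-level columns.
    parsed = (
        raw.select(from_json(col("value").cast("string"), schema).alias("data"))
        .select("data.*")
    )

    # Sink: append to a Parquet table, tracking progress in a checkpoint.
    return (
        parsed.writeStream.format("parquet")
        .option("path", out_path)
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")
        .start()
    )
```

Given an existing SparkSession named spark, a call such as build_etl_query(spark, "host:9092", "events", "/tmp/events", "/tmp/events-checkpoint") would return a running StreamingQuery; the broker address, topic, and paths here are placeholders.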
Spark SQL converts the batch-style query into a series of incremental execution plans, each of which processes only the newly arrived batch of data.
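The idea of incremental execution can be illustrated without Spark. The sketch below is a plain-Python toy model, not Spark's implementation: the names batch_count and IncrementalCount are hypothetical, and the "query" is just a count of events per device. It checks that folding each micro-batch into running state gives the same answer as re-running the batch query over everything seen so far.

```python
import json
from collections import Counter

def parse(raw):
    """Parse one raw JSON record (bytes) into a dict."""
    return json.loads(raw.decode("utf-8"))

def batch_count(raw_records):
    """Batch-style query: count events per device over the whole table."""
    return Counter(parse(r)["device"] for r in raw_records)

class IncrementalCount:
    """Toy incremental plan: keeps running state and reads only new rows."""

    def __init__(self):
        self.state = Counter()

    def process_batch(self, new_raw_records):
        self.state.update(parse(r)["device"] for r in new_raw_records)
        return Counter(self.state)  # snapshot of the result table

# Rows appended to the "unbounded table", grouped by arrival (hypothetical data).
micro_batches = [
    [b'{"device": "a", "signal": 1}', b'{"device": "b", "signal": 2}'],
    [b'{"device": "a", "signal": 3}'],
    [],
    [b'{"device": "c", "signal": 4}', b'{"device": "a", "signal": 5}'],
]

query = IncrementalCount()
seen = []
for batch in micro_batches:
    seen.extend(batch)
    result = query.process_batch(batch)
    # The incremental result always matches the batch query over all data so far.
    assert result == batch_count(seen)

final_counts = dict(result)
print(final_counts)
```

The incremental version never re-reads old rows, which is what lets the same query logic serve both a one-off batch run and a continuously updated result.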
For fault tolerance, Structured Streaming uses checkpointing: it writes the offsets of each batch to stable storage in JSON format, so that a query can recover from the point of failure and, given replayable sources and idempotent sinks, guarantee exactly-once semantics.
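A toy model may help show why recording offsets before processing makes recovery safe. The sketch below is plain Python and does not reproduce Spark's actual checkpoint layout: SOURCE, load_offsets, save_offsets, and run are hypothetical names, a list stands in for a replayable source, and a dict keyed by batch id stands in for an idempotent sink.

```python
import json
import os
import tempfile

SOURCE = [f"event-{i}" for i in range(10)]  # stands in for a replayable source

def load_offsets(path):
    """Return the last planned batch, or None when no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

def save_offsets(path, batch):
    """Atomically persist the planned offset range as JSON."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(batch, f)
    os.replace(tmp, path)

def run(path, sink, batch_size=4, crash_in_batch=None):
    """Process SOURCE in micro-batches, optionally failing inside one batch."""
    last = load_offsets(path)
    if last is None:
        batch_id, start = 0, 0
    elif last["batch_id"] in sink:
        # The planned batch was committed: continue after it.
        batch_id, start = last["batch_id"] + 1, last["end"]
    else:
        # The planned batch never reached the sink: replay the same range.
        batch_id, start = last["batch_id"], last["start"]
    while start < len(SOURCE):
        end = min(start + batch_size, len(SOURCE))
        # Write-ahead: record the offsets before touching the data.
        save_offsets(path, {"batch_id": batch_id, "start": start, "end": end})
        if batch_id == crash_in_batch:
            raise RuntimeError("simulated failure")
        sink[batch_id] = SOURCE[start:end]  # idempotent: keyed by batch id
        batch_id, start = batch_id + 1, end

with tempfile.TemporaryDirectory() as d:
    checkpoint = os.path.join(d, "offsets.json")
    sink = {}
    try:
        run(checkpoint, sink, crash_in_batch=1)  # fails after planning batch 1
    except RuntimeError:
        pass
    run(checkpoint, sink)  # restart: replays batch 1, then continues
    delivered = [e for batch_id in sorted(sink) for e in sink[batch_id]]

print(delivered == SOURCE)
```

Because the restarted run reads the same offset range for the failed batch and the sink overwrites by batch id, every event ends up in the output once, with none lost or duplicated.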
In terms of performance, Structured Streaming reuses the Spark SQL optimizer and the Tungsten engine, reducing cost by a factor of three; more details are available in the author's blog.
Structured Streaming keeps the processing logic separate from the execution mode, which is set through configurable options (e.g., on a custom JSON input), so the same query switches easily between batch and streaming execution. TD also compared the latency, throughput, and resource allocation of the batch, micro-batch, and continuous streaming modes.
For time windows, Structured Streaming offers event-time aggregation and user-defined aggregate functions (UDAFs). State is kept in memory and backed by write-ahead logs on HDFS. Late data is handled automatically, and watermarks are used to prune old state. Custom stateful functions are also supported through the Scala and Java APIs.
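How a watermark bounds state can be shown with a small model. The sketch below is plain Python, not Spark's engine: WindowedCounter, WINDOW, and DELAY are hypothetical names, windows are 10-second tumbling windows, and the watermark is updated per event for simplicity, whereas Spark advances it between micro-batches.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window length, in seconds of event time
DELAY = 5    # how far behind the newest event time data may still arrive

class WindowedCounter:
    """Toy event-time window count with watermark-based state pruning."""

    def __init__(self):
        self.state = defaultdict(int)  # window start -> count (open windows)
        self.finalized = {}            # window start -> final count
        self.dropped = []              # events that arrived too late
        self.max_event_time = 0

    def watermark(self):
        return self.max_event_time - DELAY

    def process(self, event_time):
        start = (event_time // WINDOW) * WINDOW
        if start + WINDOW <= self.watermark():
            # The window has already been finalized and its state pruned.
            self.dropped.append(event_time)
            return
        self.state[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Prune every window that ends at or before the watermark.
        for s in sorted(self.state):
            if s + WINDOW <= self.watermark():
                self.finalized[s] = self.state.pop(s)

counter = WindowedCounter()
# Event times in arrival order; 8 is late but accepted, 3 and 14 are too late.
for t in [1, 4, 12, 8, 16, 3, 27, 14, 31]:
    counter.process(t)

print(counter.finalized)    # {0: 3, 10: 2}
print(dict(counter.state))  # {20: 1, 30: 1}
print(counter.dropped)      # [3, 14]
```

The event at time 8 arrives after the event at time 12 but is still counted, because the watermark has not yet passed the end of its window. Only the two still-open windows remain in state at the end, which is the point of pruning: memory stays bounded however long the stream runs.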
TD illustrated streaming applications, such as Apple’s security platform that generates millions of events per second, where Structured Streaming is used for defect detection; the architecture is shown below:
In this architecture, raw logs are ETL-ed into a structured log store for rapid disaster recovery. The platform can connect to other data sources (DHCP sessions, slowly changing data) and supports mixed workloads, including real-time alerts, historical reports, and ad-hoc analysis, through unified APIs, while processing millions of events per second.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.