New Features in Apache Spark 2.3: Continuous Streaming, Kubernetes Scheduler, Pandas UDFs, and MLlib Enhancements
Apache Spark 2.3 introduces major upgrades such as millisecond‑latency continuous streaming, stream‑to‑stream joins, a native Kubernetes scheduler backend, accelerated Pandas UDFs, and several MLlib improvements, all aimed at making big‑data processing faster, easier, and smarter.
Millisecond‑Latency Continuous Streaming
Structured Streaming, introduced in Spark 2.0, decoupled its high-level APIs from the underlying micro-batch execution engine. Spark 2.3 builds on that separation with an experimental continuous mode that processes data record by record, achieving end-to-end latency measured in milliseconds.
In continuous mode the source reader pulls data continuously instead of waiting for a trigger interval, so new records are processed as soon as they arrive. The mode currently supports map-like Dataset operations such as projections and selections; aggregations and time functions such as current_timestamp() and current_date() are not yet supported. Kafka is supported as both source and sink, and the console and memory sinks are also available.
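Developers can now choose between micro-batch and continuous processing based on their latency requirements while keeping Structured Streaming's fault tolerance. The trade-off is in the delivery guarantee: micro-batch mode can provide exactly-once processing, whereas continuous mode provides at-least-once.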
Continuous mode offers:
End-to-end millisecond latency
At-least-once delivery semantics
Support for map-like Dataset operations
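As a rough illustration of how the mode is selected, the PySpark sketch below copies one Kafka topic to another using a continuous trigger. The function name, broker address, topic names, and checkpoint path are placeholders invented for this example, and the caller is assumed to supply a SparkSession with the Kafka connector available. The only difference from a micro-batch query is the trigger; the interval passed to it controls how often progress is checkpointed, not how often records are processed.

```python
def start_continuous_copy(spark, bootstrap_servers, in_topic, out_topic, checkpoint_dir):
    """Start a continuous-mode query that copies one Kafka topic to another.

    All arguments are illustrative placeholders supplied by the caller.
    """
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", in_topic)
        .load()
    )

    # Only map-like operations (projections, selections) are allowed here;
    # aggregations are not supported in continuous mode.
    projected = events.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )

    return (
        projected.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", out_topic)
        .option("checkpointLocation", checkpoint_dir)
        # Continuous trigger: the interval is the checkpoint interval.
        .trigger(continuous="1 second")
        .start()
    )
```

Swapping the trigger for trigger(processingTime="1 second") would run the same query in micro-batch mode.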
Stream‑to‑Stream Joins
Spark 2.3 adds native support for joining two streaming DataFrames/Datasets, enabling both inner and outer joins for real‑time use cases such as ad‑revenue analysis where impression and click streams share a common key.
To support such joins, Spark:
Buffers delayed records as streaming state until a matching event arrives
Uses watermarks to bound how much state is kept
Lets users trade off resource usage against latency
Keeps SQL join semantics consistent between static and streaming DataFrames
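A minimal PySpark sketch of the ad-revenue case follows. It assumes two streaming DataFrames are passed in, one with columns impressionAdId and impressionTime and one with clickAdId and clickTime; those column names, the view names, the watermark delays, and the one-hour matching window are all assumptions made for illustration. Each stream gets a watermark so Spark can discard buffered rows that can no longer match, and the join itself is written as ordinary SQL, the same text one would use on static tables.

```python
# Join condition: same ad, and the click happens within one hour
# after the impression. The time bound lets Spark expire old state.
JOIN_SQL = """
SELECT *
FROM impressions
JOIN clicks
  ON clickAdId = impressionAdId
 AND clickTime >= impressionTime
 AND clickTime <= impressionTime + interval 1 hour
"""


def join_impressions_with_clicks(spark, impressions, clicks):
    """Inner-join two streaming DataFrames on ad id within a time window.

    `impressions` and `clicks` are assumed to be streaming DataFrames
    with the (hypothetical) columns described above.
    """
    # Watermarks tell Spark how late events may arrive on each side,
    # which bounds how long unmatched rows are kept in state.
    (impressions
        .withWatermark("impressionTime", "2 hours")
        .createOrReplaceTempView("impressions"))
    (clicks
        .withWatermark("clickTime", "3 hours")
        .createOrReplaceTempView("clicks"))

    # The result is itself a streaming DataFrame.
    return spark.sql(JOIN_SQL)
```

Longer watermark delays tolerate later data at the cost of more buffered state, which is the resource-versus-latency trade-off listed above.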
Apache Spark and Kubernetes
Spark 2.3 introduces a new, still experimental, Kubernetes scheduler backend. With it, spark-submit can launch an application's driver and executors as pods on a Kubernetes cluster, where they share resources with other workloads. The integration also lets Spark take advantage of Kubernetes features such as resource quotas, pluggable authorization, and logging.
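To make the submission path concrete, the sketch below assembles a spark-submit command line that targets a Kubernetes master. The helper name is made up for this example, and the API server address, container image, jar path, and main class shown in the sample call are hypothetical placeholders, not values from a real cluster. The flags themselves (a k8s:// master URL, cluster deploy mode, and the spark.kubernetes.container.image setting) follow the Spark 2.3 Kubernetes documentation.

```python
def build_k8s_submit_command(api_server, image, app_jar, main_class, executors=2):
    """Return a spark-submit argument list for a Kubernetes cluster.

    The result can be handed to subprocess.run() or joined for display.
    """
    return [
        "bin/spark-submit",
        # The k8s:// prefix selects the Kubernetes scheduler backend.
        "--master", "k8s://" + api_server,
        # The driver also runs in a pod inside the cluster.
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", "spark.executor.instances={}".format(executors),
        "--conf", "spark.kubernetes.container.image=" + image,
        app_jar,
    ]


# Sample call with placeholder values.
command = build_k8s_submit_command(
    api_server="https://k8s.example.com:6443",
    image="example/spark:2.3.0",
    app_jar="local:///opt/spark/examples/jars/example-app.jar",
    main_class="com.example.Main",
    executors=5,
)
```

The local:// scheme in the sample jar path refers to a file already present inside the container image.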
Support for Pandas UDFs in PySpark
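Pandas UDFs (also called vectorized UDFs) use Apache Arrow to move data between the JVM and Python with low overhead, providing high-performance user-defined functions written in Python. Instead of being called once per row, a Pandas UDF receives and returns whole pandas Series or DataFrames. Spark 2.3 supports both scalar and grouped map Pandas UDFs, which deliver significant speedups over traditional row-at-a-time UDFs.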
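A small sketch of a scalar Pandas UDF is shown below. The temperature conversion and both function names are invented for illustration. The vectorized logic is kept as a plain function so it can be exercised without a Spark session; wrapping it requires PySpark with pandas and PyArrow installed, which is why that import is deferred to the wrapper.

```python
def celsius_to_fahrenheit(temps):
    """Vectorized conversion.

    Inside Spark, `temps` arrives as a pandas Series holding a whole
    batch of values, so the arithmetic runs once per batch, not per row.
    """
    return temps * 9.0 / 5.0 + 32.0


def as_scalar_pandas_udf(func, return_type="double"):
    """Wrap a Series-to-Series function as a scalar Pandas UDF."""
    # Imported here so the plain function above stays usable
    # in environments where PySpark is not installed.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    return pandas_udf(f=func, returnType=return_type,
                      functionType=PandasUDFType.SCALAR)
```

With a DataFrame df that has a numeric column named celsius (a hypothetical name), the wrapped function would be applied as df.select(as_scalar_pandas_udf(celsius_to_fahrenheit)(df["celsius"])).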
MLlib Enhancements
Spark 2.3 brings several MLlib improvements, including the ability to use fitted models and pipelines inside Structured Streaming jobs, the new ImageSchema for representing images in DataFrames, and an enhanced Python API for custom algorithms.
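The streaming support can be sketched in a few lines: a model or pipeline that was fitted on static data is applied to a streaming DataFrame with the same transform call used in batch jobs. The function name is hypothetical, and the caller is assumed to provide an already fitted model (for example a PipelineModel) and a streaming DataFrame whose columns match what the model expects. The console sink is used here only to keep the example short.

```python
def score_stream(fitted_model, streaming_df):
    """Apply a fitted MLlib model to a streaming DataFrame.

    Returns the started streaming query, which prints predictions
    to the console as new data arrives.
    """
    # Same call as in a batch job; the result is a streaming DataFrame.
    predictions = fitted_model.transform(streaming_df)

    return (
        predictions.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )
```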
These updates collectively make Spark 2.3 a more powerful platform for large‑scale, low‑latency data processing and machine‑learning workloads.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.