
New Features in Apache Spark 2.3: Continuous Streaming, Kubernetes Scheduler, Pandas UDFs, and MLlib Enhancements

Apache Spark 2.3 introduces major upgrades such as millisecond‑latency continuous streaming, stream‑to‑stream joins, a native Kubernetes scheduler backend, accelerated Pandas UDFs, and several MLlib improvements, all aimed at making big‑data processing faster, easier, and smarter.

Qunar Tech Salon

Millisecond‑Latency Continuous Streaming

Structured Streaming, introduced in Spark 2.0, decoupled its high-level APIs from the underlying micro-batch execution engine. Spark 2.3 builds on that separation by adding an experimental continuous mode that processes data record by record, with end-to-end latency measured in milliseconds.

In continuous mode, the source reader pulls data continuously instead of waiting for a trigger interval, so new records are processed as soon as they arrive. Only map-like Dataset operations such as projections and selections are supported; aggregations and time functions such as current_timestamp() and current_date() are not yet supported. Kafka is supported as both source and sink, and the console and memory sinks are available as well.

Developers can now choose between micro-batch and continuous processing based on latency requirements while keeping the same Structured Streaming APIs and checkpoint-based fault tolerance. Note that continuous mode provides at-least-once rather than exactly-once delivery.

End‑to‑end millisecond latency

At‑least‑once delivery semantics

Support for map‑like Dataset operations
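As a minimal sketch of how a query opts into continuous mode from PySpark, the function below copies records between two Kafka topics. The function name, broker address, topic names, and checkpoint path are illustrative assumptions, and the Kafka connector package is assumed to be on the classpath; the part that matters is the `trigger(continuous=...)` call on the writer.

```python
def start_continuous_query(spark, bootstrap_servers, in_topic, out_topic, checkpoint_dir):
    """Copy records from one Kafka topic to another using continuous processing.

    `spark` is an existing SparkSession; all other arguments are
    caller-supplied values (hypothetical in this sketch).
    """
    source = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", in_topic)
        .load()
    )
    # Only map-like operations (projections, selections) are allowed here;
    # an aggregation at this point would be rejected in continuous mode.
    projected = source.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )
    return (
        projected.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", out_topic)
        .option("checkpointLocation", checkpoint_dir)
        # The interval is how often progress is checkpointed,
        # not a batch interval.
        .trigger(continuous="1 second")
        .start()
    )
```

Replacing the trigger line with `.trigger(processingTime="1 second")` would run the same query in micro-batch mode, which is what makes switching between the two a configuration choice rather than a rewrite.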

Stream‑to‑Stream Joins

Spark 2.3 adds native support for joining two streaming DataFrames/Datasets, enabling both inner and outer joins for real‑time use cases such as ad‑revenue analysis where impression and click streams share a common key.

Cache delayed records until a matching event arrives

Use watermarking to bound state size

Allow trade‑offs between resource usage and latency

Maintain consistent SQL semantics between static and streaming joins
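The buffering and watermarking ideas listed above can be illustrated with a toy, in-memory model in plain Python. This is not Spark's implementation or API: the class name, the single shared watermark, and the numeric event times are simplifying assumptions chosen to show why buffered state stays bounded.

```python
from collections import defaultdict


class ToyStreamJoin:
    """Toy inner join of impression and click events on ad_id.

    Illustrative model only; this is not how Spark implements its join.
    """

    def __init__(self, max_delay_s):
        self.max_delay_s = max_delay_s    # how late an event may arrive
        self.max_seen = float("-inf")     # largest event time observed
        self.watermark = float("-inf")    # events older than this are dropped
        self.state = {"impression": defaultdict(list), "click": defaultdict(list)}

    def add(self, side, ad_id, event_time):
        """Ingest one event and return the join rows it completes."""
        if event_time < self.watermark:
            return []  # too late: possible partners were already evicted
        other = "click" if side == "impression" else "impression"
        partners = self.state[other].get(ad_id, [])
        if side == "impression":
            matches = [(ad_id, event_time, t) for t in partners]
        else:
            matches = [(ad_id, t, event_time) for t in partners]
        # Buffer the event so that a later partner can still find it.
        self.state[side][ad_id].append(event_time)
        self._advance_watermark(event_time)
        return matches

    def _advance_watermark(self, event_time):
        """Move the watermark forward and evict state that fell behind it."""
        self.max_seen = max(self.max_seen, event_time)
        self.watermark = self.max_seen - self.max_delay_s
        for buffered in self.state.values():
            for key in list(buffered):
                kept = [t for t in buffered[key] if t >= self.watermark]
                if kept:
                    buffered[key] = kept
                else:
                    del buffered[key]

    def state_size(self):
        """Number of events currently buffered across both sides."""
        return sum(len(times) for buf in self.state.values() for times in buf.values())


join = ToyStreamJoin(max_delay_s=10)
join.add("impression", "ad1", 100)
print(join.add("click", "ad1", 105))  # [('ad1', 100, 105)]
```

A larger `max_delay_s` tolerates later events at the cost of more buffered state, which is the resource-versus-latency trade-off mentioned above. In Spark itself, watermarks are declared per stream with `withWatermark`, and the join condition usually carries an additional time-range constraint.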

Apache Spark and Kubernetes

Spark 2.3 introduces a new, experimental Kubernetes scheduler backend. Spark applications can be submitted directly to a Kubernetes cluster, where the driver and executors run as pods and share resources with other workloads. The integration also lets Spark benefit from Kubernetes features such as resource quotas, pluggable authorization, and logging.
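For orientation, a submission against the Kubernetes backend is an ordinary `spark-submit` call whose master URL starts with `k8s://`. The helper below is a sketch that only assembles such a command as an argument list and does not run it. The helper name, API server address, and image name are hypothetical placeholders, and the jar path assumes the examples jar is already inside the container image.

```python
def k8s_submit_command(api_server, image, app_jar, main_class, executors=2):
    """Build (but do not run) a spark-submit command for the Kubernetes backend."""
    return [
        "bin/spark-submit",
        "--master", "k8s://" + api_server,   # k8s:// prefix selects the new backend
        "--deploy-mode", "cluster",          # the driver also runs in a pod
        "--name", "spark-pi",
        "--class", main_class,
        "--conf", "spark.executor.instances={}".format(executors),
        "--conf", "spark.kubernetes.container.image=" + image,
        app_jar,
    ]


# Hypothetical values, for illustration only.
command = k8s_submit_command(
    api_server="https://k8s.example.com:6443",
    image="example/spark:2.3.0",
    app_jar="local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar",
    main_class="org.apache.spark.examples.SparkPi",
    executors=3,
)
print(" ".join(command))
```

The `local://` scheme tells Spark that the jar is already present in the image rather than on the submitting machine.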

Support for Pandas UDFs in PySpark

Pandas UDFs (also called vectorized UDFs) leverage Apache Arrow to provide low‑overhead, high‑performance user‑defined functions written in Python. Spark 2.3 supports both scalar and grouped‑map Pandas UDFs, delivering significant speedups over traditional row‑by‑row UDFs.
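The two UDF types can be sketched as follows, using the Spark 2.3-era `pandas_udf` and `PandasUDFType` names. The column names `id` and `v` and the function names are assumptions made for illustration. The logic is written as plain functions over pandas objects so it can be tried without a cluster, and is wrapped into UDFs only inside `apply_pandas_udfs`, which needs PySpark, pandas, and PyArrow installed.

```python
def plus_one(v):
    """Scalar logic: receives a pandas.Series, returns a Series of equal length."""
    return v + 1


def subtract_mean(pdf):
    """Grouped-map logic: receives all rows of one group as a pandas.DataFrame."""
    return pdf.assign(v=pdf.v - pdf.v.mean())


def apply_pandas_udfs(df):
    """Apply both UDFs to a Spark DataFrame with columns id (long) and v (double)."""
    # Imported here so the plain functions above can be tested without Spark.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    plus_one_udf = pandas_udf(plus_one, "double", PandasUDFType.SCALAR)
    subtract_mean_udf = pandas_udf(
        subtract_mean, "id long, v double", PandasUDFType.GROUPED_MAP
    )

    # Scalar UDF: used like any column expression.
    with_plus_one = df.withColumn("v_plus_one", plus_one_udf(df.v))
    # Grouped-map UDF: applied once per group; the output schema is declared above.
    centered = df.groupby("id").apply(subtract_mean_udf)
    return with_plus_one, centered
```

Because each call receives a whole batch or group rather than a single row, the per-row Python call and serialization overhead is amortized, which is where the speedup over row-by-row UDFs comes from.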

MLlib Enhancements

Spark 2.3 brings several MLlib improvements, including the ability to use fitted models and pipelines inside Structured Streaming jobs, the new ImageSchema for representing images in DataFrames, and an enhanced Python API for custom algorithms.
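As a rough sketch of the first point, a model or pipeline fitted offline on static data can be applied to a streaming DataFrame with the same `transform` call used in batch jobs. The function name, the JSON file source, and the `predictions` query name below are assumptions, and not every transformer is guaranteed to accept streaming input.

```python
def score_stream(spark, fitted_model, input_dir, schema):
    """Apply an already-fitted model or PipelineModel to a streaming DataFrame.

    `schema` describes the incoming JSON records (for example a StructType);
    `input_dir` is a directory that receives new JSON files over time.
    """
    events = spark.readStream.schema(schema).json(input_dir)
    # Same call as in a batch job; the result is itself a streaming DataFrame.
    scored = fitted_model.transform(events)
    return (
        scored.writeStream
        .format("memory")            # in-memory table, convenient for debugging
        .queryName("predictions")    # table name to query with spark.sql
        .outputMode("append")
        .start()
    )
```

While the query runs, `spark.sql("SELECT * FROM predictions")` would show the rows scored so far.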

These updates collectively make Spark 2.3 a more powerful platform for large‑scale, low‑latency data processing and machine‑learning workloads.
