
New Features in Apache Spark 2.3: Continuous Streaming, Kubernetes Scheduler, Pandas UDFs, and MLlib Enhancements

Apache Spark 2.3 adds an experimental continuous processing mode with millisecond-level latency, stream-to-stream joins, a native Kubernetes scheduler backend, Arrow-based Pandas UDFs for PySpark, and several MLlib improvements.

Qunar Tech Salon

Mar 9, 2018

Millisecond‑Latency Continuous Streaming

Structured Streaming, introduced in Spark 2.0, decoupled its high-level APIs from micro-batch processing. Spark 2.3 builds on that separation with an experimental continuous mode that processes data record by record, bringing end-to-end latency down to the millisecond range.

In continuous mode the source reader pulls data continuously instead of waiting for a trigger interval, so new records are processed as soon as they arrive. The mode supports map-like Dataset operations such as projections and selections. Aggregations and time-dependent functions such as current_timestamp() and current_date() are not yet supported. Kafka is supported as both source and sink, and the console and memory sinks are also available.

Developers can now choose between micro-batch and continuous processing based on latency requirements. Both run on the same Structured Streaming API and checkpoint-based fault tolerance; continuous mode offers:

End‑to‑end millisecond latency

At‑least‑once delivery semantics

Support for map‑like Dataset operations
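Switching a query to continuous mode comes down to the trigger passed to the stream writer. The following is a minimal PySpark sketch rather than a reference implementation: the function name, topic names, broker address, and checkpoint path are assumptions made for illustration, `spark` is an existing SparkSession, and the Kafka connector package is assumed to be on the classpath.

```python
def start_continuous_copy(spark, bootstrap_servers, in_topic, out_topic,
                          checkpoint_dir):
    """Copy records between Kafka topics using the continuous trigger.

    `spark` is an existing SparkSession; every other argument is a
    placeholder chosen for this sketch.
    """
    source = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", in_topic)
        .load()
    )

    # Only map-like operations (projections, selections) are used here,
    # since aggregations are not supported in continuous mode.
    projected = source.selectExpr(
        "CAST(key AS STRING) AS key",
        "upper(CAST(value AS STRING)) AS value",
    )

    return (
        projected.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", out_topic)
        .option("checkpointLocation", checkpoint_dir)
        # "1 second" is the checkpoint interval, not a batch interval.
        .trigger(continuous="1 second")
        .start()
    )
```

Replacing the `trigger(continuous=...)` call with `trigger(processingTime=...)` would run the same query in micro-batch mode, which is what makes the choice a latency trade-off rather than a rewrite.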

Stream‑to‑Stream Joins

Spark 2.3 adds native support for joining two streaming DataFrames/Datasets, with both inner and outer joins. A typical real-time use case is ad-revenue analysis, where an impression stream and a click stream are joined on a common key. Stream-to-stream joins:

Cache delayed records until a matching event arrives

Use watermarking to bound state size

Allow trade‑offs between resource usage and latency

Maintain consistent SQL semantics between static and streaming joins
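The buffering and watermark behavior listed above can be illustrated with a toy, pure-Python model. This is not Spark's implementation or API: the function name, the event tuple layout, and the two time parameters are assumptions made for the sketch, and the dropping of late records is left out.

```python
from collections import defaultdict


def join_streams(events, max_click_delay=60, allowed_lateness=10):
    """Toy inner join of impression and click events on ad_id.

    events: iterable of (kind, ad_id, event_time) in arrival order,
            where kind is "impression" or "click" and times are seconds.
    Returns (matches, buffered), where matches is a list of
    (ad_id, impression_time, click_time) and buffered is the number of
    records still held in state at the end.
    """
    impressions = defaultdict(list)  # ad_id -> buffered impression times
    clicks = defaultdict(list)       # ad_id -> buffered click times
    matches = []
    max_event_time = float("-inf")

    for kind, ad_id, t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - allowed_lateness

        if kind == "impression":
            # Match against clicks that arrived before this impression.
            matches += [(ad_id, t, c) for c in clicks[ad_id]
                        if t <= c <= t + max_click_delay]
            impressions[ad_id].append(t)
        else:
            # Match against impressions that arrived before this click.
            matches += [(ad_id, i, t) for i in impressions[ad_id]
                        if i <= t <= i + max_click_delay]
            clicks[ad_id].append(t)

        # Evict state the watermark has passed: it can no longer match.
        for key in list(impressions):
            impressions[key] = [i for i in impressions[key]
                                if i + max_click_delay >= watermark]
        for key in list(clicks):
            clicks[key] = [c for c in clicks[key] if c >= watermark]

    buffered = (sum(len(v) for v in impressions.values())
                + sum(len(v) for v in clicks.values()))
    return matches, buffered
```

A larger `allowed_lateness` keeps more records buffered and catches more late matches, while a smaller one frees state sooner; this is the resource-versus-latency trade-off. In Spark itself the equivalent knobs are a `withWatermark` call on each input and a time-range condition in the join expression.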

Apache Spark and Kubernetes

Spark 2.3 introduces a new Kubernetes scheduler backend, allowing Spark applications to run natively on a Kubernetes cluster, with the driver and executors launched as pods, and to share resources with other workloads. The integration also lets Spark use Kubernetes features such as resource quotas, pluggable authorization, and logging.
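Submitting to the Kubernetes backend uses the usual spark-submit tool with a `k8s://` master URL and a container image setting. The sketch below assembles such a command as an argument list; the helper name, API server address, image name, and jar path are all hypothetical values chosen for illustration.

```python
def build_k8s_submit_command(api_server, image, app_jar, main_class,
                             executors=2, app_name="spark-app"):
    """Return spark-submit arguments for the Kubernetes backend as a list."""
    return [
        "bin/spark-submit",
        "--master", "k8s://" + api_server,
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--class", main_class,
        "--conf", "spark.executor.instances={}".format(executors),
        "--conf", "spark.kubernetes.container.image=" + image,
        app_jar,
    ]


cmd = build_k8s_submit_command(
    api_server="https://k8s.example.com:6443",   # hypothetical API server
    image="registry.example.com/spark:2.3.0",    # hypothetical image
    # Hypothetical path to a jar already present inside the image.
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` from a machine that has a Spark distribution and cluster credentials available.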

Support for Pandas UDFs in PySpark

Pandas UDFs (also called vectorized UDFs) use Apache Arrow to move data between the JVM and Python in columnar batches, so a user-defined function written in Python operates on a whole pandas Series or DataFrame at a time instead of one row at a time. Spark 2.3 supports both scalar and grouped-map Pandas UDFs, which can be significantly faster than traditional row-at-a-time UDFs.
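A scalar Pandas UDF might look like the sketch below. The function names, the DataFrame, and its column names are hypothetical, and running `add_fahrenheit_column` assumes PySpark 2.3 or later with pandas and pyarrow installed on the driver and executors.

```python
def celsius_to_fahrenheit(celsius):
    """Vectorized body: under Spark, `celsius` is a pandas.Series batch."""
    return celsius * 9 / 5 + 32


def add_fahrenheit_column(df, column="temp_c"):
    """Apply the function above as a scalar Pandas UDF.

    `df` is a Spark DataFrame supplied by the caller; the column names
    are placeholders for this sketch.
    """
    # Imported lazily so this module loads even where Spark is absent.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    to_f = pandas_udf(celsius_to_fahrenheit, "double", PandasUDFType.SCALAR)
    return df.withColumn("temp_f", to_f(df[column]))
```

Because the function body is plain arithmetic on its argument, the same code can be unit-tested locally on a pandas Series, or even a single number, before being wrapped as a UDF.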

MLlib Enhancements

Spark 2.3 brings several MLlib improvements, including the ability to use fitted models and pipelines inside Structured Streaming jobs, the new ImageSchema for representing images in DataFrames, and an enhanced Python API for custom algorithms.
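Using a fitted model in a streaming job follows the same `transform` call as in batch code. The following is a minimal sketch under stated assumptions: the function name is hypothetical, `fitted_model` is a model or PipelineModel trained elsewhere, `streaming_df` already contains the columns the model expects, and every stage in the pipeline is assumed to support streaming input.

```python
def score_stream(fitted_model, streaming_df, checkpoint_dir):
    """Apply a fitted model or PipelineModel to a streaming DataFrame.

    All three arguments are supplied by the caller; none of them is
    created here.
    """
    # transform() is the same call used on a static DataFrame.
    predictions = fitted_model.transform(streaming_df)

    return (
        predictions.writeStream
        .format("console")  # print predictions; swap for a real sink
        .outputMode("append")
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```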

These updates collectively make Spark 2.3 a more powerful platform for large‑scale, low‑latency data processing and machine‑learning workloads.
