Apache Spark: Latest Technological Developments and Outlook for Spark 3.0+

This article gives an overview of recent Apache Spark advancements: Delta Lake, Data Source V2, runtime optimizations, relational cache, cloud‑native challenges, AI integration via Project Hydrogen, and the anticipated features of Spark 3.0. It highlights how these innovations address modern data‑warehouse, cloud, and machine‑learning workloads.

Big Data Technology & Architecture

Aug 5, 2019

Alibaba senior technical expert Li Chengxiang delivered a detailed analysis titled "Apache Spark Latest Technological Development and 3.0+ Outlook" at the 2019 Alibaba Cloud Summit in Shanghai, introducing new challenges and progress for Spark in a cloud‑centric IT infrastructure and previewing the upcoming Spark 3.0 features.

1. Improvements and Enhancements for Spark in the Data‑Warehouse Direction

Delta Lake (open‑sourced by Databricks in April 2019) is a storage layer that adds schema management, data updates, and transactional guarantees on top of Spark. Spark Streaming jobs can write into Delta Lake while SparkSQL queries the same tables, which enables real‑time warehousing. It also merges small files automatically and keeps versioned snapshots of the data.
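The transactional and versioned-snapshot ideas can be sketched with an append-only commit log. This is a minimal pure-Python illustration under stated assumptions, not Delta Lake's implementation or API: the class `ToyVersionedTable`, its methods, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Commit:
    """One atomic commit: data files added and data files logically removed."""
    added: Dict[str, List[dict]] = field(default_factory=dict)
    removed: Set[str] = field(default_factory=set)


class ToyVersionedTable:
    """Hypothetical in-memory table backed by a commit log (NOT the Delta Lake API)."""

    def __init__(self) -> None:
        self._log: List[Commit] = []

    def commit(self, added=None, removed=None) -> int:
        """Append one commit and return its version number."""
        self._log.append(Commit(dict(added or {}), set(removed or ())))
        return len(self._log) - 1

    def _replay(self, version: Optional[int]) -> Dict[str, List[dict]]:
        """Replay the log up to `version` (latest if None) to find the live files."""
        if version is None:
            version = len(self._log) - 1
        live: Dict[str, List[dict]] = {}
        for c in self._log[: version + 1]:
            for name in c.removed:
                live.pop(name, None)
            live.update(c.added)
        return live

    def snapshot(self, version: Optional[int] = None) -> List[dict]:
        """Return the rows visible at a given version."""
        live = self._replay(version)
        return [row for name in sorted(live) for row in live[name]]

    def compact(self, new_name: str) -> int:
        """Merge all live small files into one file in a single commit."""
        live = self._replay(None)
        rows = [row for name in sorted(live) for row in live[name]]
        return self.commit(added={new_name: rows}, removed=set(live))


table = ToyVersionedTable()
v0 = table.commit(added={"part-0": [{"id": 1, "city": "Shanghai"}]})
v1 = table.commit(added={"part-1": [{"id": 2, "city": "Hangzhou"}]})
# An "update" replaces a file: add the new file and remove the old one atomically.
v2 = table.commit(added={"part-2": [{"id": 1, "city": "Beijing"}]}, removed={"part-0"})
v3 = table.compact("part-3")

print(table.snapshot(v1))  # older snapshot still shows the pre-update row
print(table.snapshot())    # latest snapshot after update and compaction
```

Because readers only ever replay whole commits, a reader sees either all of an update or none of it, and old versions stay queryable until their files are physically cleaned up.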

Other storage layers, such as Hudi (from Uber) and Iceberg (from Netflix), address similar needs.

Data Source V2 redesigns the data source API to unify batch and streaming, provide more flexible push‑down capabilities, and support richer metadata management (e.g., JSON‑described catalogs).

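Runtime Optimization covers two features. Adaptive Execution adjusts the plan while the job runs: it sizes Reduce tasks dynamically, switches join strategies based on the data sizes observed at runtime, and handles data skew. EMR Runtime Filter collects the join keys of the small table and uses them to filter rows while reading the large table, so less data reaches the join.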
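Both ideas can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Spark's or EMR's actual code: the function names `coalesce_partitions` and `runtime_filter_join` are hypothetical, the greedy grouping rule is one simple way to size Reduce tasks, and a plain set stands in for whatever filter structure a real engine would use.

```python
from typing import List, Tuple


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Greedily group adjacent shuffle partitions so each group stays near `target`."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups


def runtime_filter_join(
    small: List[dict], large: List[dict], key: str
) -> Tuple[List[dict], int]:
    """Join two row lists, filtering the large side by the small side's join keys."""
    # Build the filter from the small table's join keys.
    keys = {row[key] for row in small}
    # Apply the filter while "scanning" the large table, before the join runs.
    scanned = [row for row in large if row[key] in keys]
    small_by_key: dict = {}
    for row in small:
        small_by_key.setdefault(row[key], []).append(row)
    joined = [{**l, **s} for l in scanned for s in small_by_key[l[key]]]
    return joined, len(scanned)


# Seven shuffle partitions (sizes in MB) collapse into four Reduce tasks.
groups = coalesce_partitions([10, 20, 70, 5, 5, 120, 30], target=100)
print(groups)

dim = [{"sku": 1, "name": "a"}, {"sku": 3, "name": "c"}]
fact = [{"sku": i % 5, "qty": i} for i in range(10)]
joined, rows_read = runtime_filter_join(dim, fact, "sku")
print(rows_read, "of", len(fact), "large-table rows reached the join")
```

The partition sizes are only known after the map stage finishes, which is why this kind of decision has to be made at runtime rather than at planning time.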

Spark Relational Cache caches relational data (tables, views, datasets) in memory, HDFS, OSS, etc., and can organize cached data by partitioning, bucketing, sorting, or file indexing to accelerate repeat queries, especially for fixed query patterns.

2. How Spark Addresses Cloud‑Native Challenges

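In cloud environments, separating storage from compute improves cost efficiency but opens performance gaps, because Spark was originally designed around HDFS. Two examples: rename operations are costly on OSS, since object stores typically implement a rename as a copy followed by a delete, and network bandwidth between compute and storage is limited.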
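The rename cost can be made concrete with a toy model. The sketch below assumes a flat key-value store where "directories" are only key prefixes; `ToyObjectStore` and the paths are hypothetical and do not represent the OSS API.

```python
from typing import Dict


class ToyObjectStore:
    """Hypothetical flat key-value store; keys only look like paths."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.bytes_copied = 0

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def rename_prefix(self, src: str, dst: str) -> int:
        """Emulate a directory rename: copy every object, then delete the original."""
        moved = 0
        for key in [k for k in self.objects if k.startswith(src)]:
            data = self.objects[key]
            self.objects[dst + key[len(src):]] = data  # full data copy
            self.bytes_copied += len(data)
            del self.objects[key]
            moved += 1
        return moved


store = ToyObjectStore()
# A job writes its output to a temporary prefix first...
for i in range(4):
    store.put(f"warehouse/t/_temporary/part-{i}", b"x" * 1024)
# ...and "commits" by renaming, which here touches every byte that was written.
moved = store.rename_prefix("warehouse/t/_temporary/", "warehouse/t/")
print(moved, "objects moved,", store.bytes_copied, "bytes copied")
```

On HDFS the same commit is a metadata-only operation, so its cost does not grow with the amount of data written. In this model the cost grows with both the number of files and their total size.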

EMR JindoFS provides a file‑system API and metadata management optimized for Spark, along with local caching to deliver near‑local performance while keeping most data in OSS.

Remote Shuffle Service moves shuffle data to an external service, eliminating the need for large local disks on compute nodes and enabling more elastic scaling.

Spark on Kubernetes has been supported since Spark 2.3. Spark 2.4 added PySpark and R support and client mode, and dynamic allocation and Kerberos integration are upcoming, paving the way for better resource elasticity.

3. Deep Integration of Spark with AI Frameworks

Project Hydrogen bridges Spark’s data‑processing strengths with deep‑learning frameworks (e.g., TensorFlow) through three components:

Barrier Execution: launches all tasks simultaneously, allowing synchronized deep‑learning workloads and collective failure handling.

Accelerator‑Aware Scheduling: detects GPU (or other accelerator) resources via YARN or similar managers and schedules deep‑learning tasks accordingly.

Optimized Data Exchange: uses Apache Arrow to transfer data efficiently between Spark and AI frameworks, supporting both training and inference pipelines.
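The barrier execution model from the first component can be sketched with Python threads. This is a minimal illustration, not Spark's barrier API: `run_barrier_stage` and `train_step` are hypothetical names, threads stand in for Spark tasks, and the averaging step is a toy substitute for a real all-reduce.

```python
import threading
from typing import Callable, List, Tuple


def run_barrier_stage(
    num_tasks: int,
    task: Callable[[int, threading.Barrier], float],
    max_attempts: int = 3,
) -> Tuple[List[float], int]:
    """Start all tasks together; if any task fails, retry the whole stage."""
    for attempt in range(1, max_attempts + 1):
        barrier = threading.Barrier(num_tasks)
        results: List[float] = [0.0] * num_tasks
        errors: List[BaseException] = []

        def worker(rank: int) -> None:
            try:
                results[rank] = task(rank, barrier)
            except Exception as exc:  # peers see BrokenBarrierError here
                errors.append(exc)
                barrier.abort()  # wake any peers blocked at the barrier

        threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_tasks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        if not errors:
            return results, attempt
    raise RuntimeError("barrier stage failed on every attempt")


NUM_TASKS = 4
gradients = [0.0] * NUM_TASKS
failed_once = threading.Event()


def train_step(rank: int, barrier: threading.Barrier) -> float:
    # Simulate one worker failing on the first attempt only.
    if rank == 2 and not failed_once.is_set():
        failed_once.set()
        raise RuntimeError("simulated worker failure")
    gradients[rank] = float(rank + 1)  # local "gradient"
    barrier.wait()                     # nobody proceeds until all tasks arrive
    return sum(gradients) / NUM_TASKS  # toy all-reduce: average of all gradients


averages, attempts = run_barrier_stage(NUM_TASKS, train_step)
print(averages, "after", attempts, "attempts")
```

The contrast with ordinary Spark stages is the retry granularity: a regular stage reruns only the failed task, whereas here one failure discards the whole attempt, because the surviving tasks cannot finish their synchronized step without the missing peer.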

4. Outlook for Spark 3.0 Features

Spark 3.0 is expected to fully deliver Project Hydrogen (GPU‑aware scheduling and Optimized Data Exchange), extend Adaptive Execution for better performance and concurrency, incorporate Data Source V2 for richer plug‑in capabilities, and enhance Spark on Kubernetes with dynamic resource allocation and Kerberos support. Additional improvements include upgraded Hadoop/Hive dependencies and broader SQL compatibility.

Overall, Spark 3.0 represents a major version upgrade that consolidates data‑warehouse, cloud‑native, and AI integration advancements.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
