Highlights from Spark+AI Summit 2018: Hydrogen, MLflow, Delta, Spark 2.3, and Shuffle Optimization
The 2018 Spark+AI Summit in San Francisco showcased Spark's evolution toward unified AI and big‑data processing, introducing the Hydrogen project with gang scheduling, the open‑source MLflow platform, the Delta unified analytics platform, Spark 2.3 enhancements, and Facebook's SOS shuffle I/O optimizations.
The three‑day 2018 Spark Summit, renamed Spark+AI and held June 4‑6 at the Moscone Center in San Francisco, emphasized the growing convergence of big‑data processing and artificial intelligence, signaling Spark's future direction.
Hydrogen was presented as a key initiative to bridge Spark with machine‑learning frameworks. Its core gang‑scheduling component uses a barrier API to mark stages whose tasks must all be scheduled together, giving an "all‑or‑nothing" failure model: if any task fails, the whole stage is retried. A demo snippet illustrates the barrier usage:
```python
# barrier() introduces a new execution mode in which all tasks
# in the stage are scheduled and run together
# runHorovod() launches a Horovod job via MPI from within the tasks
model = digits.repartition(2) \
    .toPandasRdd() \
    .barrier() \
    .mapPartitions(runHorovod) \
    .collect()[0]
```

On Day 2, Databricks CTO Matei Zaharia announced MLflow, an open‑source platform that streamlines the entire machine‑learning lifecycle, from data preparation to model training, testing, and deployment.
Databricks CEO Ali Ghodsi introduced Delta, a unified data‑analytics platform designed to eliminate data silos, bridge the gap between data engineers and data scientists, and simplify productionizing machine‑learning models. He highlighted three major challenges:
Data islands – data not ready for analytics.
Siloed data engineers and data scientists.
Difficulty turning AI applications into production‑ready products.
Delta removes manual ETL steps by allowing both batch and streaming data to be written directly to Delta tables, which support ACID transactions and column indexing. Example SQL for creating and querying a Delta table:
```sql
CREATE TABLE connections
USING delta
AS SELECT * FROM json.`/data/connections`;

SELECT * FROM connections WHERE dest_port = 666;
```

Streaming data can be loaded in real time with:
```sql
INSERT INTO connections SELECT * FROM kafkaStream;
```

The Spark 2.3 release, presented by committer Sameer Mane, brought several notable features: continuous processing for Structured Streaming (cutting end‑to‑end latency to roughly 1 ms), image loading into DataFrames, the DataSource V2 API, pandas UDFs for Python users, and expanded Kubernetes integration (including client mode and dynamic resource allocation).
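Among these, pandas UDFs lend themselves to a small illustration. A real pandas UDF receives and returns `pandas.Series` batches over Arrow; the stdlib‑only sketch below (hypothetical functions, no Spark or pandas required) contrasts the row‑at‑a‑time model of classic Python UDFs with the batch‑at‑a‑time model that makes vectorized UDFs fast:

```python
# Row-at-a-time: the engine calls the UDF once per value,
# paying serialization and call overhead on every row.
def plus_one_scalar(x):
    return x + 1

# Batch-at-a-time: the engine calls the UDF once per batch of values,
# amortizing that overhead -- the idea behind Spark 2.3 pandas UDFs,
# where each batch would be a pandas.Series backed by Arrow.
def plus_one_batch(batch):
    return [x + 1 for x in batch]

data = list(range(6))
row_result = [plus_one_scalar(x) for x in data]

batch_result = []
BATCH_SIZE = 3
for i in range(0, len(data), BATCH_SIZE):
    batch_result.extend(plus_one_batch(data[i:i + BATCH_SIZE]))

# Same answer either way; the batch path just makes far fewer calls.
assert row_result == batch_result == [1, 2, 3, 4, 5, 6]
```

In Spark the batch function would additionally operate on columnar buffers, so `x + 1` becomes a single vectorized operation rather than a Python loop.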
Facebook shared research on optimizing shuffle I/O. An SOS shuffle service on each node, coordinated by an SOS merge scheduler on the driver, merges small shuffle files into larger ones, reducing the number of read operations from M × R to (M × R) / merge‑factor (for M mappers and R reducers) while preserving the partitioned layout.
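A toy model of that merge (plain Python with in‑memory "files" and made‑up record names, not Facebook's implementation): each of M mappers writes one small output per reducer, and grouping every merge‑factor mapper outputs for the same reducer into one file cuts the file count by the merge factor without mixing reducer partitions:

```python
M, R, MERGE_FACTOR = 8, 4, 4

# Each mapper writes one small shuffle file per reducer: M * R files total.
shuffle_files = {(m, r): [f"rec-{m}-{r}"] for m in range(M) for r in range(R)}
assert len(shuffle_files) == M * R  # 32 small files

# SOS-style merge: concatenate MERGE_FACTOR mapper outputs that belong
# to the SAME reducer partition into one larger file.
merged = {}
for r in range(R):
    for g in range(0, M, MERGE_FACTOR):
        records = []
        for m in range(g, g + MERGE_FACTOR):
            records.extend(shuffle_files[(m, r)])
        merged[(g // MERGE_FACTOR, r)] = records

# Reducers now open (M * R) / MERGE_FACTOR files instead of M * R.
assert len(merged) == (M * R) // MERGE_FACTOR  # 8 merged files
# Partitioned layout preserved: reducer r still sees only its own records.
assert all(rec.endswith(f"-{r}")
           for (_, r), recs in merged.items() for rec in recs)
```

The win is the same one the talk described: fewer, larger sequential reads per reducer in place of many tiny random ones.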
The article concludes with links to previous Spark Summit experiences and a brief note that Liulishuo's data team is hiring engineers experienced with Hadoop, Spark, Kafka, and Presto.
Liulishuo Tech Team
Help everyone become a global citizen!