Highlights from Spark+AI Summit 2018: Hydrogen, MLflow, Delta, Spark 2.3, and Shuffle Optimization
The 2018 Spark+AI Summit in San Francisco showcased Spark's evolution toward unified AI and big‑data processing, introducing the Hydrogen project with gang scheduling, the open‑source MLflow platform, the Delta unified analytics platform, Spark 2.3 enhancements, and Facebook's SOS shuffle I/O optimizations.
The three‑day 2018 Spark Summit, renamed Spark+AI and held June 4‑6 at the Moscone Center in San Francisco, emphasized the growing convergence of big‑data processing and artificial intelligence, signaling Spark's future direction.
Hydrogen was presented as a key initiative to bridge Spark with machine‑learning frameworks. Its core gang‑scheduling component uses a barrier API to mark stages whose tasks must all be scheduled together, giving an "all‑or‑nothing" failure model: if any task fails, the whole stage is retried. A demo snippet illustrates the barrier usage:
```python
# barrier() introduces a new execution mode in which all tasks
# in the stage are scheduled and run together
# runHorovod() launches a Horovod job via MPI from within the tasks
model = digits.repartition(2) \
    .toPandasRdd() \
    .barrier() \
    .mapPartitions(runHorovod) \
    .collect()[0]
```

On Day 2, Databricks CTO Matei Zaharia announced MLflow, an open‑source platform that streamlines the entire machine‑learning lifecycle, from data preparation to model training, testing, and deployment.
Databricks CEO Ali Ghodsi introduced Delta, a unified data‑analytics platform designed to eliminate data silos, bridge the gap between data engineers and data scientists, and simplify productionizing machine‑learning models. He highlighted three major challenges:
Data islands – data not ready for analytics.
Siloed data engineers and data scientists.
Difficulty turning AI applications into production‑ready products.
Delta removes manual ETL steps by allowing both batch and streaming data to be written directly to Delta tables, which support ACID transactions and column indexing. Example SQL for creating and querying a Delta table:
```sql
CREATE TABLE connections
USING delta
AS SELECT * FROM json.`/data/connections`;

SELECT * FROM connections WHERE dest_port = 666;
```

Streaming data can be loaded in real time with:
```sql
INSERT INTO connections SELECT * FROM kafkaStream;
```

The Spark 2.3 release, presented by committer Sameer Mane, brought several notable features: continuous processing for Structured Streaming (cutting end‑to‑end latency to roughly 1 ms), image loading into DataFrames, the DataSource V2 API, pandas UDFs for Python users, and expanded Kubernetes integration (including client mode and dynamic resource allocation).
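Among these, pandas UDFs lend themselves to a small illustration. A real pandas UDF receives and returns `pandas.Series` batches over Arrow; the stdlib‑only sketch below (hypothetical functions, no Spark or pandas required) contrasts the row‑at‑a‑time model of classic Python UDFs with the batch‑at‑a‑time model that makes vectorized UDFs fast:

```python
# Row-at-a-time: the engine calls the UDF once per value,
# paying serialization and call overhead on every row.
def plus_one_scalar(x):
    return x + 1

# Batch-at-a-time: the engine calls the UDF once per batch of values,
# amortizing that overhead -- the idea behind Spark 2.3 pandas UDFs,
# where each batch would be a pandas.Series backed by Arrow.
def plus_one_batch(batch):
    return [x + 1 for x in batch]

data = list(range(6))
row_result = [plus_one_scalar(x) for x in data]

batch_result = []
BATCH_SIZE = 3
for i in range(0, len(data), BATCH_SIZE):
    batch_result.extend(plus_one_batch(data[i:i + BATCH_SIZE]))

# Same answer either way; the batch path just makes far fewer calls.
assert row_result == batch_result == [1, 2, 3, 4, 5, 6]
```

In Spark the batch function would additionally operate on columnar buffers, so `x + 1` becomes a single vectorized operation rather than a Python loop.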
Facebook shared research on optimizing shuffle I/O. An SOS shuffle service on each node, coordinated by an SOS merge scheduler on the driver, merges small shuffle files into larger ones, reducing the number of read operations from M × R to (M × R) / merge‑factor (for M mappers and R reducers) while preserving the partitioned layout.
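A toy model of that merge (plain Python with in‑memory "files" and made‑up record names, not Facebook's implementation): each of M mappers writes one small output per reducer, and grouping every merge‑factor mapper outputs for the same reducer into one file cuts the file count by the merge factor without mixing reducer partitions:

```python
M, R, MERGE_FACTOR = 8, 4, 4

# Each mapper writes one small shuffle file per reducer: M * R files total.
shuffle_files = {(m, r): [f"rec-{m}-{r}"] for m in range(M) for r in range(R)}
assert len(shuffle_files) == M * R  # 32 small files

# SOS-style merge: concatenate MERGE_FACTOR mapper outputs that belong
# to the SAME reducer partition into one larger file.
merged = {}
for r in range(R):
    for g in range(0, M, MERGE_FACTOR):
        records = []
        for m in range(g, g + MERGE_FACTOR):
            records.extend(shuffle_files[(m, r)])
        merged[(g // MERGE_FACTOR, r)] = records

# Reducers now open (M * R) / MERGE_FACTOR files instead of M * R.
assert len(merged) == (M * R) // MERGE_FACTOR  # 8 merged files
# Partitioned layout preserved: reducer r still sees only its own records.
assert all(rec.endswith(f"-{r}")
           for (_, r), recs in merged.items() for rec in recs)
```

The win is the same one the talk described: fewer, larger sequential reads per reducer in place of many tiny random ones.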
The article concludes with links to previous Spark Summit experiences and a brief note that Liulishuo's data team is hiring engineers experienced with Hadoop, Spark, Kafka, and Presto.
Liulishuo Tech Team
Help everyone become a global citizen!