How Distributed Machine Learning Platforms Compare: Spark, PMLS, TensorFlow

This article surveys distributed machine‑learning platforms, classifies them into basic data‑flow, parameter‑server, and advanced data‑flow models, examines Spark, PMLS (Petuum), TensorFlow and MXNet, presents performance comparisons on EC2 instances, and discusses bottlenecks, fault tolerance, and future research directions.

21CTO

Abstract

Machine learning, especially deep learning (DL), has recently achieved notable success in speech recognition, image recognition, natural language processing, recommendation and search engines, and more. These technologies have promising applications in autonomous vehicles, digital health systems, CRM, advertising, and IoT, and as a result many machine‑learning platforms are emerging.

Paper Overview

This paper surveys the design methods of several distributed machine learning platforms and proposes future research directions. It was written in the fall of 2016 by the author together with students Kuo Zhang and Salem Alqahtani and was presented at ICCCN'17 (International Conference on Computer Communications and Networks).

Why Distributed Platforms?

Training large models on large datasets requires massive computational resources, so machine‑learning platforms are typically distributed and run tens to hundreds of jobs in parallel. Machine‑learning workloads are expected to make up the majority of data‑center workloads in the near future.

Classification of Design Methods

From a distributed‑systems perspective, the platforms are classified into three basic design approaches:

Basic data‑flow

Parameter‑server model

Advanced data‑flow

Basic Data‑Flow: Spark

In Spark, computation is modeled as a directed acyclic graph (DAG) where each vertex represents a Resilient Distributed Dataset (RDD) and each edge represents an operation on an RDD. Transformations (e.g., map, filter, join) create new RDDs, while actions trigger execution.
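The split between lazy transformations and execution‑triggering actions can be illustrated with a minimal sketch. It assumes a single process with no partitioning, and `ToyRDD` is a hypothetical class that only mimics the idea; it is not Spark's API.

```python
# Toy, single-process imitation of Spark's lazy RDD lineage.
# ToyRDD is a hypothetical class for illustration, not Spark's API.
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    def __init__(self, source: Optional[Iterable] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[List], List]] = None,
                 name: str = "source"):
        self.source = list(source) if source is not None else None
        self.parent = parent  # DAG edge back to the RDD this one derives from
        self.op = op          # how to compute this RDD from its parent's data
        self.name = name

    # Transformations only add a vertex to the DAG; nothing is computed yet.
    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda xs: [f(x) for x in xs], name="map")

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda xs: [x for x in xs if pred(x)],
                      name="filter")

    def lineage(self) -> List[str]:
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))

    # Actions walk the lineage and evaluate it. As in Spark without caching,
    # every action recomputes the whole chain.
    def collect(self) -> List:
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.collect())

    def count(self) -> int:
        return len(self.collect())


if __name__ == "__main__":
    calls = []

    def square(x):
        calls.append(x)
        return x * x

    rdd = ToyRDD(range(6)).map(square).filter(lambda v: v % 2 == 0)
    print(len(calls))     # 0: only the DAG has been built
    print(rdd.lineage())  # ['source', 'map', 'filter']
    print(rdd.collect())  # [0, 4, 16]
    print(len(calls))     # 6
```

No user function runs until `collect` is called, which is what lets Spark inspect and optimize the whole DAG before executing it.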

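Spark compiles the DAG into stages, each executed as a set of parallel tasks, one per partition. Narrow dependencies (each parent partition feeds at most one child partition, as in map or filter) can be pipelined within a stage. Wide dependencies (as in groupByKey) end a stage and require a shuffle, whose data movement and communication overhead create bottlenecks.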
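How wide dependencies cut a job into stages can be sketched as follows. This assumes a simple linear chain of operations; `split_into_stages` and the `WIDE_OPS` set are hypothetical and far simpler than Spark's real scheduler, which works on a general DAG.

```python
# Sketch: split a linear chain of RDD operations into stages, cutting at
# wide (shuffle) dependencies. Op names and WIDE_OPS are illustrative
# assumptions; Spark's scheduler handles general DAGs.
from typing import List

# Operations whose output partitions need data from many input partitions.
# (join is treated as wide here, assuming its inputs are not co-partitioned.)
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "repartition"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            # Shuffle boundary: this stage cannot start until every task of
            # the previous stage has produced its shuffle output.
            stages.append([])
        # Narrow ops join the current stage and are pipelined per partition.
        stages[-1].append(op)
    return stages


if __name__ == "__main__":
    pipeline = ["textFile", "map", "filter", "reduceByKey", "map", "join", "map"]
    for i, stage in enumerate(split_into_stages(pipeline)):
        print(i, stage)
    # 0 ['textFile', 'map', 'filter']
    # 1 ['reduceByKey', 'map']
    # 2 ['join', 'map']
```

Each stage would then run as one task per partition; only the boundaries between stages move data across the network.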

Although Spark was designed for general data processing, its MLlib library supports machine‑learning tasks. In the basic setup, model parameters reside on the driver, and workers communicate with the driver every iteration to update them. In large‑scale deployments where the parameters no longer fit on the driver, they are stored as an RDD instead. Because RDDs are immutable, each iteration must create a new RDD to hold the updated parameters, and the update shuffles data across machines and disks, which limits scalability. Spark’s basic data‑flow model therefore handles the iterative workloads typical of machine learning poorly.
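The driver‑centric loop described above can be sketched in plain Python. This is a minimal single‑process illustration: lists of examples stand in for partitions on executors, and the function names are hypothetical rather than MLlib's API. Each iteration "broadcasts" the weights, collects one partial gradient per partition, and updates the model on the driver.

```python
# Sketch of driver-centric training for logistic regression.
# Partitions are plain lists standing in for Spark executors (assumption).
import math
from typing import List, Tuple

Example = Tuple[List[float], int]  # (features, label in {0, 1})


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def partial_gradient(w: List[float], partition: List[Example]) -> List[float]:
    # Logistic-loss gradient summed over one partition (the "worker" side).
    g = [0.0] * len(w)
    for x, y in partition:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g


def train(partitions: List[List[Example]], dim: int,
          lr: float = 0.5, iterations: int = 100) -> List[float]:
    w = [0.0] * dim                       # the model lives on the driver
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # One round trip per iteration: send w out, gather partial gradients.
        grads = [partial_gradient(w, p) for p in partitions]
        total = [sum(col) for col in zip(*grads)]
        w = [wi - lr * gi / n for wi, gi in zip(w, total)]
    return w


if __name__ == "__main__":
    # Feature vector is [bias, value]; the label is 1 when value > 0.
    data = [([1.0, v], int(v > 0)) for v in (-3.0, -2.0, -1.0, 1.0, 2.0, 3.0)]
    w = train([data[0::2], data[1::2]], dim=2)
    print(w)
```

The per‑iteration round trip through a single driver is exactly the cost that grows with model size and iteration count.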

Parameter‑Server Model: PMLS

PMLS (Petuum) was built specifically for machine learning and introduces a parameter‑server (PS) abstraction for iterative training. Each node acts both as a primary for a shard of the model and as a replica for other shards, allowing easy scaling by adding nodes.
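Having every node act as primary for some shards and replica for others can be illustrated with a small placement sketch. The source does not describe PMLS's actual placement policy, so the hash‑plus‑successor scheme and the `place` and `roles` functions here are hypothetical.

```python
# Sketch: map each parameter key to a primary node plus replica nodes.
# Hash-based placement with successor replicas is an assumption; the
# source does not specify how PMLS assigns shards.
import zlib
from typing import Dict, List, Tuple


def place(key: str, num_nodes: int, num_replicas: int) -> Tuple[int, List[int]]:
    if not 0 <= num_replicas < num_nodes:
        raise ValueError("need 0 <= num_replicas < num_nodes")
    # crc32 is stable across processes, unlike Python's salted str hash.
    primary = zlib.crc32(key.encode()) % num_nodes
    replicas = [(primary + i) % num_nodes for i in range(1, num_replicas + 1)]
    return primary, replicas


def roles(keys: List[str], num_nodes: int,
          num_replicas: int) -> Dict[int, Dict[str, List[str]]]:
    # Show that every node can be primary for some keys, replica for others.
    table = {n: {"primary": [], "replica": []} for n in range(num_nodes)}
    for k in keys:
        p, rs = place(k, num_nodes, num_replicas)
        table[p]["primary"].append(k)
        for r in rs:
            table[r]["replica"].append(k)
    return table


if __name__ == "__main__":
    keys = [f"layer{i}/w" for i in range(6)]
    for node, r in roles(keys, num_nodes=3, num_replicas=1).items():
        print(node, "primary:", r["primary"], "replica:", r["replica"])
```

One design caveat: modulo placement reassigns many keys when a node is added, so systems that scale by adding nodes often use consistent hashing instead.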

PS nodes store and update model parameters and respond to workers’ requests. Workers fetch the latest parameters from local PS replicas and compute on their assigned data partitions.
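The worker/PS interaction can be sketched as a pull, compute, push loop. This is a minimal illustration under stated assumptions: a single in‑process server, a one‑parameter least‑squares model, and hypothetical names (`ParameterServer`, `pull`, `push`, `worker_step`) that are not PMLS's API.

```python
# Sketch of the worker/PS loop: pull parameters, compute a gradient on the
# worker's own partition, push the gradient so the PS applies the update.
# All class and method names here are hypothetical.
from typing import Dict, List, Tuple


class ParameterServer:
    def __init__(self, params: Dict[str, float], lr: float):
        self.params = dict(params)
        self.lr = lr

    def pull(self, keys: List[str]) -> Dict[str, float]:
        return {k: self.params[k] for k in keys}

    def push(self, grads: Dict[str, float]) -> None:
        # The server, not the worker, applies the SGD update.
        for k, g in grads.items():
            self.params[k] -= self.lr * g


def worker_step(ps: ParameterServer, partition: List[Tuple[float, float]]) -> None:
    # Fit y = a * x by least squares using only this worker's data.
    a = ps.pull(["a"])["a"]
    grad = sum(2 * (a * x - y) * x for x, y in partition) / len(partition)
    ps.push({"a": grad})


if __name__ == "__main__":
    ps = ParameterServer({"a": 0.0}, lr=0.05)
    partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
    for _ in range(200):
        for part in partitions:  # workers take turns; a real PS runs them concurrently
            worker_step(ps, part)
    print(round(ps.params["a"], 3))  # 3.0
```

Because workers only ever see parameters through `pull`, the server is free to decide how fresh those parameters are, which is the hook that consistency models such as SSP use.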

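PMLS also adopts Stale Synchronous Parallelism (SSP), which relaxes the strict synchronization of Bulk Synchronous Parallelism (BSP). Under BSP, every worker waits at a barrier at the end of each iteration, so the slowest worker sets the pace. Under SSP, faster workers may run ahead of the slowest by up to a bounded number of iterations (the staleness threshold) and compute on slightly stale parameters. This reduces time lost to stragglers, and iterative machine‑learning training typically still converges under bounded staleness.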
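The SSP admission rule can be sketched with a per‑worker clock table. This sketch assumes a staleness bound `s` and one common convention: a worker may start iteration c only if the slowest worker has completed at least c − s iterations, so `s = 0` reduces to a BSP barrier. `SSPClock` and `run_ahead` are hypothetical names.

```python
# Sketch of the SSP admission rule with staleness bound s.
# A worker may start iteration c only if min(clock) >= c - s; s = 0 is BSP.
from typing import List


class SSPClock:
    def __init__(self, num_workers: int, staleness: int):
        self.clock: List[int] = [0] * num_workers  # iterations finished per worker
        self.s = staleness

    def can_start(self, w: int) -> bool:
        # Worker w's next iteration index is self.clock[w].
        return self.clock[w] - min(self.clock) <= self.s

    def finish(self, w: int) -> None:
        self.clock[w] += 1


def run_ahead(staleness: int, attempts: int = 10) -> int:
    # Worker 0 is fast; worker 1 is a straggler that never finishes.
    # Count the iterations worker 0 completes before it has to wait.
    ssp = SSPClock(num_workers=2, staleness=staleness)
    done = 0
    for _ in range(attempts):
        if not ssp.can_start(0):
            break
        ssp.finish(0)
        done += 1
    return done


if __name__ == "__main__":
    print(run_ahead(0))  # 1: BSP, the fast worker blocks after one iteration
    print(run_ahead(2))  # 3: SSP lets it run ahead before blocking
```

The only difference between the two modes is the threshold in `can_start`, which is why SSP can be viewed as a tunable relaxation of BSP.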

Advanced Data‑Flow: TensorFlow

TensorFlow is the successor to DistBelief, Google’s earlier parameter‑server system. It adopts a data‑flow paradigm in which, unlike Spark’s DAG, the computation graph can contain cycles and mutable state: nodes represent operations, which may hold mutable state, and edges carry multi‑dimensional arrays (tensors).
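What "mutable state in a data‑flow graph" means can be illustrated with a toy evaluator. This sketch models only the state part. It does not model TensorFlow's control‑flow cycles, and all names (`Node`, `Variable`, `assign_sub`, `run`) are hypothetical rather than TensorFlow's API. A `Variable` keeps its value across graph executions, and an assign node updates it whenever it runs.

```python
# Toy data-flow graph with mutable state (not TensorFlow's API).
# A Variable keeps its value across runs; assign_sub mutates it when executed.
from typing import Callable, Dict, List, Optional


class Node:
    def __init__(self, inputs: List["Node"], fn: Callable[..., float]):
        self.inputs, self.fn = inputs, fn


class Variable(Node):
    def __init__(self, value: float):
        super().__init__([], lambda: self.value)
        self.value = value


def const(v: float) -> Node:
    return Node([], lambda: v)


def mul(a: Node, b: Node) -> Node:
    return Node([a, b], lambda x, y: x * y)


def sub(a: Node, b: Node) -> Node:
    return Node([a, b], lambda x, y: x - y)


def assign_sub(var: Variable, delta: Node) -> Node:
    def update(cur: float, d: float) -> float:
        var.value = cur - d  # side effect: the state outlives this run
        return var.value
    return Node([var, delta], update)


def run(node: Node, cache: Optional[Dict[int, float]] = None) -> float:
    # Evaluate inputs before consumers (data-flow order), each node once per run.
    cache = {} if cache is None else cache
    if id(node) not in cache:
        args = [run(n, cache) for n in node.inputs]
        cache[id(node)] = node.fn(*args)
    return cache[id(node)]


if __name__ == "__main__":
    # Minimize (w - 3)^2 by gradient descent; the gradient is 2 * (w - 3).
    w = Variable(0.0)
    grad = mul(const(2.0), sub(w, const(3.0)))
    step = assign_sub(w, mul(const(0.25), grad))
    for _ in range(20):
        run(step)             # each run mutates w in place
    print(round(w.value, 4))  # 3.0
```

Keeping parameters as stateful nodes inside the graph is what lets the same graph express both the computation and the parameter update.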

At the time of the paper, TensorFlow required users to declare a static symbolic graph, which the runtime then rewrites and partitions for distributed execution.
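The partitioning step can be sketched as splitting the graph by device and replacing each cross‑device edge with a send/receive pair, in the spirit of TensorFlow's distributed runtime. This is a simplified illustration: the `partition` function, node names, and placements are hypothetical, and real placement and rewriting involve much more.

```python
# Sketch: partition a graph by device and replace each cross-device edge
# with a send/recv pair. Node names and device strings are made up.
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def partition(placement: Dict[str, str], edges: List[Edge]) -> Dict[str, List[Edge]]:
    subgraphs: Dict[str, List[Edge]] = defaultdict(list)
    for src, dst in edges:
        d_src, d_dst = placement[src], placement[dst]
        if d_src == d_dst:
            # Same device: the edge stays inside one subgraph.
            subgraphs[d_src].append((src, dst))
        else:
            tag = f"{src}->{dst}"
            # Producer side sends the tensor; consumer side receives it.
            subgraphs[d_src].append((src, f"send[{tag}]"))
            subgraphs[d_dst].append((f"recv[{tag}]", dst))
    return dict(subgraphs)


if __name__ == "__main__":
    placement = {"input": "/job:worker/task:0", "matmul": "/job:worker/task:0",
                 "loss": "/job:worker/task:0", "w": "/job:ps/task:0"}
    edges = [("input", "matmul"), ("w", "matmul"), ("matmul", "loss")]
    for device, subgraph in partition(placement, edges).items():
        print(device, subgraph)
```

After the rewrite, each device executes only its own subgraph, and all communication is confined to the explicit send/receive pairs.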

In distributed TensorFlow training, a parameter server stores model parameters while workers perform data‑parallel computation. Going beyond this parameter‑server, data‑parallel setup typically requires writing custom code.

Evaluation Results

We evaluated the platforms on Amazon EC2 m4.xlarge instances (4 vCPUs, 16 GB RAM, 750 Mbps dedicated EBS bandwidth). Two common machine‑learning tasks were used: binary classification with logistic regression, and image classification with a multilayer neural network. The experiments used only a small number of CPU instances and did not include GPUs.

The results show that Spark is slower than PMLS and MXNet on logistic regression. The gap widens for deep neural networks, which run more iterations and therefore pay Spark’s per‑iteration overhead more often. Spark also shows higher CPU utilization, attributable to serialization overhead, which confirms earlier findings.

Conclusion and Future Directions

Distributed machine‑learning applications are embarrassingly parallel and, from a concurrent‑algorithms perspective, not particularly interesting. The parameter‑server approach has become the dominant training paradigm.

Network bandwidth remains the primary bottleneck, although in Spark's case CPU overhead becomes the bottleneck before the network does. Rather than building more general data‑flow platforms, it is more valuable to develop better data and model partitioning strategies.
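A back‑of‑envelope estimate shows why parameter traffic can dominate. The model below is a simplifying assumption: every worker pushes one full gradient and pulls the full model through a single parameter‑server link each iteration. All input numbers are hypothetical and are not measurements from the paper.

```python
# Back-of-envelope: per-iteration parameter traffic through one PS link.
# Assumes each worker pushes a full gradient and pulls the full model once
# per iteration. All numbers used below are hypothetical.


def sync_seconds(num_params: int, num_workers: int, gbps: float,
                 bytes_per_param: int = 4) -> float:
    # Factor 2 accounts for one push plus one pull per worker.
    bits = 2 * num_params * bytes_per_param * 8 * num_workers
    return bits / (gbps * 1e9)


if __name__ == "__main__":
    # Hypothetical: 25M float32 parameters, 8 workers, one PS link.
    print(round(sync_seconds(25_000_000, 8, 1.0), 2))   # 12.8 (1 Gbps)
    print(round(sync_seconds(25_000_000, 8, 10.0), 2))  # 1.28 (10 Gbps)
```

Under these assumptions, communication time grows linearly with both model size and worker count. This is why sharding parameters across servers, and choosing good data and model partitions, matters more than adding compute.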

Monitoring and performance prediction tools (e.g., Ernest, CherryPick) are needed to address Spark’s CPU bottleneck and to provide runtime elasticity for compute, memory, and network resources.

Open challenges include resource scheduling, runtime performance improvement, and the development of distributed programming abstractions tailored to machine‑learning workloads.
