Distributed Training with MXNet: Data Parallel on Single and Multi‑Node GPUs and Integration with Kubeflow
This article explains how MXNet supports data‑parallel training on single‑machine multi‑GPU and multi‑machine multi‑GPU setups, describes KVStore modes, outlines the worker‑server‑scheduler architecture, and shows how to launch large‑scale distributed training using Kubeflow and the mxnet‑operator.
MXNet is a flexible and efficient deep‑learning library that supports data‑parallel training on both single‑machine multi‑GPU and multi‑machine multi‑GPU configurations.
In single‑machine multi‑GPU training, GPUs are indexed from 0 and selected via the context argument in code or the --gpus command‑line flag. For example, context=[mx.gpu(0), mx.gpu(2)] (or --gpus 0,2) trains on GPU 0 and GPU 2.
If GPUs have different compute capabilities, workload can be balanced using the work_load_list parameter, e.g., work_load_list=[3, 1] when GPU 0 is three times faster than GPU 2.
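The effect of work_load_list is to divide each batch across devices in proportion to the listed weights. A plain‑Python sketch of that split (no MXNet required; the function name and remainder policy are illustrative, not MXNet's internal code):

```python
def split_batch(batch_size, work_load_list):
    """Split a batch across devices in proportion to work_load_list,
    assigning any rounding remainder to the last device."""
    total = sum(work_load_list)
    sizes = [batch_size * w // total for w in work_load_list]
    sizes[-1] += batch_size - sum(sizes)  # keep the total equal to batch_size
    return sizes

# GPU 0 is three times faster than GPU 2, so it receives 3/4 of each batch.
print(split_batch(128, [3, 1]))  # → [96, 32]
```

With work_load_list=[3, 1] and a batch of 128, the faster GPU processes 96 samples per batch and the slower one 32, so both finish at roughly the same time.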
Gradient aggregation and model updates are controlled by creating different KVStore types with mx.kvstore.create(type) or the --kv-store flag.
Two common KVStore modes are:
local: gradients are copied to CPU memory for aggregation and weight updates, and the updated weights are then copied back to each GPU worker.
device: aggregation and updates occur on the GPUs; if peer‑to‑peer communication (over PCIe or NVLink) is available, the CPU copy overhead is avoided, which gives better performance, especially with four or more GPUs.
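The difference between the two modes is where the summation happens, not what it computes. A toy sketch in plain Python (lists stand in for device buffers; these functions are illustrative, not MXNet APIs):

```python
def aggregate_local(per_gpu_grads):
    """'local' mode: copy all gradients to CPU, sum there, copy result back."""
    cpu_sum = [sum(vals) for vals in zip(*per_gpu_grads)]  # aggregate on CPU
    return [list(cpu_sum) for _ in per_gpu_grads]          # one copy back per GPU

def aggregate_device(per_gpu_grads):
    """'device' mode: sum directly across GPU buffers via peer-to-peer copies,
    skipping the round trip through CPU memory."""
    acc = list(per_gpu_grads[0])
    for grads in per_gpu_grads[1:]:
        acc = [a + b for a, b in zip(acc, grads)]
    return [list(acc) for _ in per_gpu_grads]

grads = [[1.0, 2.0], [3.0, 4.0]]  # gradients computed on 2 GPUs
print(aggregate_local(grads))     # → [[4.0, 6.0], [4.0, 6.0]]
```

Both modes produce the same aggregated gradient; device mode simply keeps the traffic on the GPU interconnect.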
For multi‑machine training, MXNet uses three process types:
Worker: pulls model parameters from servers, processes a batch, computes gradients, and pushes them back.
Server: stores model parameters and communicates with workers.
Scheduler: coordinates the cluster, registers workers and servers, and facilitates node discovery.
The training workflow is: workers and servers register with the scheduler, workers pull parameters, compute gradients, push them to servers, servers update parameters, and the cycle repeats.
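The pull/push cycle above can be modeled in a few lines of plain Python (a toy model of the protocol, not the MXNet implementation; the scheduler's registration step is elided):

```python
class Server:
    """Holds the model parameters and applies pushed gradients (plain SGD)."""
    def __init__(self, params, lr=0.5):
        self.params, self.lr = list(params), lr

    def push(self, grads):
        # Servers update parameters with the gradients workers push.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]

    def pull(self):
        return list(self.params)

def worker_step(server, grads):
    """One worker iteration: pull parameters, compute gradients
    (precomputed here for brevity), push them back."""
    params = server.pull()  # pull the latest parameters
    server.push(grads)      # push gradients computed from one batch
    return params

server = Server([1.0, 1.0])
for grads in ([0.5, 0.5], [0.5, 0.5]):  # two training iterations
    worker_step(server, grads)
print(server.pull())  # → [0.5, 0.5]
```

In real training the worker's gradient computation sits between the pull and the push, and many workers drive the same set of servers concurrently.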
MXNet provides a KVStore abstraction built on a parameter‑server architecture with a two‑level communication hierarchy (intra‑node and inter‑node). Different consistency models can be applied to each level.
Distributed training modes are selected by creating a KVStore with the “dist” keyword, for example:
kv = mxnet.kvstore.create('dist_sync')

Supported modes include:
dist_sync: synchronous training; all workers must finish the current batch before the next batch starts.
dist_async: asynchronous training; servers update parameters as soon as gradients arrive.
dist_sync_device: same as dist_sync, but aggregation and updates occur on the GPUs.
dist_async_device: the asynchronous counterpart of the device‑based mode.
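The semantic difference between dist_sync and dist_async can be illustrated with a toy update loop (plain Python, not MXNet internals; function names are illustrative):

```python
def sync_update(param, worker_grads, lr=0.5):
    """dist_sync: wait for every worker's gradient, aggregate, update once."""
    return param - lr * sum(worker_grads)

def async_update(param, worker_grads, lr=0.5):
    """dist_async: apply each worker's gradient as soon as it arrives,
    so later workers see already-updated parameters."""
    for g in worker_grads:
        param -= lr * g
    return param

# With plain SGD the two arrive at the same value; the practical difference
# is that async workers may compute gradients against stale parameters.
print(sync_update(1.0, [0.25, 0.5]), async_update(1.0, [0.25, 0.5]))  # → 0.625 0.625
```

Synchronous mode trades some throughput (waiting for stragglers) for deterministic, fresh‑gradient updates; asynchronous mode keeps servers busy at the cost of gradient staleness.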
MXNet ships a tools/launch.py script for launching distributed jobs under several cluster managers (SSH, MPI via mpirun, YARN, SGE). As an example, the Gluon image_classification.py script can train a VGG‑11 model on the CIFAR‑10 dataset across multiple machines.
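With the SSH launcher, a typical invocation looks like the following sketch (the hostfile path, worker/server counts, and script flags are illustrative and depend on the script version in use):

```shell
# Launch 2 workers and 2 servers over SSH; 'hosts' lists one machine per line.
python tools/launch.py -n 2 -s 2 --launcher ssh -H hosts \
    python image_classification.py --dataset cifar10 --model vgg11 \
        --kvstore dist_sync_device --gpus 0,1
```

launch.py starts the scheduler itself, then spawns the requested worker and server processes on the listed hosts with the environment variables each role needs.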
For large‑scale training, MXNet can be integrated with Kubeflow. After deploying Kubeflow, the mxnet‑operator is installed and verified with:
kubectl get crd

A distributed training task is defined using a PersistentVolume (CephFS) for shared data and a YAML manifest (insightface-train.yaml).
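Such a manifest follows the mxnet‑operator's MXJob custom resource. A minimal sketch of its general shape (the job name, container image, and replica counts here are placeholders, not taken from the original manifest):

```yaml
apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  name: mxnet-dist-train        # placeholder name
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1               # exactly one scheduler coordinates the cluster
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxnet-train:gpu   # placeholder image
    Server:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxnet-train:gpu
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxnet-train:gpu
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU per worker pod
```

The operator reads the Scheduler/Server/Worker replica specs and wires up the corresponding MXNet process roles automatically.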
The job is launched with:

kubectl create -f insightface-train.yaml

Training progress can be monitored via Kubernetes commands or by inspecting container logs, e.g.:
docker logs -f fc3d73161b27

In summary, while MXNet combined with Kubeflow enables large‑scale distributed training, performance is affected by factors such as network bandwidth (Ethernet vs. InfiniBand/RoCE), gradient compression, storage I/O, and hardware choices.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.