Distributed Training with MXNet: Data Parallel on Single and Multi‑Node GPUs and Integration with Kubeflow
This article explains how MXNet supports data‑parallel training on single‑machine multi‑GPU and multi‑machine multi‑GPU setups, describes KVStore modes, outlines the worker‑server‑scheduler architecture, and shows how to launch large‑scale distributed training using Kubeflow and the mxnet‑operator.
MXNet is a flexible and efficient deep‑learning library that supports data‑parallel training on both single‑machine multi‑GPU and multi‑machine multi‑GPU configurations.
In single‑machine multi‑GPU training, GPUs are indexed from 0 and selected via the context argument in code or the --gpus command‑line flag. For example, context=[mx.gpu(0), mx.gpu(2)] (or --gpus 0,2) trains on GPU 0 and GPU 2.
If GPUs have different compute capabilities, workload can be balanced using the work_load_list parameter, e.g., work_load_list=[3, 1] when GPU 0 is three times faster than GPU 2.
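The effect of work_load_list is to divide each batch across devices in proportion to the listed weights. A plain‑Python sketch of that split (no MXNet required; the function name and remainder policy are illustrative, not MXNet's internal code):

```python
def split_batch(batch_size, work_load_list):
    """Split a batch across devices in proportion to work_load_list,
    assigning any rounding remainder to the last device."""
    total = sum(work_load_list)
    sizes = [batch_size * w // total for w in work_load_list]
    sizes[-1] += batch_size - sum(sizes)  # keep the total equal to batch_size
    return sizes

# GPU 0 is three times faster than GPU 2, so it receives 3/4 of each batch.
print(split_batch(128, [3, 1]))  # → [96, 32]
```

With work_load_list=[3, 1] and a batch of 128, the faster GPU processes 96 samples per batch and the slower one 32, so both finish at roughly the same time.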
Gradient aggregation and model updates are controlled by creating different KVStore types with mx.kvstore.create(type) or the --kv-store flag.
Two common KVStore modes are:
local: gradients are copied to CPU memory for aggregation and weight updates, and the updated weights are then copied back to each GPU worker.
device: aggregation and updates occur on the GPUs; if peer‑to‑peer communication (over PCIe or NVLink) is available, the CPU copy overhead is avoided, which gives better performance, especially with four or more GPUs.
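The difference between the two modes is where the summation happens, not what it computes. A toy sketch in plain Python (lists stand in for device buffers; these functions are illustrative, not MXNet APIs):

```python
def aggregate_local(per_gpu_grads):
    """'local' mode: copy all gradients to CPU, sum there, copy result back."""
    cpu_sum = [sum(vals) for vals in zip(*per_gpu_grads)]  # aggregate on CPU
    return [list(cpu_sum) for _ in per_gpu_grads]          # one copy back per GPU

def aggregate_device(per_gpu_grads):
    """'device' mode: sum directly across GPU buffers via peer-to-peer copies,
    skipping the round trip through CPU memory."""
    acc = list(per_gpu_grads[0])
    for grads in per_gpu_grads[1:]:
        acc = [a + b for a, b in zip(acc, grads)]
    return [list(acc) for _ in per_gpu_grads]

grads = [[1.0, 2.0], [3.0, 4.0]]  # gradients computed on 2 GPUs
print(aggregate_local(grads))     # → [[4.0, 6.0], [4.0, 6.0]]
```

Both modes produce the same aggregated gradient; device mode simply keeps the traffic on the GPU interconnect.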
For multi‑machine training, MXNet uses three process types:
Worker: pulls model parameters from servers, processes a batch, computes gradients, and pushes them back.
Server: stores model parameters and communicates with workers.
Scheduler: coordinates the cluster, registers workers and servers, and facilitates node discovery.
The training workflow is: workers and servers register with the scheduler, workers pull parameters, compute gradients, push them to servers, servers update parameters, and the cycle repeats.
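The pull/push cycle above can be modeled in a few lines of plain Python (a toy model of the protocol, not the MXNet implementation; the scheduler's registration step is elided):

```python
class Server:
    """Holds the model parameters and applies pushed gradients (plain SGD)."""
    def __init__(self, params, lr=0.5):
        self.params, self.lr = list(params), lr

    def push(self, grads):
        # Servers update parameters with the gradients workers push.
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]

    def pull(self):
        return list(self.params)

def worker_step(server, grads):
    """One worker iteration: pull parameters, compute gradients
    (precomputed here for brevity), push them back."""
    params = server.pull()  # pull the latest parameters
    server.push(grads)      # push gradients computed from one batch
    return params

server = Server([1.0, 1.0])
for grads in ([0.5, 0.5], [0.5, 0.5]):  # two training iterations
    worker_step(server, grads)
print(server.pull())  # → [0.5, 0.5]
```

In real training the worker's gradient computation sits between the pull and the push, and many workers drive the same set of servers concurrently.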
MXNet provides a KVStore abstraction built on a parameter‑server architecture with a two‑level communication hierarchy (intra‑node and inter‑node). Different consistency models can be applied to each level.
Distributed training modes are selected by creating a KVStore with the “dist” keyword, for example:
kv = mxnet.kvstore.create('dist_sync')

Supported modes include:
dist_sync: synchronous training; all workers must finish the current batch before the next batch starts.
dist_async: asynchronous training; servers update parameters as soon as gradients arrive.
dist_sync_device: same as dist_sync, but aggregation and updates occur on the GPUs.
dist_async_device: the asynchronous counterpart of the device‑based mode.
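The semantic difference between dist_sync and dist_async can be illustrated with a toy update loop (plain Python, not MXNet internals; function names are illustrative):

```python
def sync_update(param, worker_grads, lr=0.5):
    """dist_sync: wait for every worker's gradient, aggregate, update once."""
    return param - lr * sum(worker_grads)

def async_update(param, worker_grads, lr=0.5):
    """dist_async: apply each worker's gradient as soon as it arrives,
    so later workers see already-updated parameters."""
    for g in worker_grads:
        param -= lr * g
    return param

# With plain SGD the two arrive at the same value; the practical difference
# is that async workers may compute gradients against stale parameters.
print(sync_update(1.0, [0.25, 0.5]), async_update(1.0, [0.25, 0.5]))  # → 0.625 0.625
```

Synchronous mode trades some throughput (waiting for stragglers) for deterministic, fresh‑gradient updates; asynchronous mode keeps servers busy at the cost of gradient staleness.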
MXNet ships a tools/launch.py script for launching distributed jobs under several cluster managers (SSH, MPI via mpirun, YARN, SGE). As an example, the Gluon image_classification.py script can train a VGG‑11 model on the CIFAR‑10 dataset across multiple machines.
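With the SSH launcher, a typical invocation looks like the following sketch (the hostfile path, worker/server counts, and script flags are illustrative and depend on the script version in use):

```shell
# Launch 2 workers and 2 servers over SSH; 'hosts' lists one machine per line.
python tools/launch.py -n 2 -s 2 --launcher ssh -H hosts \
    python image_classification.py --dataset cifar10 --model vgg11 \
        --kvstore dist_sync_device --gpus 0,1
```

launch.py starts the scheduler itself, then spawns the requested worker and server processes on the listed hosts with the environment variables each role needs.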
For large‑scale training, MXNet can be integrated with Kubeflow. After deploying Kubeflow, the mxnet‑operator is installed and verified with:
kubectl get crd

A distributed training task is defined using a PersistentVolume (CephFS) for shared data and a YAML manifest (insightface-train.yaml).
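Such a manifest follows the mxnet‑operator's MXJob custom resource. A minimal sketch of its general shape (the job name, container image, and replica counts here are placeholders, not taken from the original manifest):

```yaml
apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  name: mxnet-dist-train        # placeholder name
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1               # exactly one scheduler coordinates the cluster
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxnet-train:gpu   # placeholder image
    Server:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxnet-train:gpu
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: mxnet-train:gpu
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU per worker pod
```

The operator reads the Scheduler/Server/Worker replica specs and wires up the corresponding MXNet process roles automatically.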
The job is launched with:

kubectl create -f insightface-train.yaml

Training progress can be monitored via Kubernetes commands or by inspecting container logs, e.g.:
docker logs -f fc3d73161b27

In summary, while MXNet combined with Kubeflow enables large‑scale distributed training, performance is affected by factors such as network bandwidth (Ethernet vs. InfiniBand/RoCE), gradient compression, storage I/O, and hardware choices.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.