Master Distributed MXNet Training with Kubeflow: A Step‑by‑Step Guide
Learn how to perform single‑machine multi‑GPU and multi‑node multi‑GPU training with MXNet, understand KVStore modes, configure workers, servers, and schedulers, and deploy large‑scale distributed training on Kubernetes using Kubeflow, including operator installation, task creation, and performance considerations.
MXNet Single‑Machine Multi‑GPU Training
MXNet supports training on multiple CPUs and GPUs, which can be spread across different servers. Each GPU on a machine has an index starting from 0. You can select specific GPUs via the context (ctx) argument in code or the --gpus command-line argument; for example, to use GPU 0 and GPU 2 in Python, pass both device contexts when creating the network model.
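In MXNet code this is typically done by passing a list of device contexts, e.g. ctx=[mx.gpu(0), mx.gpu(2)], when binding the model. As a minimal illustration that does not require MXNet or GPUs, the sketch below only models how a mini-batch is partitioned evenly across a chosen set of device indices (the batch contents and device list are made up for the example):

```python
# Conceptual sketch of data-parallel batch splitting across selected GPUs.
# MXNet does the real work when you pass ctx=[mx.gpu(0), mx.gpu(2)]; here we
# only model the even partitioning of one batch across those device indices.

def split_batch(batch, device_ids):
    """Split a batch evenly across the given device indices."""
    n = len(device_ids)
    per_device = len(batch) // n
    shards = {}
    for i, dev in enumerate(device_ids):
        shards[dev] = batch[i * per_device:(i + 1) * per_device]
    return shards

if __name__ == "__main__":
    batch = list(range(8))               # 8 samples in a mini-batch
    shards = split_batch(batch, [0, 2])  # "use GPU 0 and GPU 2"
    print(shards)  # {0: [0, 1, 2, 3], 2: [4, 5, 6, 7]}
```

Each selected device then runs the forward and backward pass on its own shard.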
Workload Balancing for Heterogeneous GPUs
If GPUs have different compute capabilities, you can assign workloads proportionally, e.g., work_load_list=[3, 1] when GPU 0 is three times faster than GPU 2. With identical hyper-parameters, a multi-GPU run should reproduce the results of a single-GPU run, though randomness in data ordering, seeds, or cuDNN may cause minor differences.
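The proportional split can be sketched in a few lines of plain Python (the batch size and ratio below are illustrative; MXNet applies the same idea internally when you pass work_load_list):

```python
# Sketch of proportional workload splitting, mirroring MXNet's work_load_list.
# work_load_list=[3, 1] means the first device receives 3/4 of each batch.

def split_by_workload(batch_size, work_load_list):
    total = sum(work_load_list)
    # Give each device its proportional share; any remainder from integer
    # division goes to the last device so the whole batch is covered.
    shares = [batch_size * w // total for w in work_load_list]
    shares[-1] += batch_size - sum(shares)
    return shares

print(split_by_workload(64, [3, 1]))  # [48, 16]
```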
Two Common KVStore Modes
local : All gradients are copied to CPU memory for aggregation and weight updates, then copied back to each GPU worker. This mode puts most load on the CPU.
device : Gradient aggregation and weight updates happen directly on GPUs. If Peer‑to‑Peer (PCIe or NVLink) is supported, CPU copy overhead is avoided, offering better performance, especially when using four or more GPUs.
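The two modes can be contrasted with a toy model: both produce the same aggregated gradient on every GPU, and they differ only in where the reduction runs and how many gradient copies must cross into CPU memory (the copy count here is a simplification for illustration):

```python
# Toy model of the "local" vs "device" KVStore modes. The numeric result is
# identical either way; "local" pays one gradient copy to CPU memory per GPU,
# while "device" aggregates peer-to-peer on the GPUs.

def aggregate(grads_per_gpu, mode):
    """Return (per-GPU aggregated gradients, number of copies to CPU)."""
    assert mode in ("local", "device")
    copies_to_cpu = len(grads_per_gpu) if mode == "local" else 0
    total = [sum(vals) for vals in zip(*grads_per_gpu)]
    # Every GPU ends up holding the same aggregated gradient.
    return [list(total) for _ in grads_per_gpu], copies_to_cpu

grads = [[1.0, 2.0], [3.0, 4.0]]          # 2 GPUs, 2 parameters each
local_out, local_copies = aggregate(grads, "local")
device_out, device_copies = aggregate(grads, "device")
print(local_out == device_out)            # True: same numbers either way
print(local_copies, device_copies)        # 2 0
```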
Multi‑Node Multi‑GPU Training Overview
MXNet can train across multiple machines, each with multiple GPUs, using a parameter‑server architecture.
Process Types in Distributed Training
Worker : Processes each batch, pulls model parameters from servers, computes gradients, and pushes them back.
Server : Stores model parameters and communicates with workers.
Scheduler : Coordinates the cluster, tracking node status and facilitating communication.
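Under the hood, MXNet's launcher tells each process which of these roles to play through DMLC_* environment variables. As a sketch (the scheduler address, port, and process counts below are illustrative values, not defaults), the environment for each role looks roughly like:

```python
# Sketch of the DMLC_* environment that MXNet's launcher sets per process.
# The scheduler host/port and process counts are made-up example values.

def make_env(role, num_workers=2, num_servers=2,
             scheduler_host="10.0.0.1", scheduler_port=9091):
    assert role in ("scheduler", "server", "worker")
    return {
        "DMLC_ROLE": role,                      # what this process does
        "DMLC_PS_ROOT_URI": scheduler_host,     # where the scheduler listens
        "DMLC_PS_ROOT_PORT": str(scheduler_port),
        "DMLC_NUM_WORKER": str(num_workers),
        "DMLC_NUM_SERVER": str(num_servers),
    }

print(make_env("worker")["DMLC_ROLE"])  # worker
```

Every process receives the same scheduler address, which is how the cluster bootstraps itself.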
KVStore Details
KVStore is MXNet's distributed key‑value store built on top of the parameter server. It manages data consistency via the engine and uses a two‑level communication hierarchy: the first level handles intra‑machine device communication, while the second level manages inter‑machine network traffic, allowing different consistency models per level.
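The push/pull cycle at the heart of KVStore can be sketched with a toy in-memory class (this is a conceptual simulation, not MXNet's API; the learning rate and update rule are simple SGD for illustration):

```python
# Minimal in-memory sketch of the KVStore push/pull cycle: workers push
# gradients keyed by parameter name, the "server" aggregates them and applies
# an SGD-style update, and workers pull the refreshed parameters.

class ToyKVStore:
    def __init__(self, lr=0.125):
        self.params = {}
        self.lr = lr

    def init(self, key, value):
        self.params[key] = value

    def push(self, key, grads):
        # Aggregate gradients from all workers, then update the parameter.
        total = sum(grads)
        self.params[key] -= self.lr * total

    def pull(self, key):
        return self.params[key]

kv = ToyKVStore(lr=0.125)
kv.init("w", 1.0)
kv.push("w", [0.5, 0.5])   # gradients arriving from two workers
print(kv.pull("w"))        # 0.875
```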
Distributed Training Modes
dist_sync : Synchronous training; all workers must finish a batch before the next batch starts.
dist_async : Asynchronous training; servers update parameters as soon as they receive gradients, offering faster iteration but requiring more epochs to converge.
dist_sync_device : Like dist_sync but performs gradient aggregation and weight updates on GPUs instead of CPUs, reducing communication latency at the cost of GPU memory.
dist_async_device : Asynchronous version of the device‑based mode.
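The sync/async distinction above can be sketched as follows: in sync mode the update waits for gradients from every worker and applies one aggregated step, while in async mode each worker's gradient is applied as soon as it arrives (a pure-Python toy, not the real KVStore; the learning rate is an illustrative value):

```python
# Toy contrast between dist_sync and dist_async parameter updates.

def sync_update(param, worker_grads, lr=0.125):
    # dist_sync: wait for all workers, aggregate once, apply one update.
    return param - lr * sum(worker_grads)

def async_update(param, arriving_grads, lr=0.125):
    # dist_async: apply each gradient immediately as it arrives.
    for g in arriving_grads:
        param = param - lr * g
    return param

grads = [0.5, 0.5]
# With plain SGD and fresh gradients both land in the same place; with stale
# gradients or momentum the trajectories diverge, which is why async training
# tends to need more epochs to converge.
print(sync_update(1.0, grads))   # 0.875
print(async_update(1.0, grads))  # 0.875
```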
Launching Distributed Training with Kubeflow
MXNet provides a tools/launch.py script that supports various cluster resources (ssh, mpirun, YARN, SGE). To run large-scale training on Kubernetes, install the mxnet-operator after setting up the Kubeflow environment.
Installing mxnet‑operator
After installing the operator, verify the installation with:
kubectl get crd

If the custom resource definitions are listed, the installation succeeded.
Testing MXNet Distributed Training on Kubeflow
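The job manifest itself is not reproduced here. As a rough, hypothetical sketch of what an insightface-train.yaml for the mxnet-operator can look like (the API version, image name, replica counts, and resource limits are placeholders and should be checked against the operator version you installed):

```yaml
# Hypothetical MXJob sketch; verify fields against your mxnet-operator version.
apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  name: insightface-train
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: example/insightface:latest   # placeholder image
    Server:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: example/insightface:latest   # placeholder image
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: example/insightface:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1               # GPUs per worker pod
```

The operator creates one pod per replica and injects the scheduler/server/worker roles into each.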
Prepare a training Docker image, create a CephFS persistent volume, and define a training job YAML (e.g., insightface-train.yaml ). Deploy the job with:
kubectl create -f insightface-train.yaml

Monitor the job status and view logs using:
docker logs -f fc3d73161b27

Performance Considerations
Beyond functional correctness, factors such as network bandwidth (InfiniBand or RoCE vs. Ethernet), gradient compression, storage performance (SSD vs. HDD, record.io format), and data‑plane optimizations significantly affect distributed training speed and scalability.
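Gradient compression in particular is enabled in MXNet via the KVStore (kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})). The 2-bit scheme can be sketched in plain Python: each gradient value is quantized to one of {-threshold, 0, +threshold}, and the quantization error is kept as a residual that is folded back in before the next step (a conceptual toy; the threshold and gradient values are illustrative):

```python
# Toy sketch of MXNet-style 2-bit gradient compression: values are quantized
# to {-threshold, 0, +threshold}; the quantization error is carried forward
# as a residual so no gradient signal is permanently lost.

def compress_2bit(grad, residual, threshold=0.5):
    compressed, new_residual = [], []
    for g, r in zip(grad, residual):
        v = g + r                       # fold in leftover error
        if v >= threshold:
            q = threshold
        elif v <= -threshold:
            q = -threshold
        else:
            q = 0.0
        compressed.append(q)
        new_residual.append(v - q)      # carry the error forward
    return compressed, new_residual

grad = [0.875, -0.625, 0.125]
compressed, residual = compress_2bit(grad, [0.0, 0.0, 0.0])
print(compressed)  # [0.5, -0.5, 0.0]
print(residual)    # [0.375, -0.125, 0.125]
```

Sending two bits per value instead of 32 trades a small amount of convergence speed for a large cut in network traffic, which matters most on plain Ethernet.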
Summary
MXNet combined with Kubeflow enables large‑scale distributed training, but achieving optimal performance requires careful attention to hardware interconnects, KVStore configuration, gradient compression, and storage I/O characteristics.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.