360 Tech Engineering
May 10, 2019 · Artificial Intelligence
Distributed Training with MXNet: Data Parallel on Single and Multi‑Node GPUs and Integration with Kubeflow
This article explains how MXNet supports data‑parallel training on single‑machine multi‑GPU and multi‑machine multi‑GPU setups, describes KVStore modes, outlines the worker‑server‑scheduler architecture, and shows how to launch large‑scale distributed training using Kubeflow and the mxnet‑operator.
Data ParallelGPUKVStore
0 likes · 11 min read