Distributed Training Orchestration and Scheduling on Kubernetes: Architecture, Challenges, and Solutions
This article examines the pain points of distributed training orchestration and scheduling, presents a layered cloud‑native architecture built on Kubernetes, explains key components such as pipeline orchestrators, training job operators, schedulers, and topology managers, and discusses practical solutions using Argo, Kubeflow Pipelines, and the Volcano scheduler.
Since its founding in 2015, Laiye Technology has built AI products and accumulated large amounts of OCR and NLP data, code, and models, prompting the need to automate model fine‑tuning, evaluation, and deployment on a machine‑learning platform.
In the era of big data and large models, a deep‑learning infrastructure must provide massive compute, storage, and network resources, making a cloud‑native, highly scalable architecture essential. This series focuses on the technical challenges of machine‑learning platforms on top of Kubernetes, covering orchestration, scheduling, storage, communication, and inference. The first article addresses orchestration and scheduling.
Distributed Training Orchestration Pain Points
Different stages of model training require different resources (CPU for data processing, GPU for training); the scheduler must allocate resources automatically.
Large datasets or models need to be sharded across nodes, requiring efficient data transfer while preserving convergence.
Kubernetes schedules at pod granularity, so a distributed job's pods may be placed only partially; when several jobs each hold part of their required resources while waiting for the rest, they contend and can deadlock.
Multi‑team clusters need resource isolation, priority queues, preemption, and fairness within the same priority level.
In public clouds, the scheduler must provision on‑demand or spot instances when reserved capacity is exhausted.
Scheduling should be NUMA‑aware, taking each node's hardware topology into account so that CPUs, GPUs, and NICs that communicate heavily are placed with affinity.
Hierarchical Architecture
Based on the Kubernetes architecture, the platform can be decomposed into several layers:
Pipeline Orchestrator (e.g., Kubeflow Pipelines, Airflow, MLflow) – defines a DAG of components.
Distributed Training Framework (TensorFlow, PyTorch, Horovod) – provides parallel algorithms and communication.
Training Job Operator (TFJob, PyTorchJob, MPIJob) – translates a high‑level job into multiple pods.
Batch Job Scheduler (Volcano, default kube‑scheduler) – performs batch‑aware scheduling, fairness, and gang‑scheduling.
Topology Manager – offers NUMA‑aware placement.
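As a concrete example of that bottom layer, NUMA alignment is usually enabled through kubelet configuration rather than a separate component. A minimal sketch of the relevant KubeletConfiguration fields, assuming the node is managed through a kubelet config file:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserve exclusive CPUs for Guaranteed pods so they can be NUMA-aligned
cpuManagerPolicy: static
# Require CPU and device allocations to come from a single NUMA node
topologyManagerPolicy: single-numa-node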
Kubernetes Fundamentals
The control plane consists of etcd, kube‑apiserver, kube‑controller‑manager, and kube‑scheduler. Each node runs CNI, kubelet, CRI, CSI, and Device Plugin components. Understanding the declarative API (YAML manifests) and the controller pattern is crucial for building custom operators.
Example of a declarative Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

Controllers continuously reconcile the desired state (spec) with the actual state (status), a pattern also used by schedulers and kubelet.
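The loop itself is small enough to sketch. The following minimal Python illustration is not any real controller's code; list_pods, create_pod, and delete_pod are hypothetical stand-ins for API-server calls:

import time

def reconcile(desired_replicas, list_pods, create_pod, delete_pod):
    """One pass: converge the actual pod count toward the spec."""
    pods = list_pods()                  # observe actual state (status)
    if len(pods) < desired_replicas:    # too few pods: create the difference
        for _ in range(desired_replicas - len(pods)):
            create_pod()
    elif len(pods) > desired_replicas:  # too many pods: trim the excess
        for pod in pods[desired_replicas:]:
            delete_pod(pod)

def control_loop(get_spec, list_pods, create_pod, delete_pod):
    """Level-triggered loop: re-read the spec and reconcile forever."""
    while True:
        reconcile(get_spec(), list_pods, create_pod, delete_pod)
        time.sleep(5)  # real controllers react to watch events instead of polling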
Argo and Kubeflow Pipelines
Argo Workflows is a CNCF workflow engine that Kubeflow Pipelines uses as its execution backend. A simple Argo workflow demonstrates parameter passing between steps:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: output-parameter-
spec:
  entrypoint: output-parameter
  templates:
  - name: output-parameter
    steps:
    - - name: generate-parameter
        template: whalesay
    - - name: consume-parameter
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate-parameter.outputs.parameters.hello-param}}"
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["sleep 1; echo -n hello world > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello-param
        valueFrom:
          path: /tmp/hello_world.txt
  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]

Argo passes parameters via etcd (limited to ~1.5 MiB) and files via a sidecar that uploads to MinIO.
Kubeflow Pipeline Enhancements
Python SDK for DAG definition.
Step caching to reuse unchanged component results.
Visualization UI for metrics and TensorBoard.
Components can be defined as Python functions, YAML specifications, or raw Kubernetes resources (CRDs).
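As a sketch of the Python-function route, the KFP v2 SDK lets a DAG be declared roughly as follows; the component bodies and pipeline name here are invented for illustration:

from kfp import compiler, dsl

@dsl.component
def preprocess(rows: int) -> int:
    """Hypothetical CPU-bound step; returns the number of processed rows."""
    return rows * 2

@dsl.component
def train(rows: int) -> str:
    """Hypothetical training step consuming the upstream output."""
    return f"trained on {rows} rows"

@dsl.pipeline(name="toy-training-pipeline")
def toy_pipeline(rows: int = 100):
    prep = preprocess(rows=rows)
    train(rows=prep.output)  # DAG edge: train depends on preprocess

# Compile the DAG into a spec the KFP backend can execute
compiler.Compiler().compile(toy_pipeline, "toy_pipeline.yaml")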
Distributed Training Frameworks
Data Parallelism – split data across workers.
Model Parallelism – split model layers or parameters.
Communication topologies: centralized (parameter server) vs. decentralized (Ring All‑Reduce).
Synchronization strategies: synchronous vs. asynchronous.
Example TensorFlow 1.x parameter‑server code:
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "", "Comma-separated ps host:port pairs")
flags.DEFINE_string("worker_hosts", "", "Comma-separated worker host:port pairs")
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of this task within its job")
FLAGS = flags.FLAGS

def main(_):
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # parameter servers block here, serving variables
    elif FLAGS.job_name == "worker":
        is_chief = (FLAGS.task_index == 0)
        # Variables are placed on ps tasks; ops run on this worker
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            pass  # build model here

if __name__ == "__main__":
    tf.app.run(main=main)

Launching on four machines:
# On ps0.example.com:
python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=ps --task_index=0
# ... similarly for ps1, worker0, worker1

PyTorch DDP example (setup of master address and port):
import os

import torch.distributed as dist

def setup(rank, world_size):
    # All processes rendezvous at the master's address and port
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
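A minimal sketch of how these helpers are typically driven, following the pattern of the official DDP tutorial and reusing setup and cleanup from above (assume the same file); demo_basic and the toy linear model are invented for illustration:

import torch
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_basic(rank, world_size):
    setup(rank, world_size)
    model = nn.Linear(10, 1)   # stand-in for a real model
    ddp_model = DDP(model)     # gradients are all-reduced across ranks
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    loss_fn(outputs, torch.randn(20, 1)).backward()
    optimizer.step()
    cleanup()

if __name__ == "__main__":
    world_size = 2
    # One process per rank; each calls demo_basic(rank, world_size)
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)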
Horovod TensorFlow example (initialization, GPU pinning, and distributed optimizer):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to one GPU, selected by its local rank
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Build dataset, model, optimizer
scaled_lr = 0.001 * hvd.size()  # scale the learning rate with the worker count
opt = tf.optimizers.Adam(scaled_lr)
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=1,
                               average_aggregated_gradients=True)

# Compile and train
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt, metrics=['accuracy'],
                    experimental_run_tf_function=False)
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    hvd.callbacks.MetricAverageCallback(),
    hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr,
                                             warmup_epochs=3, verbose=1),
]
if hvd.rank() == 0:
    # Only the root rank writes checkpoints, to avoid clobbering
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
verbose = 1 if hvd.rank() == 0 else 0
mnist_model.fit(dataset, steps_per_epoch=500 // hvd.size(),
                callbacks=callbacks, epochs=24, verbose=verbose)

Running Horovod on 16 GPUs across four machines:
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

Training Operators and Volcano Scheduler
Training operators (e.g., TFJob) translate high‑level CRDs into the required pods and generate TF_CONFIG for each pod. They also create headless Services so pods can reach each other via stable DNS names.
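As a sketch of what such a CRD looks like in practice, a TFJob manifest along the lines of the Kubeflow training-operator API might resemble the following; the image name and replica counts are illustrative assumptions:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: dist-mnist
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: registry.example.com/dist-mnist:latest  # illustrative image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: registry.example.com/dist-mnist:latest
            resources:
              limits:
                nvidia.com/gpu: 1

For each pod, the operator renders a TF_CONFIG JSON value naming the full cluster plus the pod's own task type and index, which TensorFlow's distributed runtime reads at startup.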
Volcano (formerly kube‑batch) provides gang‑scheduling, bin‑packing, priority/DRF, proportion, and task‑topology plugins to solve deadlocks, resource fragmentation, fairness, and affinity requirements in batch‑oriented ML workloads.
When a TFJob is submitted, the operator creates a PodGroup CRD, binds each pod to that group, and sets schedulerName: volcano. Volcano then runs a scheduling session, enqueues pending jobs, and applies plugins such as:
Gang Plugin – all‑or‑nothing scheduling to avoid deadlocks.
Binpack Plugin – selects nodes with highest resource utilization to reduce fragmentation.
Priority/DRF Plugin – ensures fair‑share across users.
Proportion Plugin – enforces team‑level quotas.
Task‑Topology Plugin – expresses pod‑level affinity/anti‑affinity.
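Outside of an operator, the same gang semantics can be requested by hand. A minimal sketch of a PodGroup and a pod bound to it, with the group name, queue, and image invented for illustration:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-group
spec:
  minMember: 4       # gang: schedule only when all 4 pods can run together
  queue: team-a      # the proportion plugin enforces this queue's quota
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  annotations:
    scheduling.k8s.io/group-name: training-group  # join the gang
spec:
  schedulerName: volcano  # hand this pod to Volcano instead of kube-scheduler
  containers:
  - name: worker
    image: registry.example.com/trainer:latest    # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1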
These mechanisms enable reliable, efficient distributed training on Kubernetes.
Conclusion
The article presented a layered cloud‑native architecture for large‑scale model training, detailed the orchestration, operator, and scheduling challenges, and introduced open‑source solutions such as Argo, Kubeflow Pipelines, and Volcano. Understanding both Kubernetes internals and distributed‑ML theory is essential for building robust MLOps systems.
Laiye Technology Team