
Distributed Training Orchestration and Scheduling on Kubernetes: Architecture, Challenges, and Solutions

This article examines the pain points of distributed training orchestration and scheduling, presents a layered cloud‑native architecture built on Kubernetes, explains key components such as pipeline orchestrators, training job operators, schedulers, and topology managers, and discusses practical solutions using Argo, Kubeflow Pipelines, and the Volcano scheduler.

Laiye Technology Team

Since its founding in 2015, Laiye Technology has built AI products and accumulated large amounts of OCR and NLP data, code, and models, prompting the need to automate model fine‑tuning, evaluation, and deployment on a machine‑learning platform.

In the era of big data and large models, a deep‑learning infrastructure must provide massive compute, storage, and network resources, making a cloud‑native, highly scalable architecture essential. This series focuses on the technical challenges of machine‑learning platforms on top of Kubernetes, covering orchestration, scheduling, storage, communication, and inference. The first article addresses orchestration and scheduling.

Distributed Training Orchestration Pain Points

Different stages of model training require different resources (CPU for data processing, GPU for training); the scheduler must allocate resources automatically.

Large datasets or models need to be sharded across nodes, requiring efficient data transfer while preserving convergence.

Kubernetes' default scheduler places pods one at a time. When several distributed training jobs are admitted concurrently, each can acquire only part of its pods' resources, leading to contention and scheduling deadlocks.

Multi‑team clusters need resource isolation, priority queues, preemption, and fairness within the same priority level.

In public clouds, the scheduler must provision on‑demand or spot instances when reserved capacity is exhausted.

Scheduling should be NUMA‑aware: the GPUs, NICs, and CPU cores assigned to a pod should sit on the same NUMA node to avoid cross‑socket traffic.

Hierarchical Architecture

Based on the Kubernetes architecture, the platform can be decomposed into several layers:

Pipeline Orchestrator (e.g., Kubeflow Pipelines, Airflow, MLflow) – defines a DAG of components.

Distributed Training Framework (TensorFlow, PyTorch, Horovod) – provides parallel algorithms and communication.

Training Job Operator (TFJob, PyTorchJob, MPIJob) – translates a high‑level job into multiple pods.

Batch Job Scheduler (Volcano, default kube‑scheduler) – performs batch‑aware scheduling, fairness, and gang‑scheduling.

Topology Manager – offers NUMA‑aware placement.

Kubernetes Fundamentals

The control plane consists of etcd, kube‑apiserver, kube‑controller‑manager, and kube‑scheduler. Each node runs CNI, kubelet, CRI, CSI, and Device Plugin components. Understanding the declarative API (YAML manifests) and the controller pattern is crucial for building custom operators.

Example of a declarative Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

Controllers continuously reconcile the desired state (spec) with the actual state (status), a pattern also used by schedulers and kubelet.
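The reconcile pattern can be sketched in a few lines of Python. This is a toy model, not the real client‑go API: the function and the dict‑based state representation are illustrative only, but they capture the essential diff‑then‑act loop a controller runs.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Diff desired vs. actual replica counts and emit corrective actions,
    the way a Deployment controller converges status toward spec."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(("create", name, want - have))
        elif have > want:
            actions.append(("delete", name, have - want))
    return actions

# One reconciliation pass: the spec asks for 3 replicas but only 1 pod
# exists, so the controller decides to create 2 more.
desired = {"nginx-deployment": 3}
actual = {"nginx-deployment": 1}
print(reconcile(desired, actual))  # [('create', 'nginx-deployment', 2)]
```

A real controller runs this loop continuously, triggered by watch events from the apiserver rather than by polling.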

Argo and Kubeflow Pipelines

Argo is a CNCF workflow engine used by Kubeflow Pipelines. A simple Argo workflow demonstrates parameter passing:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: output-parameter-
spec:
  entrypoint: output-parameter
  templates:
  - name: output-parameter
    steps:
    - - name: generate-parameter
        template: whalesay
    - - name: consume-parameter
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate-parameter.outputs.parameters.hello-param}}"
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["sleep 1; echo -n hello world > /tmp/hello_world.txt"]
    outputs:
      parameters:
      - name: hello-param
        valueFrom:
          path: /tmp/hello_world.txt
  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]

Argo passes small parameters through the Workflow object stored in etcd (bounded by etcd's ~1.5 MiB object‑size limit) and larger artifacts via a sidecar that uploads output files to an artifact store such as MinIO.

Kubeflow Pipeline Enhancements

Python SDK for DAG definition.

Step caching to reuse unchanged component results.

Visualization UI for metrics and TensorBoard.

Components can be defined as Python functions, YAML specifications, or raw Kubernetes resources (CRDs).
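Step caching works by fingerprinting a component and its inputs and reusing the stored output when nothing has changed. The sketch below is a minimal stand‑alone illustration of that idea; the hashing scheme and function names are assumptions for this example, while Kubeflow Pipelines actually keys its cache on the component spec and arguments.

```python
import hashlib
import json

_cache: dict = {}  # fingerprint -> cached output

def run_step(name: str, func, inputs: dict):
    """Run a pipeline step, reusing the cached result when the step
    name and its inputs are unchanged since the last run."""
    key = hashlib.sha256(
        json.dumps({"name": name, "inputs": inputs}, sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:
        return _cache[key], True   # cache hit: skip execution entirely
    out = func(**inputs)
    _cache[key] = out
    return out, False

# The first run executes the step; the identical second run is a cache hit.
out1, hit1 = run_step("preprocess", lambda x: x * 2, {"x": 21})
out2, hit2 = run_step("preprocess", lambda x: x * 2, {"x": 21})
print(out1, hit1, out2, hit2)  # 42 False 42 True
```

In a real pipeline this saves re‑running expensive data‑preparation steps when only a downstream training component has changed.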

Distributed Training Frameworks

Data Parallelism – split data across workers.

Model Parallelism – split model layers or parameters.

Communication topologies: centralized (parameter server) vs. decentralized (Ring All‑Reduce).

Synchronization strategies: synchronous vs. asynchronous.
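The decentralized Ring All‑Reduce topology can be simulated in plain Python to show why it is bandwidth‑efficient: in 2·(N−1) steps, each worker only ever sends one chunk per step to its right neighbour, yet every worker ends with the full sum. This is a didactic simulation under that standard scatter‑reduce/allgather formulation, not any framework's actual implementation.

```python
def ring_allreduce(worker_data):
    """Simulate Ring All-Reduce: after 2*(n-1) steps every worker holds
    the element-wise sum of all workers' vectors."""
    n = len(worker_data)
    size = len(worker_data[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    c = size // n
    data = [list(d) for d in worker_data]  # each worker's working buffer

    def chunk(buf, i):
        return buf[i * c:(i + 1) * c]

    # Phase 1: scatter-reduce. Each step, every worker receives one chunk
    # from its left neighbour and adds it to its own copy of that chunk.
    for t in range(n - 1):
        snapshot = [list(d) for d in data]   # synchronous step: read pre-step state
        for r in range(n):
            i = (r - t - 1) % n              # chunk index worker r receives
            recv = chunk(snapshot[(r - 1) % n], i)
            for k in range(c):
                data[r][i * c + k] += recv[k]

    # Phase 2: allgather. Fully-reduced chunks circulate the ring and
    # overwrite stale copies until every worker has the complete result.
    for t in range(n - 1):
        snapshot = [list(d) for d in data]
        for r in range(n):
            i = (r - t) % n                  # chunk index worker r receives
            data[r][i * c:(i + 1) * c] = chunk(snapshot[(r - 1) % n], i)
    return data

# Two workers, two-element gradients: both end with the sum [11, 22].
print(ring_allreduce([[1, 2], [10, 20]]))  # [[11, 22], [11, 22]]
```

Because each worker transmits only size/n elements per step, total traffic per worker stays near 2·size regardless of cluster size, which is why Horovod's NCCL backend uses this pattern.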

Example TensorFlow 1.x parameter‑server code:

import tensorflow as tf  # TensorFlow 1.x API

# Command-line flags describing the cluster and this task's role.
FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string("ps_hosts", "", "comma-separated ps host:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "", "comma-separated worker host:port pairs")
tf.app.flags.DEFINE_string("job_name", "", "either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "index of this task within its job")

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
  server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
  if FLAGS.job_name == "ps":
    server.join()  # parameter servers block here and serve variables
  elif FLAGS.job_name == "worker":
    is_chief = (FLAGS.task_index == 0)
    with tf.device(tf.train.replica_device_setter(worker_device="/job:worker/task:%d" % FLAGS.task_index, cluster=cluster)):
      # build model here
      pass

if __name__ == "__main__":
  tf.app.run()

Launching on four machines:

# On ps0.example.com:
python trainer.py \
  --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
  --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
  --job_name=ps --task_index=0
# ... similarly for ps1, worker0, worker1

PyTorch DDP example (setup of master address and port):

import os
import torch.distributed as dist

def setup(rank, world_size):
    # Every process must agree on the rendezvous address before init.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # 'gloo' works on CPU; use 'nccl' for multi-GPU training.
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

Horovod TensorFlow example (initialization, GPU pinning, and distributed optimizer):

import tensorflow as tf
import horovod.tensorflow.keras as hvd
hvd.init()
# Pin GPU to local rank
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
# Build the input pipeline and model; `dataset` and `mnist_model` are assumed defined here
scaled_lr = 0.001 * hvd.size()
opt = tf.optimizers.Adam(scaled_lr)
opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=1, average_aggregated_gradients=True)
# Compile and train
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt, metrics=['accuracy'],
                    experimental_run_tf_function=False)
callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),  # sync initial weights from rank 0
    hvd.callbacks.MetricAverageCallback(),              # average metrics across ranks
    hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1),
]
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
verbose = 1 if hvd.rank() == 0 else 0
mnist_model.fit(dataset, steps_per_epoch=500 // hvd.size(), callbacks=callbacks, epochs=24, verbose=verbose)

Running Horovod on 16 GPUs across four machines:

horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

Training Operators and Volcano Scheduler

Training operators (e.g., TFJob) translate high‑level CRDs into the required pods and generate TF_CONFIG for each pod. They also create headless Services so pods can reach each other via stable DNS names.
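The TF_CONFIG that an operator injects can be sketched as follows. The hostname pattern here (`<job>-<role>-<index>`) is an assumption for illustration; the exact DNS naming depends on the operator version and the headless Service it creates.

```python
import json

def make_tf_config(job: str, n_ps: int, n_workers: int,
                   task_type: str, task_index: int) -> str:
    """Render the TF_CONFIG JSON that a TFJob-style operator injects into
    one pod's environment: the full cluster spec plus this pod's role."""
    cluster = {
        "ps": [f"{job}-ps-{i}:2222" for i in range(n_ps)],
        "worker": [f"{job}-worker-{i}:2222" for i in range(n_workers)],
    }
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": task_index}})

# TF_CONFIG for worker 1 of a job with 1 parameter server and 2 workers.
print(make_tf_config("mnist", 1, 2, "worker", 1))
```

Each pod thus receives the same cluster view but a different `task` entry, which is how TensorFlow decides whether a process acts as a parameter server or a worker.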

Volcano (formerly kube‑batch) provides gang‑scheduling, bin‑packing, priority/DRF, proportion, and task‑topology plugins to solve deadlocks, resource fragmentation, fairness, and affinity requirements in batch‑oriented ML workloads.

When a TFJob is submitted, the operator creates a PodGroup CRD, binds each pod to that group, and sets schedulerName: volcano. Volcano then runs a scheduling session, enqueues pending jobs, and applies plugins such as:

Gang Plugin – all‑or‑nothing scheduling to avoid deadlocks.

Binpack Plugin – selects nodes with highest resource utilization to reduce fragmentation.

Priority/DRF Plugin – ensures fair‑share across users.

Proportion Plugin – enforces team‑level quotas.

Task‑Topology Plugin – expresses pod‑level affinity/anti‑affinity.
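The gang plugin's all‑or‑nothing admission can be sketched with a toy single‑resource model. Real Volcano evaluates per‑node fits, `minMember`, and plugin ordering; this sketch only shows why refusing partial placement prevents deadlock.

```python
def gang_schedule(pod_groups, free_gpus):
    """All-or-nothing placement: a PodGroup is admitted only if every one
    of its pods can be placed; otherwise none are. This avoids the deadlock
    where two jobs each hold half of their pods and wait forever."""
    scheduled = []
    for name, demands in pod_groups:        # demands: GPUs needed per pod
        total = sum(demands)
        if total <= free_gpus:
            free_gpus -= total
            scheduled.append(name)
        # else: leave the whole group pending; no partial placement
    return scheduled

# 8 free GPUs: job-a (4 pods x 2 GPUs) fits entirely; job-b (2 pods x 2 GPUs)
# no longer fits, so none of its pods are created rather than half of them.
print(gang_schedule([("job-a", [2, 2, 2, 2]), ("job-b", [2, 2])], 8))  # ['job-a']
```

Without the gang constraint, job-b's two pods could each be placed while job-a holds the rest, leaving both jobs incomplete and all GPUs occupied.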

These mechanisms enable reliable, efficient distributed training on Kubernetes.

Conclusion

The article presented a layered cloud‑native architecture for large‑scale model training, detailed the orchestration, operator, and scheduling challenges, and introduced open‑source solutions such as Argo, Kubeflow Pipelines, and Volcano. Understanding both Kubernetes internals and distributed‑ML theory is essential for building robust MLOps systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Kubernetes · Operator · Scheduling · Distributed Training · ML Platform
Written by

Laiye Technology Team

Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.