
DGL Operator: A Kubernetes Native Controller for Distributed Graph Neural Network Training

DGL Operator is an open‑source Kubernetes controller that automates the lifecycle of distributed graph neural network training by handling configuration generation, graph partitioning, training execution, and resource cleanup, providing a cloud‑native solution for large‑scale GNN workloads.

360 Smart Cloud

DGL Operator is an open‑source training controller developed by the 360 Intelligent Engineering AI Platform team. It is built on the Kubernetes cloud‑native stack and on DGL, the graph neural network framework from AWS. The project is hosted on GitHub.

Terminology

Workload: a logical concept equivalent to a Pod; the unit that performs work (graph partitioning and training).

Job: a set of workloads (Pods) that together constitute one DGL training lifecycle.

Pod: the smallest schedulable unit in Kubernetes; may contain multiple containers.

initContainer: a container that runs to completion before the main containers start, used for pre‑execution tasks.

Worker Pod: a Pod that performs the actual distributed training.

Partitioner Pod: a Pod that performs graph partitioning.

Launcher Pod: a lightweight Pod that coordinates the DGLJob: it generates configuration, triggers partitioning and training, and releases resources.

dglrun: the workflow script inside the container that drives DGL partitioning and training.

ipconfig: a configuration file listing each worker's name, IP, port and GPU requirements.

kubexec.sh: a helper script for remote command execution in Kubernetes.

Single‑machine partition: graph partitioning performed in a single Partitioner Pod.

Distributed partition: experimental ParMETIS‑based partitioning across multiple Worker Pods.
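
To make the ipconfig idea concrete, here is a minimal sketch of generating such a file from worker metadata. This is purely illustrative: the exact format DGL expects depends on the DGL version, and the helper name and schema below are invented for the example (here we assume one "ip port" pair per line, one line per worker).

```python
# Illustrative sketch only: render an ipconfig-style file from worker
# metadata. Assumes one "ip port" pair per line, one line per worker;
# the real format depends on the DGL version in use.
def render_ipconfig(workers):
    """workers: list of dicts with 'ip' and 'port' keys (hypothetical schema)."""
    return "\n".join(f"{w['ip']} {w['port']}" for w in workers) + "\n"

workers = [
    {"name": "dgl-graphsage-worker-0", "ip": "10.0.0.10", "port": 30050},
    {"name": "dgl-graphsage-worker-1", "ip": "10.0.0.11", "port": 30050},
]
print(render_ipconfig(workers), end="")
# 10.0.0.10 30050
# 10.0.0.11 30050
```

The Operator builds this file automatically from Pod IPs, which is exactly the manual step that native distributed DGL requires of the user.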

Background

In recent years graph neural networks (GNNs) have achieved great success in search, recommendation and knowledge‑graph applications, but building GNN models quickly remains difficult. While deep‑learning frameworks such as TensorFlow, PyTorch and MXNet provide many ready‑to‑use APIs for CNNs and RNNs, they lack convenient GNN APIs. DGL, co‑developed by NYU and Amazon, fills this gap by offering graph‑centric APIs and performance optimizations.

Industrial workloads often involve graphs with tens of millions to billions of nodes and edges, prompting Amazon to release a distributed training mode for DGL in 2020.

Challenges of native DGL distributed training

Requires pre‑provisioning of many physical machines or VMs and manual collection of IPs and ports into an ipconfig file.

SSH key‑less trust must be set up across all machines.

Graph partitioning must be manually triggered on the master node; using distributed partitioning also requires installing ParMETIS.

Training must be manually started from the master node; the process cannot be fully automated.

After training, resources must be released manually.

DGL Operator solution

DGL Operator addresses the above challenges by implementing a Kubernetes controller that manages the entire DGL training lifecycle. When a user submits a DGLJob, the Operator automatically schedules suitable hosts, creates Pods with the required images, generates the ipconfig file, performs graph partitioning, triggers fully distributed training, persists model outputs and finally releases compute and storage resources.

Kubernetes and Operator basics

Kubernetes automates deployment, scaling and management of containerized applications. An Operator extends Kubernetes with custom resources and controllers to manage stateful workloads. The Operator watches the desired state defined in a custom resource (DGLJob) and reconciles the actual cluster state accordingly.
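
The reconcile pattern at the heart of any such controller can be sketched in a few lines (purely illustrative; the real Operator works against the Kubernetes API, and the names below are invented for the example):

```python
# Minimal sketch of the reconcile pattern: compare desired state (derived
# from the DGLJob custom resource) against actual cluster state and emit
# corrective actions. Names are invented for illustration.
def reconcile(desired_pods, actual_pods):
    to_create = sorted(set(desired_pods) - set(actual_pods))
    to_delete = sorted(set(actual_pods) - set(desired_pods))
    return to_create, to_delete

desired = ["launcher", "worker-0", "worker-1"]
actual = ["worker-0", "stale-partitioner"]
create, delete = reconcile(desired, actual)
print(create)  # ['launcher', 'worker-1']
print(delete)  # ['stale-partitioner']
```

The controller runs this comparison continuously, so a crashed Pod is recreated and leftover Pods are garbage‑collected without user intervention.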

Using DGL Operator

Users only need to submit a YAML that defines a DGLJob custom resource. The Operator creates the necessary ConfigMap, RBAC objects, initContainers (e.g., kubectl‑download, watcher‑loop), Launcher, Partitioner and Worker Pods. The user’s Docker image must contain DGL, its dependencies and the dglrun script.

API example

apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Running
  partitionMode: DGL-API
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: dgloperator/graphsage:v0.1.0
            name: dgl-graphsage
            command:
            - dglrun
            args:
            - --graph-name
            - graphsage
            # partition arguments
            - --partition-entry-point
            - code/load_and_partition_graph.py
            - --num-partitions
            - "2"
            # training arguments
            - --train-entry-point
            - code/train_dist.py
            - --num-epochs
            - "1"
            - --batch-size
            - "1000"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: dgloperator/graphsage:v0.1.0
            name: dgl-graphsage

Key fields:

cleanPodPolicy can be Running, None or All, controlling which Pods are deleted after the job completes.

Launcher and Worker follow the standard PodTemplateSpec definition.

partitionMode can be ParMETIS or DGL-API; the latter uses DGL’s native dgl.distributed.partition_graph API.
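
The cleanPodPolicy semantics can be illustrated with a small sketch (a hypothetical helper, not the Operator's actual code): Running deletes only Pods still running when the job finishes, All deletes every Pod of the job, and None keeps everything.

```python
# Illustrative sketch of cleanPodPolicy semantics (hypothetical helper).
# Running: delete only Pods still running after the job finishes.
# All: delete every Pod of the job.  None: keep everything.
def pods_to_clean(policy, pods):
    """pods: list of (name, phase) tuples."""
    if policy == "None":
        return []
    if policy == "All":
        return [name for name, _ in pods]
    if policy == "Running":
        return [name for name, phase in pods if phase == "Running"]
    raise ValueError(f"unknown cleanPodPolicy: {policy}")

pods = [("worker-0", "Running"), ("worker-1", "Succeeded")]
print(pods_to_clean("Running", pods))  # ['worker-0']
```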

Generated Launcher Pod definition

kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-launcher
spec:
  volumes:
    - name: kube-volume
      emptyDir: {}
    - name: dataset-volume
      emptyDir: {}
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
    - name: watcher-loop-partitioner
      image: 'dgloperator/watcher-loop:v0.1.0'
      env:
        - name: WATCHERFILE
          value: /etc/dgl/partfile
        - name: WATCHERMODE
          value: finished
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
        - name: dataset-volume
          mountPath: /dgl_workspace/dataset
      imagePullPolicy: Always
    - name: watcher-loop-worker
      image: 'dgloperator/watcher-loop:v0.1.0'
      env:
        - name: WATCHERFILE
          value: /etc/dgl/hostfile
        - name: WATCHERMODE
          value: ready
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
      imagePullPolicy: Always
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - dglrun
      args:
        - '--graph-name'
        - graphsage
        - '--partition-entry-point'
        - code/load_and_partition_graph.py
        - '--num-partitions'
        - '2'
        - '--balance-train'
        - '--balance-edges'
        - '--train-entry-point'
        - code/train_dist.py
        - '--num-epochs'
        - '1'
        - '--batch-size'
        - '1000'
        - '--num-trainers'
        - '1'
        - '--num-samplers'
        - '4'
        - '--num-servers'
        - '1'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
        - name: config-volume
          mountPath: /etc/dgl
        - name: dataset-volume
          mountPath: /dgl_workspace/dataset
      imagePullPolicy: Always
  restartPolicy: Never

Generated Partitioner and Worker Pod definitions

kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-partitioner
spec:
  volumes:
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
          - key: leadfile
            path: leadfile
            mode: 292
    - name: kube-volume
      emptyDir: {}
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      env:
        - name: DGL_OPERATOR_PHASE_ENV
          value: Partitioner
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  restartPolicy: Never
---
kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-worker-0
spec:
  volumes:
    - name: shm-volume
      emptyDir:
        medium: Memory
        sizeLimit: 10G
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
          - key: kubexec.sh
            path: kubexec.sh
            mode: 365
          - key: hostfile
            path: hostfile
            mode: 292
          - key: partfile
            path: partfile
            mode: 292
          - key: leadfile
            path: leadfile
            mode: 292
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - sleep
      args:
        - 365d
      ports:
        - name: dglserver
          containerPort: 30050
          protocol: TCP
      volumeMounts:
        - name: shm-volume
          mountPath: /dev/shm
        - name: config-volume
          mountPath: /etc/dgl
      imagePullPolicy: Always
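
kubexec.sh is what replaces SSH key‑less trust: the Launcher reaches into Partitioner and Worker Pods through kubectl exec, using the kubectl binary that the kubectl-download initContainer placed under /opt/kube. A sketch of the idea (the real script's contents are not shown here; the helper name is invented):

```python
# Sketch of the kubexec.sh idea: instead of ssh, remote commands are executed
# in a target Pod via `kubectl exec`. This helper only builds the argv; it
# does not execute anything. The kubectl path matches the kube-volume mount.
def kubexec_argv(pod, command, kubectl="/opt/kube/kubectl"):
    return [kubectl, "exec", pod, "--", "/bin/sh", "-c", command]

print(kubexec_argv("dgl-graphsage-worker-0", "hostname"))
```

This is why the Worker Pods merely run sleep 365d as their main command: they are passive execution targets, driven remotely by the Launcher.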

Architecture and workflow design

The Operator side workflow creates ConfigMaps, RBAC resources, initContainers, Partitioner and Worker Pods, watches their status and finally runs the Launcher container that executes dglrun . The dglrun side workflow handles graph partitioning (single‑machine or distributed via ParMETIS) and then launches the distributed training command. After training completes, the Operator cleans up all Pods and releases resources.

Conclusion

Since Kubernetes provides image reuse, isolation, declarative resource definitions and extensible scheduling, it has become the backbone for large‑scale machine‑learning systems. DGL Operator leverages this ecosystem to automate ipconfig generation, graph partitioning, distributed training execution and resource cleanup, embodying MLOps principles. It joins other Kubeflow operators (TensorFlow, PyTorch, MPI) and offers the community a simple tool for large‑scale GNN training in cloud‑native environments.

For more information and to try it out, visit the GitHub repository: https://github.com/Qihoo360/dgl-operator .

Tags: Cloud Native, AI, Kubernetes, Operator, Graph Neural Networks, distributed training, DGL
Written by

360 Smart Cloud

Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.
