
DGL Operator: A Kubernetes‑Native Solution for Distributed Graph Neural Network Training

This article introduces DGL Operator, an open-source Kubernetes-native controller that automates the lifecycle of distributed graph neural network training with DGL. It explains the project's terminology, the challenges of distributing native DGL training, and the operator's architecture and workflow, with YAML and CLI examples for easy deployment.

360 Tech Engineering

Introduction – DGL Operator, developed by the 360 AI Platform team, is an open-source Kubernetes-native controller that manages the full lifecycle of distributed training for DGL (Deep Graph Library) graph neural networks. The project is hosted on GitHub: https://github.com/Qihoo360/dgl-operator.

Terminology – The article defines key concepts such as Workload (a logical unit of work), Job (training task), Pod, initContainer, Worker Pod, Partitioner Pod, Launcher Pod, ipconfig, kubexec.sh, single-machine partitioning, and distributed partitioning.

Background and Challenges – While DGL provides powerful GNN APIs, industrial-scale training (tens of millions to billions of nodes/edges) faces several challenges: manually provisioning many machines, setting up SSH trust between them, partitioning the graph by hand, triggering training scripts manually, and cleaning up resources afterward.

DGL Operator Solution – By leveraging Kubernetes controllers, DGL Operator automates environment provisioning, ipconfig generation, graph partitioning, distributed training execution, and resource release, turning the entire process into a declarative workflow.

Kubernetes and Operator Basics – Kubernetes automates container deployment, scaling, and management. An Operator extends Kubernetes with custom resources and controllers to manage stateful applications, providing a feedback loop that reconciles desired and actual states.
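The reconciliation feedback loop at the heart of any operator can be sketched in a few lines. The snippet below is a simplified, hypothetical Python illustration of the pattern only, not DGL Operator's actual Go implementation:

```python
# Simplified sketch of an operator's reconcile step: diff the desired state
# (declared in a custom resource) against the actual cluster state and emit
# the actions needed to converge. Hypothetical illustration only.

def reconcile(desired: dict, actual: dict) -> list:
    """Return (verb, name) actions that move `actual` toward `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

# A DGLJob declaring one Launcher and two Workers, with only the Launcher
# currently running, yields a single "create" action for the Workers:
desired = {"launcher": {"replicas": 1}, "worker": {"replicas": 2}}
actual = {"launcher": {"replicas": 1}}
print(reconcile(desired, actual))  # [('create', 'worker')]
```

The controller re-runs this comparison whenever the cluster or the custom resource changes, which is what makes the workflow declarative rather than scripted.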

How to Use DGL Operator – Users submit a DGLJob custom resource via a YAML file to an existing Kubernetes cluster. The Operator creates the necessary ConfigMap, RBAC resources, initContainers, and Pods (Launcher, Partitioner, Workers) to run the distributed training.
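In practice this is a standard kubectl workflow. The commands below are a sketch assuming the operator is already installed in the cluster and the DGLJob manifest is saved locally; the file name is illustrative:

```shell
# Submit the DGLJob custom resource (manifest file name is illustrative)
kubectl apply -f dgl-graphsage.yaml

# Watch the operator create the Launcher, Partitioner, and Worker Pods
kubectl -n dgl-operator get pods -w

# Follow the distributed training log from the Launcher Pod
kubectl -n dgl-operator logs -f dgl-graphsage-launcher
```
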

API Example – DGLJob YAML

apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Running
  partitionMode: DGL-API
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: dgloperator/graphsage:v0.1.0
            name: dgl-graphsage
            command:
            - dglrun
            args:
            - --graph-name
            - graphsage
            - --partition-entry-point
            - code/load_and_partition_graph.py
            - --num-partitions
            - "2"
            - --train-entry-point
            - code/train_dist.py
            - --num-epochs
            - "1"
            - --batch-size
            - "1000"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: dgloperator/graphsage:v0.1.0
            name: dgl-graphsage

Generated Launcher Pod Definition

kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-launcher
spec:
  volumes:
    - name: kube-volume
      emptyDir: {}
    - name: dataset-volume
      emptyDir: {}
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
        - key: kubexec.sh
          path: kubexec.sh
          mode: 365
        - key: hostfile
          path: hostfile
          mode: 292
        - key: partfile
          path: partfile
          mode: 292
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
    - name: watcher-loop-partitioner
      image: 'dgloperator/watcher-loop:v0.1.0'
      env:
        - name: WATCHERFILE
          value: /etc/dgl/partfile
        - name: WATCHERMODE
          value: finished
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - dglrun
      args:
        - '--graph-name'
        - graphsage
        - '--partition-entry-point'
        - code/load_and_partition_graph.py
        - '--num-partitions'
        - '2'
        - '--balance-train'
        - '--balance-edges'
        - '--train-entry-point'
        - code/train_dist.py
        - '--num-epochs'
        - '1'
        - '--batch-size'
        - '1000'
        - '--num-trainers'
        - '1'
        - '--num-samplers'
        - '4'
        - '--num-servers'
        - '1'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
        - name: config-volume
          mountPath: /etc/dgl
        - name: dataset-volume
          mountPath: /dgl_workspace/dataset
      imagePullPolicy: Always
  restartPolicy: Never

Generated Partitioner Pod Definition

kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-partitioner
spec:
  volumes:
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
        - key: kubexec.sh
          path: kubexec.sh
          mode: 365
        - key: hostfile
          path: hostfile
          mode: 292
        - key: partfile
          path: partfile
          mode: 292
        - key: leadfile
          path: leadfile
          mode: 292
    - name: kube-volume
      emptyDir: {}
  initContainers:
    - name: kubectl-download
      image: 'dgloperator/kubectl-download:v0.1.0'
      volumeMounts:
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      env:
        - name: DGL_OPERATOR_PHASE_ENV
          value: Partitioner
      volumeMounts:
        - name: config-volume
          mountPath: /etc/dgl
        - name: kube-volume
          mountPath: /opt/kube
      imagePullPolicy: Always
  restartPolicy: Never

Generated Worker Pod Definition

kind: Pod
apiVersion: v1
metadata:
  name: dgl-graphsage-worker-0
spec:
  volumes:
    - name: shm-volume
      emptyDir:
        medium: Memory
        sizeLimit: 10G
    - name: config-volume
      configMap:
        name: dgl-graphsage-config
        items:
        - key: kubexec.sh
          path: kubexec.sh
          mode: 365
        - key: hostfile
          path: hostfile
          mode: 292
        - key: partfile
          path: partfile
          mode: 292
        - key: leadfile
          path: leadfile
          mode: 292
  containers:
    - name: dgl-graphsage
      image: 'dgloperator/graphsage:v0.1.0'
      command:
        - sleep
      args:
        - 365d
      ports:
        - name: dglserver
          containerPort: 30050
          protocol: TCP
      volumeMounts:
        - name: shm-volume
          mountPath: /dev/shm
        - name: config-volume
          mountPath: /etc/dgl
      imagePullPolicy: Always

Architecture and Workflow – The operator implements two layered workflows: the Operator side (creating ConfigMaps, RBAC, initContainers, Pods, and orchestrating the overall job) and the dglrun side (handling graph partitioning, data transfer, and distributed training). Detailed step‑by‑step sequences for both single‑machine and distributed partitioning scenarios are described, accompanied by diagrams.
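One concrete piece of that orchestration is visible in the generated Launcher Pod above: the watcher-loop-partitioner initContainer blocks until the shared partfile reports partitioning as finished, gating the start of training. Its behavior can be sketched as follows; this is a simplified Python stand-in for the actual watcher image:

```python
# Sketch of the watcher-loop initContainer's job: poll the file named by the
# WATCHERFILE env var until it contains the WATCHERMODE status string, so the
# Launcher's main container starts only after partitioning completes.
# Simplified, hypothetical stand-in for the actual watcher image.
import os
import time


def wait_for_status(path: str, expected: str, poll_seconds: float = 1.0) -> None:
    """Block until the file at `path` exists and contains `expected`."""
    while True:
        if os.path.exists(path):
            with open(path) as f:
                if expected in f.read():
                    return
        time.sleep(poll_seconds)


# In the Pod this would be called with WATCHERFILE=/etc/dgl/partfile and
# WATCHERMODE=finished, matching the generated Launcher definition above.
```

Using an initContainer for this gate means ordering between the Partitioner and the training containers is enforced by Kubernetes itself, with no SSH or manual coordination.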

Conclusion – By integrating DGL training into the Kubernetes ecosystem, DGL Operator automates configuration generation, graph partitioning, distributed execution, and resource cleanup, embodying MLOps principles. It follows the broader trend of ML‑on‑Kubernetes operators (e.g., TF‑Operator, PyTorch‑Operator) and invites community contributions.

Tags: AI, Kubernetes, MLOps, Operator, Graph Neural Networks, Distributed Training, DGL
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
