How LWS Enables Scalable Multi‑Node Large Model Deployment on Kubernetes
The article explains how the Dolphin AI platform tackles large‑model deployment challenges by replacing standard Kubernetes Deployments with LeaderWorkerSet, detailing its architecture, features, installation steps, example configurations, testing, scaling, rolling updates, fault recovery, and future roadmap for AI workloads.
Introduction
The Dolphin platform is Huolala's self‑developed cloud‑native AI development platform, covering data processing, image building, model development, training, deployment, and online inference. After two years of iteration it has become the company's core AI infrastructure, improving both developer productivity and compute utilization. With the rapid rise of large‑model technology, however, new challenges have emerged when deploying such models.
Challenges of Large‑Model Deployment
Dolphin currently serves models with a standard Kubernetes Deployment, where each Pod runs a complete model instance on a single GPU node. For large models that must span multiple nodes, this approach runs into three problems:
Distributed inference frameworks require a different start command per node (for example, a node rank), but a Deployment gives every Pod an identical command.
The leader node needs a stable address for the other nodes to join, yet Deployment Pods receive dynamic IPs with no stable DNS names.
The Pods of one model instance need coordinated scaling, rolling updates, and fault recovery as a single unit, which Deployment does not provide.
A different workload abstraction is therefore required.
Solution: LeaderWorkerSet (LWS)
After investigation, the team adopted LeaderWorkerSet (LWS), a CRD built on Kubernetes Headless Services and StatefulSets, to meet the multi‑node distributed deployment requirements of AI/ML workloads. By integrating LWS with large‑model inference frameworks such as vLLM, SGLang, and LMDeploy, Dolphin achieves multi‑node large‑model deployment on Kubernetes.
LWS Overview
LWS is a custom workload type that treats a group of Pods as a single unit (a PodGroup). Each Pod in the group receives a unique index and a stable, unique network identity, and all Pods in the group share a common lifecycle.
Key features:
PodGroup as a whole unit with indexed Pods.
Support for different templates for LeaderPod and WorkerPod.
Group‑level scaling, rolling updates and fault recovery.
Implementation Details
PodGroup Features
PodGroup as a whole unit: All Pods share a lifecycle and have unique indices.
Multiple templates: LeaderPod and WorkerPod can specify different containers, commands, resources, etc.
Scaling: Multiple replicas of the group can be created.
Rolling update: The entire group is updated together.
Fault recovery: if any Pod fails, the whole group is recreated, governed by restartPolicy (see the skeleton below).
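These group‑level behaviors map directly onto fields of the CRD. The following is a minimal skeleton (the name and sizes are illustrative; the full SGLang manifest later in this article fills in the Pod templates):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: my-lws                                # illustrative name
spec:
  replicas: 2                                 # number of PodGroups (group-level scaling)
  leaderWorkerTemplate:
    size: 4                                   # Pods per group: 1 leader + 3 workers
    restartPolicy: RecreateGroupOnPodRestart  # any Pod failure recreates the whole group
    leaderTemplate:                           # separate Pod template for the leader
      # ... containers, command, resources for the leader ...
    workerTemplate:                           # separate Pod template for the workers
      # ... containers, command, resources for the workers ...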
LeaderPod Fixed Network Identity
By combining a StatefulSet with a Headless Service, each Pod receives a stable DNS name ({podName}.{serviceName}.{namespace}.svc.cluster.local), ensuring the LeaderPod keeps a fixed address for Workers to connect to.
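For the SGLang example below, where the LeaderWorkerSet is named sglang, the leader of group 0 is the Pod sglang-0 (its workers follow the pattern sglang-0-1, sglang-0-2, ...). Assuming the default namespace, and given that LWS names the headless Service after the LeaderWorkerSet, the leader's stable address can be resolved from any Pod in the cluster:

nslookup sglang-0.sglang.default.svc.cluster.local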
Scaling Mechanism
When spec.replicas changes, the LWS controller scales the leader StatefulSet and creates or removes the corresponding worker StatefulSets, so scaling always happens in whole‑group units.
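Concretely, scaling is a one‑line operation. A sketch, assuming the LeaderWorkerSet from the example below (named sglang) and that the CRD's scale subresource is available, as in recent LWS releases:

# Grow from 2 groups to 3: the controller adds one leader Pod plus a whole worker StatefulSet
kubectl scale leaderworkerset/sglang --replicas=3

# Shrinking removes entire groups the same way
kubectl scale leaderworkerset/sglang --replicas=2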
Practical Steps
Install LWS
VERSION=v0.6.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
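Before creating workloads it is worth confirming the controller is running; the release manifests install it into the lws-system namespace:

kubectl get pods -n lws-system
kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=Available --timeout=120s

SGLang Deployment Example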
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: sglang-leader
            image: lmsysorg/sglang:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
              # Expose the Pod's group-local index as an env var, as in the
              # upstream LWS examples; $(LWS_WORKER_INDEX) below relies on it.
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - meta-llama/Meta-Llama-3.1-8B-Instruct
              - --tp
              - "2"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --host
              - "0.0.0.0"
              - --port
              - "40000"
            resources:
              limits:
                nvidia.com/gpu: "1"
            ports:
              - containerPort: 40000
    workerTemplate:
      spec:
        containers:
          - name: sglang-worker
            image: lmsysorg/sglang:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - meta-llama/Meta-Llama-3.1-8B-Instruct
              - --tp
              - "2"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: "1"

Inference Test
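The request below targets the leader's HTTP port. For a quick test from outside the cluster, forward that port first (sglang-0 is the group‑0 leader Pod of the LeaderWorkerSet above):

kubectl port-forward pod/sglang-0 40000:40000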
curl http://localhost:40000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "What is the meaning of life?"
  }'

Scaling and Rolling Update
Scale by adjusting spec.replicas, which sets the number of PodGroups. Rolling updates are controlled via rolloutStrategy, whose maxUnavailable and maxSurge fields are counted in whole groups rather than individual Pods:
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 2
      maxSurge: 2
  replicas: 4

Fault Recovery
The LeaderPod acts as the health indicator for the whole group: its readiness probe gates traffic to the group, and because the example sets restartPolicy: RecreateGroupOnPodRestart, a failed leader causes the entire PodGroup to be recreated.
readinessProbe:
  tcpSocket:
    port: 40000
  initialDelaySeconds: 15
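A fixed 15‑second delay on a TCP check can be tight for large models, which may take minutes to load weights. One common complement (plain Kubernetes, not LWS‑specific) is a startupProbe that tolerates a long warm‑up before the readiness probe takes over:

startupProbe:
  tcpSocket:
    port: 40000
  periodSeconds: 10
  failureThreshold: 60   # tolerate up to ~10 minutes of model loading

Future Plans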
The platform will further enhance large‑model capabilities by building a model marketplace, adding distributed fine‑tuning pipelines with RDMA and high‑performance storage, optimizing compute resource scheduling, and constructing a multi‑region, multi‑cluster architecture for higher availability.