How LWS Enables Scalable Multi‑Node Large Model Deployment on Kubernetes
The article explains how the Dolphin AI platform tackles large‑model deployment challenges by replacing standard Kubernetes Deployments with LeaderWorkerSet, detailing its architecture, features, installation steps, example configurations, testing, scaling, rolling updates, fault recovery, and future roadmap for AI workloads.
Introduction
The Dolphin platform is Huolala's self‑developed cloud‑native AI development platform, covering data processing, image building, model development, training, deployment, and online inference. After two years of iteration it has become the company's core AI infrastructure, improving both developer productivity and compute utilization. With the rapid rise of large‑model technology, however, new challenges have emerged when deploying such models.
Challenges of Large‑Model Deployment
Dolphin currently serves models with a standard Kubernetes Deployment, where each Pod runs a complete model instance on a single GPU node. For large models that must span multiple nodes, this approach runs into three problems:
Distributed inference frameworks require a different start command per node (for example, a node rank), but a Deployment gives every Pod an identical command.
The leader node needs a stable address for the other nodes to join, yet Deployment Pods receive dynamic IPs with no stable DNS names.
The Pods of one model instance need coordinated scaling, rolling updates, and fault recovery as a single unit, which Deployment does not provide.
A different workload abstraction is therefore required.
Solution: LeaderWorkerSet (LWS)
After investigation, the team adopted LeaderWorkerSet (LWS), a CRD built on Kubernetes Headless Services and StatefulSets, to meet the multi‑node distributed deployment requirements of AI/ML workloads. By integrating LWS with large‑model inference frameworks such as vLLM, SGLang, and LMDeploy, Dolphin achieves multi‑node large‑model deployment on Kubernetes.
LWS Overview
LWS is a custom workload type that treats a group of Pods as a single unit (a PodGroup). Each Pod in the group receives a unique index and a stable, unique network identity, and all Pods in the group share a common lifecycle.
Key features:
PodGroup as a whole unit with indexed Pods.
Support for different templates for LeaderPod and WorkerPod.
Group‑level scaling, rolling updates and fault recovery.
Implementation Details
PodGroup Features
PodGroup as a whole unit: All Pods share a lifecycle and have unique indices.
Multiple templates: LeaderPod and WorkerPod can specify different containers, commands, resources, etc.
Scaling: Multiple replicas of the group can be created.
Rolling update: The entire group is updated together.
Fault recovery: if any Pod fails, the whole group is recreated, governed by restartPolicy (see the skeleton below).
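These group‑level behaviors map directly onto fields of the CRD. The following is a minimal skeleton (the name and sizes are illustrative; the full SGLang manifest later in this article fills in the Pod templates):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: my-lws                                # illustrative name
spec:
  replicas: 2                                 # number of PodGroups (group-level scaling)
  leaderWorkerTemplate:
    size: 4                                   # Pods per group: 1 leader + 3 workers
    restartPolicy: RecreateGroupOnPodRestart  # any Pod failure recreates the whole group
    leaderTemplate:                           # separate Pod template for the leader
      # ... containers, command, resources for the leader ...
    workerTemplate:                           # separate Pod template for the workers
      # ... containers, command, resources for the workers ...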
LeaderPod Fixed Network Identity
By combining a StatefulSet with a Headless Service, each Pod receives a stable DNS name ({podName}.{serviceName}.{namespace}.svc.cluster.local), ensuring the LeaderPod keeps a fixed address for Workers to connect to.
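For the SGLang example below, where the LeaderWorkerSet is named sglang, the leader of group 0 is the Pod sglang-0 (its workers follow the pattern sglang-0-1, sglang-0-2, ...). Assuming the default namespace, and given that LWS names the headless Service after the LeaderWorkerSet, the leader's stable address can be resolved from any Pod in the cluster:

nslookup sglang-0.sglang.default.svc.cluster.local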
Scaling Mechanism
When spec.replicas changes, the LWS controller scales the leader StatefulSet and creates or removes the corresponding worker StatefulSets, so scaling always happens in whole‑group units.
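Concretely, scaling is a one‑line operation. A sketch, assuming the LeaderWorkerSet from the example below (named sglang) and that the CRD's scale subresource is available, as in recent LWS releases:

# Grow from 2 groups to 3: the controller adds one leader Pod plus a whole worker StatefulSet
kubectl scale leaderworkerset/sglang --replicas=3

# Shrinking removes entire groups the same way
kubectl scale leaderworkerset/sglang --replicas=2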
Practical Steps
Install LWS
VERSION=v0.6.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
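Before creating workloads it is worth confirming the controller is running; the release manifests install it into the lws-system namespace:

kubectl get pods -n lws-system
kubectl wait deploy/lws-controller-manager -n lws-system --for=condition=Available --timeout=120s

SGLang Deployment Example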
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: sglang-leader
            image: lmsysorg/sglang:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
              # Expose the Pod's group-local index as an env var, as in the
              # upstream LWS examples; $(LWS_WORKER_INDEX) below relies on it.
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - meta-llama/Meta-Llama-3.1-8B-Instruct
              - --tp
              - "2"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --host
              - "0.0.0.0"
              - --port
              - "40000"
            resources:
              limits:
                nvidia.com/gpu: "1"
            ports:
              - containerPort: 40000
    workerTemplate:
      spec:
        containers:
          - name: sglang-worker
            image: lmsysorg/sglang:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
              - name: LWS_WORKER_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - meta-llama/Meta-Llama-3.1-8B-Instruct
              - --tp
              - "2"
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: "1"

Inference Test
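The request below targets the leader's HTTP port. For a quick test from outside the cluster, forward that port first (sglang-0 is the group‑0 leader Pod of the LeaderWorkerSet above):

kubectl port-forward pod/sglang-0 40000:40000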
curl http://localhost:40000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "What is the meaning of life?"
  }'

Scaling and Rolling Update
Scale by adjusting spec.replicas, which sets the number of PodGroups. Rolling updates are controlled via rolloutStrategy, whose maxUnavailable and maxSurge fields are counted in whole groups rather than individual Pods:
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 2
      maxSurge: 2
  replicas: 4

Fault Recovery
The LeaderPod acts as the health indicator for the whole group: its readiness probe gates traffic to the group, and because the example sets restartPolicy: RecreateGroupOnPodRestart, a failed leader causes the entire PodGroup to be recreated.
readinessProbe:
  tcpSocket:
    port: 40000
  initialDelaySeconds: 15
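A fixed 15‑second delay on a TCP check can be tight for large models, which may take minutes to load weights. One common complement (plain Kubernetes, not LWS‑specific) is a startupProbe that tolerates a long warm‑up before the readiness probe takes over:

startupProbe:
  tcpSocket:
    port: 40000
  periodSeconds: 10
  failureThreshold: 60   # tolerate up to ~10 minutes of model loading

Future Plans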
The platform will further enhance large‑model capabilities by building a model marketplace, adding distributed fine‑tuning pipelines with RDMA and high‑performance storage, optimizing compute resource scheduling, and constructing a multi‑region, multi‑cluster architecture for higher availability.