How WorkloadSpread Enables Elastic and Topology‑Aware Deployments in Kubernetes
This article explains the background, features, and implementation details of OpenKruise's WorkloadSpread, showing how it distributes Pods across zones, node types, and CPU architectures, compares it with existing solutions, and provides concrete YAML examples for elastic and topology‑aware scheduling.
Background
Applications often need to run across multiple zones, hardware types, clusters, or cloud providers. Traditional approaches split one application into several workloads (e.g., Deployments) and then rely on manual management or deep PaaS customization to keep them coordinated.
Typical topology requirements include scattering Pods across nodes for fault tolerance, scattering them across availability zones (AZs) for resilience, and distributing them across zones in a specified proportion.
WorkloadSpread Overview
OpenKruise v0.10.0 introduced WorkloadSpread, a bypass (side-car-style) component that works alongside Deployments, ReplicaSets, and CloneSets to control Pod placement and elastic scaling without modifying the original workload spec.
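As a rough illustration of how a WorkloadSpread attaches to an existing workload, the manifest below points at a Deployment through spec.targetReference. The names workloadspread-demo and workload-demo and the zone values are placeholders, not taken from the article; the field layout follows the apps.kruise.io/v1alpha1 API and should be checked against the Kruise version installed in the cluster.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo          # placeholder name
  namespace: default                 # same namespace as the target workload
spec:
  targetReference:                   # the existing workload; its own spec stays untouched
    apiVersion: apps/v1
    kind: Deployment
    name: workload-demo              # placeholder name
  subsets:
    - name: subset-a
      maxReplicas: 3                 # at most 3 Pods are placed in this subset
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - zone-a               # placeholder zone
    - name: subset-b                 # no maxReplicas: receives the remaining Pods
      requiredNodeSelectorTerm:
        matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
              - zone-b               # placeholder zone
```

The subsets fragments in the scenarios later in this article slot into spec.subsets of a manifest shaped like this one.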
Comparison with Existing Solutions
Pod Topology Spread Constraints
Kubernetes’ PodTopologySpread spreads Pods evenly based on a topology key, but it cannot define custom partition counts or ratios and may break the distribution during scale‑down.
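For comparison, a typical constraint looks like the fragment below (the app: demo label is a placeholder). It can only bound the skew between topology domains; there is no field for a per-zone cap or ratio, and the constraint is evaluated when a Pod is scheduled, not when Pods are removed.

```yaml
# Fragment of a Pod template (for example spec.template.spec of a Deployment).
topologySpreadConstraints:
  - maxSkew: 1                               # zones may differ by at most one matching Pod
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule         # keep the Pod Pending rather than violate the skew
    labelSelector:
      matchLabels:
        app: demo                            # placeholder label
```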
UnitedDeployment
UnitedDeployment (a Kruise‑provided workload) manages multiple workloads across regions, offering strong scattering and elasticity. However, it introduces a new workload type, increasing migration cost. WorkloadSpread is a lightweight alternative that only adds a configuration to an existing workload.
Application Scenarios
1. Limit 100 replicas to a base node pool, excess to an elastic pool
subsets:
  - name: subset-normal
    maxReplicas: 100
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - normal
  - name: subset-elastic # no replica limit
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - elastic
While the workload has 100 replicas or fewer, all Pods go to the normal pool; replicas beyond 100 are placed in the elastic pool, and scale-down prefers deleting Pods from the elastic pool.
2. Prefer base pool, fall back to elastic pool when resources are insufficient
scheduleStrategy:
  type: Adaptive
  adaptive:
    rescheduleCriticalSeconds: 30
    disableSimulationSchedule: false
subsets:
  - name: subset-normal # unlimited replicas
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - normal
  - name: subset-elastic # unlimited replicas
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: app.deploy/zone
          operator: In
          values:
            - elastic
With the Adaptive strategy, Pods are first placed on the normal pool; if a Pod remains Pending for more than 30 seconds (rescheduleCriticalSeconds), the controller deletes it so that its replacement is created on the elastic pool. Scale-down also prefers the elastic pool.
3. Scatter across three zones with a 1:1:3 ratio
subsets:
  - name: subset-a
    maxReplicas: 20%
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - zone-a
  - name: subset-b
    maxReplicas: 20%
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - zone-b
  - name: subset-c
    maxReplicas: 60%
    requiredNodeSelectorTerm:
      matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - zone-c
The controller ensures that scaling respects the 1:1:3 distribution across the three zones.
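To make the percentages concrete, assume they are resolved against the target workload's spec.replicas. The Deployment below is hypothetical (name, label, and image are placeholders), and the replica count is chosen so that every percentage divides evenly, since rounding of fractional results is not covered here.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-demo            # placeholder name
spec:
  replicas: 10
  # Expected placement under the 20% / 20% / 60% caps:
  #   subset-a (zone-a): 10 * 20% = 2 Pods
  #   subset-b (zone-b): 10 * 20% = 2 Pods
  #   subset-c (zone-c): 10 * 60% = 6 Pods
  # Scaling to replicas: 20 would give 4 / 4 / 12, preserving 1:1:3.
  selector:
    matchLabels:
      app: demo                  # placeholder label
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: main
          image: nginx:1.25      # placeholder image
```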
4. Different resource quotas for distinct CPU architectures
subsets:
  - name: subset-x86-arch
    patch:
      metadata:
        labels:
          resource.cpu/arch: x86
      spec:
        containers:
          - name: main
            resources:
              limits:
                cpu: "500m"
                memory: "800Mi"
  - name: subset-arm-arch
    patch:
      metadata:
        labels:
          resource.cpu/arch: arm
      spec:
        containers:
          - name: main
            resources:
              limits:
                cpu: "300m"
                memory: "600Mi"
Each subset patches Pods with architecture-specific labels and resource limits, enabling fine-grained management of heterogeneous nodes.
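Assuming the patch is merged into the Pod when it is created, and that containers are matched by name, a Pod assigned to subset-arm-arch would end up looking roughly like the fragment below. The Pod name, the app label, and the image are placeholders standing in for whatever the workload template provides; any other fields the controller adds are omitted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload-demo-abcde        # placeholder name
  labels:
    app: demo                      # from the workload template (placeholder)
    resource.cpu/arch: arm         # added by the subset patch
spec:
  containers:
    - name: main                   # same container name as in the patch
      image: nginx:1.25            # placeholder, from the workload template
      resources:
        limits:
          cpu: "300m"              # from the subset patch
          memory: "600Mi"          # from the subset patch
```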
Implementation Principles
1. Subset Priority and Replica Control
Subsets are prioritized in the order they appear in the subsets list; higher-priority subsets are expanded first and shrunk last. To enforce the desired scale-down order, the controller adjusts the controller.kubernetes.io/pod-deletion-cost annotation, which is supported for Deployments and ReplicaSets from Kubernetes 1.21 and for CloneSets from Kruise v0.9.0.
2. Controlling Scale‑Down Order
Deletion‑cost values are set per subset: a higher cost means lower deletion priority. When the WorkloadSpread spec changes, the controller updates these costs so that Pods exceeding a subset’s limit are deleted first.
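The effect on individual Pods can be pictured with the metadata-only fragments below. The annotation key is the standard Kubernetes one; the Pod names and the numeric costs are illustrative assumptions, not the values Kruise necessarily assigns.

```yaml
# Pod in the highest-priority subset: highest cost, deleted last.
apiVersion: v1
kind: Pod
metadata:
  name: demo-normal-0              # placeholder name
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "200"   # value must be a quoted integer
---
# Pod in a lower-priority subset: lower cost, deleted earlier.
apiVersion: v1
kind: Pod
metadata:
  name: demo-elastic-0             # placeholder name
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "100"
---
# Pod that exceeds its subset's maxReplicas: lowest cost, deleted first.
apiVersion: v1
kind: Pod
metadata:
  name: demo-normal-overflow       # placeholder name
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"
```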
3. Quantity Management and Concurrency
The WorkloadSpread status tracks missingReplicas for each subset (‑1 indicates no limit). The webhook processes Pod create/delete/eviction requests, updates missingReplicas, and injects subset rules. To avoid optimistic‑lock conflicts during massive parallel Pod creation, a per‑WorkloadSpread mutex is used together with optimistic retries and caching of the latest object.
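For scenario 1 with a workload scaled to 120 replicas, the per-subset bookkeeping might look like the status fragment below. The field names (subsetStatuses, replicas, missingReplicas) are an assumption based on the v1alpha1 API and the numbers are made up for illustration; the output of kubectl get workloadspread -o yaml on a real cluster is the authoritative shape.

```yaml
status:
  subsetStatuses:
    - name: subset-normal
      replicas: 100
      missingReplicas: 0           # cap of 100 reached, no further Pods admitted here
    - name: subset-elastic
      replicas: 20
      missingReplicas: -1          # no maxReplicas set, so no limit
```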
4. Adaptive Scheduling
When scheduleStrategy.type is Adaptive, the webhook performs a simulated schedule based on current node and pod information. If a Pod cannot be placed within rescheduleCriticalSeconds, the subset is marked unschedulable for a configurable period (default 5 minutes), and the Pod is deleted to trigger recreation on another subset.
Conclusion
WorkloadSpread provides a side‑car solution that leverages existing Kubernetes mechanisms to give workloads elastic, multi‑domain deployment capabilities without altering the original workload definition. It simplifies deployment complexity, reduces cost through intelligent scaling, and is used in production environments.
References
https://github.com/openkruise/kruise
https://openkruise.io/
https://openkruise.io/zh-cn/docs/workloadspread.html
https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
https://openkruise.io/zh-cn/docs/uniteddeployment.html
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
