
How WorkloadSpread Enables Elastic and Topology‑Aware Deployments in Kubernetes

This article explains the background, features, and implementation details of OpenKruise's WorkloadSpread, showing how it distributes Pods across zones, node types, and CPU architectures, compares it with existing solutions, and provides concrete YAML examples for elastic and topology‑aware scheduling.

Alibaba Cloud Native

Background

Applications often need to run across multiple zones, hardware types, clusters, or cloud providers. Traditional approaches split one application into several workloads (e.g., Deployments), which then have to be managed manually or through deep PaaS‑level customization.

Typical topology requirements include spreading Pods across nodes for fault tolerance, spreading them across availability zones (AZs) for resilience, and distributing them across zones in a specified proportion.

WorkloadSpread Overview

OpenKruise v0.10.0 introduced WorkloadSpread, a side‑car‑style controller that works alongside Deployments, ReplicaSets, and CloneSets to control Pod placement and elastic scaling without modifying the original workload spec.

Comparison with Existing Solutions

Pod Topology Spread Constraints

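Kubernetes’ Pod Topology Spread Constraints spread Pods evenly across the domains of a topology key, but they cannot express custom per‑partition replica counts or ratios. Because the constraints are evaluated only when a Pod is scheduled, a scale‑down can leave the remaining Pods unevenly distributed.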

UnitedDeployment

UnitedDeployment (a Kruise‑provided workload) manages multiple workloads across regions, offering strong scattering and elasticity. However, it introduces a new workload type, increasing migration cost. WorkloadSpread is a lightweight alternative that only adds a configuration to an existing workload.

Application Scenarios

1. Keep up to 100 replicas in a base node pool and send the excess to an elastic pool

subsets:
- name: subset-normal
  maxReplicas: 100
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic # no replica limit
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

If the workload has 100 or fewer replicas, all Pods go to the normal pool; any replicas beyond 100 are placed in the elastic pool. Scale‑down deletes Pods from the elastic pool first.
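The fill order described above can be illustrated with a small Python sketch. It is not Kruise code: the function name distribute and the use of None for "no replica limit" are assumptions made for illustration, and the sketch only models the arithmetic of filling subsets in their declared order.

```python
from typing import Optional

def distribute(replicas: int, subsets: list[tuple[str, Optional[int]]]) -> dict[str, int]:
    """Fill subsets in declared order; a limit of None means no replica limit."""
    placed: dict[str, int] = {}
    remaining = replicas
    for name, max_replicas in subsets:
        # Take everything that is left, capped by the subset's limit if it has one.
        take = remaining if max_replicas is None else min(remaining, max_replicas)
        placed[name] = take
        remaining -= take
    return placed

subsets = [("subset-normal", 100), ("subset-elastic", None)]
print(distribute(80, subsets))   # {'subset-normal': 80, 'subset-elastic': 0}
print(distribute(130, subsets))  # {'subset-normal': 100, 'subset-elastic': 30}
```

Scaling from 130 back down to 80 replicas reverses the same arithmetic: the 30 elastic Pods go first, then 20 from the normal pool.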

2. Prefer base pool, fall back to elastic pool when resources are insufficient

scheduleStrategy:
  type: Adaptive
  adaptive:
    rescheduleCriticalSeconds: 30
    disableSimulationSchedule: false
subsets:
- name: subset-normal # unlimited replicas
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - normal
- name: subset-elastic # unlimited replicas
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: app.deploy/zone
      operator: In
      values:
      - elastic

With the Adaptive strategy, Pods are first placed in the normal pool; if a Pod remains Pending for more than 30 seconds (rescheduleCriticalSeconds), the controller deletes it so that its replacement is created in the elastic pool. Scale‑down also removes Pods from the elastic pool first.

3. Scatter across three zones with a 1:1:3 ratio

subsets:
- name: subset-a
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-a
- name: subset-b
  maxReplicas: 20%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-b
- name: subset-c
  maxReplicas: 60%
  requiredNodeSelectorTerm:
    matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - zone-c

Because maxReplicas is given as 20%, 20%, and 60% of the workload's replicas, the controller keeps the 1:1:3 distribution across the three zones as the workload scales.
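To make the percentage arithmetic concrete, here is a minimal Python sketch that turns a maxReplicas value into an absolute Pod count. The helper name resolve_max_replicas is hypothetical, and rounding fractional results up is an assumption of this sketch rather than a statement about how the controller rounds.

```python
def resolve_max_replicas(value, total_replicas: int) -> int:
    """Turn a maxReplicas value (an int or a string such as '20%') into a Pod count."""
    if isinstance(value, int):
        return value
    percent = int(value.rstrip("%"))
    # Integer ceiling division; rounding up is an assumption of this sketch.
    return (total_replicas * percent + 99) // 100

limits = {"subset-a": "20%", "subset-b": "20%", "subset-c": "60%"}
for total in (10, 50):
    print(total, {name: resolve_max_replicas(v, total) for name, v in limits.items()})
# 10 {'subset-a': 2, 'subset-b': 2, 'subset-c': 6}
# 50 {'subset-a': 10, 'subset-b': 10, 'subset-c': 30}
```

When the replica count does not divide cleanly (for example 7 replicas), the rounded limits can add up to more than the total, so the exact split then depends on the order in which subsets are filled.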

4. Different resource quotas for distinct CPU architectures

subsets:
- name: subset-x86-arch
  patch:
    metadata:
      labels:
        resource.cpu/arch: x86
    spec:
      containers:
      - name: main
        resources:
          limits:
            cpu: "500m"
            memory: "800Mi"
- name: subset-arm-arch
  patch:
    metadata:
      labels:
        resource.cpu/arch: arm
    spec:
      containers:
      - name: main
        resources:
          limits:
            cpu: "300m"
            memory: "600Mi"

Each subset patches Pods with architecture‑specific labels and resource limits, enabling fine‑grained management of heterogeneous nodes.
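The effect of a subset patch on a Pod can be sketched in Python as follows. This is a simplified, strategic-merge-style helper written for illustration, not the Kubernetes patching code: it assumes dictionaries merge recursively and lists of named items (such as containers) merge by their name field. The base Pod, its app label, and the nginx image are hypothetical.

```python
import copy

def merge_patch(base, patch):
    """Simplified merge: dicts merge recursively, lists of named items merge
    by their "name" field, and any other value is replaced by the patch."""
    if isinstance(base, dict) and isinstance(patch, dict):
        out = copy.deepcopy(base)
        for key, value in patch.items():
            out[key] = merge_patch(base[key], value) if key in base else copy.deepcopy(value)
        return out
    if (isinstance(base, list) and isinstance(patch, list)
            and all(isinstance(i, dict) and "name" in i for i in base + patch)):
        out = copy.deepcopy(base)
        index = {item["name"]: pos for pos, item in enumerate(out)}
        for item in patch:
            if item["name"] in index:
                pos = index[item["name"]]
                out[pos] = merge_patch(out[pos], item)
            else:
                out.append(copy.deepcopy(item))
        return out
    return copy.deepcopy(patch)

# Hypothetical Pod produced by the workload template.
pod = {
    "metadata": {"labels": {"app": "web"}},
    "spec": {"containers": [{"name": "main", "image": "nginx",
                             "resources": {"limits": {"cpu": "1"}}}]},
}
# The patch of subset-arm-arch from the example above.
arm_patch = {
    "metadata": {"labels": {"resource.cpu/arch": "arm"}},
    "spec": {"containers": [{"name": "main",
                             "resources": {"limits": {"cpu": "300m", "memory": "600Mi"}}}]},
}
patched = merge_patch(pod, arm_patch)
print(patched["metadata"]["labels"])
print(patched["spec"]["containers"][0])
```

The patched Pod keeps its original label and image, gains the architecture label, and has its CPU and memory limits replaced by the subset's values.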

Implementation Principles

1. Subset Priority and Replica Control

Subsets are ordered by their position in the list: higher‑priority subsets are expanded first and shrunk last. To enforce the scale‑down order, the controller sets the controller.kubernetes.io/pod-deletion-cost annotation on Pods. This annotation is honored by ReplicaSets (and therefore Deployments) from Kubernetes 1.21 and by CloneSets from Kruise v0.9.0.

2. Controlling Scale‑Down Order

Deletion‑cost values are set per subset: a higher cost means lower deletion priority. When the WorkloadSpread spec changes, the controller updates these costs so that Pods exceeding a subset’s limit are deleted first.
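The ordering rule can be sketched in Python. The concrete numbers below (multiples of 100, and a negative cost for Pods over a subset's limit) are illustrative assumptions, not the values the controller necessarily writes; only the relative order matters for the sketch.

```python
def deletion_costs(subset_names: list[str], step: int = 100) -> dict[str, int]:
    """Earlier (higher-priority) subsets get a higher deletion cost."""
    n = len(subset_names)
    return {name: step * (n - i) for i, name in enumerate(subset_names)}

def scale_down_order(pods, costs, step: int = 100):
    """Return Pod names in the order they would be deleted (lowest cost first).

    Each pod is a tuple (name, subset, over_limit); Pods exceeding their
    subset's limit get a negative cost so they are deleted before all others.
    """
    def cost(pod):
        _, subset, over_limit = pod
        return -step if over_limit else costs[subset]
    return [name for name, _, _ in sorted(pods, key=cost)]

costs = deletion_costs(["subset-normal", "subset-elastic"])
pods = [
    ("web-1", "subset-normal", False),
    ("web-2", "subset-elastic", False),
    ("web-3", "subset-normal", False),
    ("web-4", "subset-normal", True),   # exceeds the subset's maxReplicas
]
print(costs)                          # {'subset-normal': 200, 'subset-elastic': 100}
print(scale_down_order(pods, costs))  # ['web-4', 'web-2', 'web-1', 'web-3']
```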

3. Quantity Management and Concurrency

The WorkloadSpread status tracks missingReplicas for each subset (-1 indicates no limit). The webhook processes Pod create, delete, and eviction requests, updates missingReplicas, and injects the matching subset's rules into the Pod. To avoid optimistic‑lock conflicts during massive parallel Pod creation, the webhook uses a per‑WorkloadSpread mutex together with optimistic retries and a cache of the latest object.
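The role of the mutex can be illustrated with a short Python sketch. It models only the locking around the missingReplicas counters; the optimistic retries and object cache mentioned above are left out, and the class name SubsetCounter and its admit method are hypothetical.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

class SubsetCounter:
    """Tracks missingReplicas per subset for one WorkloadSpread (-1 = no limit)."""

    def __init__(self, missing: dict[str, int]):
        self._missing = dict(missing)   # insertion order is subset priority
        self._lock = threading.Lock()   # one lock per WorkloadSpread

    def admit(self):
        """Pick the subset for a newly created Pod, or None if every subset is full."""
        with self._lock:
            for name, missing in self._missing.items():
                if missing == -1:
                    return name
                if missing > 0:
                    self._missing[name] = missing - 1
                    return name
            return None

counter = SubsetCounter({"subset-normal": 100, "subset-elastic": -1})
# Simulate 150 Pod creation requests arriving in parallel.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda _: counter.admit(), range(150)))
counts = Counter(results)
print(sorted(counts.items()))  # [('subset-elastic', 50), ('subset-normal', 100)]
```

Because every read-and-decrement happens under the lock, exactly 100 requests land in subset-normal no matter how the threads interleave.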

4. Adaptive Scheduling

When scheduleStrategy.type is Adaptive, the webhook performs a simulated schedule based on current node and pod information. If a Pod cannot be placed within rescheduleCriticalSeconds, the subset is marked unschedulable for a configurable period (default 5 minutes), and the Pod is deleted to trigger recreation on another subset.
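The timing logic can be sketched in Python with explicit timestamps. The class AdaptiveState and its method names are hypothetical; the 30‑second threshold comes from the earlier example and the 300‑second unschedulable window from the default stated above. The simulated scheduling step is not modeled.

```python
class AdaptiveState:
    """Tracks which subsets are temporarily marked unschedulable."""

    def __init__(self, reschedule_critical_seconds: float = 30,
                 unschedulable_seconds: float = 300):
        self.reschedule_critical_seconds = reschedule_critical_seconds
        self.unschedulable_seconds = unschedulable_seconds
        self._unschedulable_until: dict[str, float] = {}

    def should_reschedule(self, pending_since: float, now: float) -> bool:
        """True once a Pod has been Pending longer than the critical threshold."""
        return now - pending_since > self.reschedule_critical_seconds

    def mark_unschedulable(self, subset: str, now: float) -> None:
        self._unschedulable_until[subset] = now + self.unschedulable_seconds

    def pick_subset(self, subsets: list[str], now: float):
        """Return the first subset, in priority order, that is not marked unschedulable."""
        for subset in subsets:
            if self._unschedulable_until.get(subset, float("-inf")) <= now:
                return subset
        return None

state = AdaptiveState()
subsets = ["subset-normal", "subset-elastic"]
print(state.pick_subset(subsets, now=0))         # subset-normal
if state.should_reschedule(pending_since=0, now=31):
    state.mark_unschedulable("subset-normal", now=31)
print(state.pick_subset(subsets, now=31))        # subset-elastic
print(state.pick_subset(subsets, now=31 + 300))  # subset-normal
```

Once the window expires, new Pods are tried on the higher‑priority subset again.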

Conclusion

WorkloadSpread is a side‑car‑style solution that builds on existing Kubernetes mechanisms to give workloads elastic, multi‑domain deployment capabilities without altering the original workload definition. It reduces deployment complexity, lowers cost through elastic scaling, and is already used in production environments.

References

https://github.com/openkruise/kruise

https://openkruise.io/

https://openkruise.io/zh-cn/docs/workloadspread.html

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/

https://openkruise.io/zh-cn/docs/uniteddeployment.html
