Mastering Kubernetes Descheduler: Strategies to Balance Your Cluster

Learn how to use Kubernetes Descheduler to rebalance uneven pod distribution across nodes by configuring various built‑in strategies, custom policies, filtering options, and deployment methods such as Jobs and CronJobs, with detailed examples and best‑practice guidelines for production clusters.

Ops Development Stories

The kube-scheduler assigns Pods to Nodes based on the cluster state at scheduling time and does not revisit those decisions afterward. Because clusters are highly dynamic, pod distribution can become unbalanced over time for several reasons:

Some nodes are under‑utilized or over‑utilized.

Taints or labels are added to or removed from nodes, or pod/node affinity requirements change, so the original scheduling decision no longer holds.

Nodes fail and their Pods are rescheduled onto other nodes.

New nodes are added to the cluster.

When such imbalances occur, the Descheduler can be used to rebalance the cluster by evicting Pods according to configurable strategies.

Descheduler

Descheduler applies a set of strategies to identify Pods that should be evicted so that the cluster reaches a more balanced state. All strategies are enabled by default but can be turned on or off individually.

RemoveDuplicates

LowNodeUtilization

RemovePodsViolatingInterPodAntiAffinity

RemovePodsViolatingNodeAffinity

RemovePodsViolatingNodeTaints

RemovePodsViolatingTopologySpreadConstraint

RemovePodsHavingTooManyRestarts

PodLifeTime

Common configuration options include:

nodeSelector: restricts which nodes are processed.

evictLocalStoragePods: allows eviction of Pods that use local storage.

ignorePvcPods: when set to true, Pods with PVCs are ignored (default false).

maxNoOfPodsToEvictPerNode: maximum number of Pods that can be evicted from each node.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
nodeSelector: prod=dev
evictLocalStoragePods: true
maxNoOfPodsToEvictPerNode: 40
ignorePvcPods: false
strategies:
  ...

RemoveDuplicates

This strategy ensures that only one Pod belonging to the same ReplicaSet, ReplicationController, Deployment, or Job runs on a given node. Duplicates typically pile up when a node fails and its Pods move to other nodes; once the failed node is ready again, evicting the extras lets them spread out. Parameters:

excludeOwnerKinds (list of strings): owner kinds whose Pods are excluded from eviction.

namespaces: include or exclude lists of namespaces (see Namespace filtering).

thresholdPriority (int): priority threshold for eviction (see Priority filtering).

thresholdPriorityClassName (string): priority class to use as the threshold (see Priority filtering).

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      removeDuplicates:
        excludeOwnerKinds:
        - "ReplicaSet"

LowNodeUtilization

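This strategy finds under‑utilized nodes and evicts Pods from over‑utilized nodes, in the hope that the scheduler recreates them on the under‑utilized ones. Both sets of limits live under nodeResourceUtilizationThresholds and are expressed as percentages for CPU, memory, and pod count. A node is under‑utilized only if its usage is below thresholds for all three resources; it is over‑utilized if its usage is above targetThresholds for any one of them. Nodes that fall between the two are left alone. Parameters:

thresholds (map): resource usage percentages below which a node counts as under‑utilized.

targetThresholds (map): resource usage percentages above which a node counts as over‑utilized.

numberOfNodes (int): the strategy activates only when the number of under‑utilized nodes is above this value.

thresholdPriority (int) and thresholdPriorityClassName (string): priority filtering (see Priority filtering).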

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
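
As a sketch of how numberOfNodes fits alongside the two threshold maps, the policy below would leave the cluster untouched unless more than three nodes are under‑utilized. The value 3 and the percentages are illustrative assumptions, not recommendations.

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Under-utilized: cpu, memory, AND pods are all below these percentages.
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        # Over-utilized: cpu, memory, OR pods is above these percentages.
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
        # Illustrative value: activate only when more than 3 nodes are under-utilized.
        numberOfNodes: 3
```

Widening the gap between thresholds and targetThresholds leaves more nodes in the untouched middle band, which tends to reduce how many Pods get evicted per run.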

RemovePodsViolatingInterPodAntiAffinity

Evicts Pods that violate inter‑pod anti‑affinity rules, so that Pods with mutually exclusive placement constraints do not remain co‑located on the same node. Parameters:

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (include or exclude lists)

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
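
To make the violation concrete, here is a hypothetical Deployment whose Pods must not share a node with Pods labeled app: cache. All names, labels, and the image are placeholders. Because the rule is only checked at scheduling time, later changes such as relabeling a Pod can leave both kinds on one node, which is the situation this strategy looks for.

```yaml
# Hypothetical Deployment; names, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # Enforced when the Pod is scheduled, not while it is running.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: cache
            # One node per topology domain: no app=cache Pod on the same host.
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx
```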

RemovePodsViolatingNodeAffinity

When enabled, this strategy serves as a temporary implementation of requiredDuringSchedulingRequiredDuringExecution: a Pod whose node no longer satisfies its requiredDuringSchedulingIgnoredDuringExecution node affinity is evicted, provided another node is available that does satisfy it. Parameters:

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (include or exclude lists)

nodeAffinityType (list of strings): e.g., requiredDuringSchedulingIgnoredDuringExecution.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"
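
For illustration, the hypothetical Deployment below requires nodes labeled disktype=ssd. If that label is later removed from a node, the Pods already running there no longer satisfy the rule and become candidates for this strategy. The label key, value, and all names are assumptions made for the example.

```yaml
# Hypothetical Deployment; label key/value, names, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db-proxy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: db-proxy
  template:
    metadata:
      labels:
        app: db-proxy
    spec:
      affinity:
        nodeAffinity:
          # Checked at scheduling time only; the kubelet does not re-check it later.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: db-proxy
        image: nginx
```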

RemovePodsViolatingNodeTaints

Evicts Pods that do not tolerate a NoSchedule taint on the node they are running on. Parameters:

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (include or exclude lists)

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true
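
The pairing this strategy checks can be sketched with two fragments: a Node carrying a NoSchedule taint and the toleration a Pod needs in order to stay there. The taint key and value and the node name are hypothetical.

```yaml
# Fragment of a Node object with a NoSchedule taint (hypothetical key/value).
apiVersion: v1
kind: Node
metadata:
  name: node-1
spec:
  taints:
  - key: dedicated
    value: batch
    effect: NoSchedule
---
# Fragment of a pod template spec (not a complete manifest).
# Pods on node-1 that lack a matching toleration are eviction candidates.
tolerations:
- key: dedicated
  operator: Equal
  value: batch
  effect: NoSchedule
```

A taint added after Pods are already running does not remove them by itself when its effect is NoSchedule, which is why a descheduler pass is needed to clean up.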

RemovePodsViolatingTopologySpreadConstraint

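Evicts Pods that violate topology spread constraints, so that Pods are spread across topology domains within each constraint's maxSkew. The strategy requires Kubernetes 1.18 or later. By default only hard constraints are considered; setting includeSoftConstraints to true includes soft constraints as well. Parameters:

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (include or exclude lists)

includeSoftConstraints (bool)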

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: false
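
As a sketch of what the strategy evaluates, the hypothetical Deployment below asks for its Pods to be spread across zones with a skew of at most one. Names, labels, and the image are placeholders. DoNotSchedule makes this a hard constraint; ScheduleAnyway would make it a soft one.

```yaml
# Hypothetical Deployment; names, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      # Pod counts in any two zones may differ by at most 1.
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        # Hard constraint; "ScheduleAnyway" would be a soft constraint.
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx
```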

RemovePodsHavingTooManyRestarts

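Evicts Pods whose containers have restarted more than a configured number of times, optionally counting init container restarts. Parameters:

podRestartThreshold (int): restart count at which a Pod becomes eligible for eviction.

includingInitContainers (bool): whether init container restarts count toward the threshold.

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (include or exclude lists)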

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100
        includingInitContainers: true

PodLifeTime

Evicts Pods older than maxPodLifeTimeSeconds. The optional podStatusPhases field restricts eviction to Pods in the listed phases. Parameters:

maxPodLifeTimeSeconds (int)

podStatusPhases (list of strings)

thresholdPriority (int)

thresholdPriorityClassName (string)

namespaces (include or exclude lists)

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
        podStatusPhases:
        - "Pending"

Filter Pods

Descheduler allows selective eviction through namespace and priority filters.

Namespace filtering

Strategies can include or exclude specific namespaces using include or exclude lists.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      namespaces:
        include:
        - "namespace1"
        - "namespace2"

To exclude namespaces instead:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      namespaces:
        exclude:
        - "namespace1"
        - "namespace2"
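
The same filter can be sketched for another strategy. The example assumes a Descheduler release in which RemovePodsHavingTooManyRestarts accepts namespaces; the choice of kube-system as the excluded namespace is illustrative.

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100
        includingInitContainers: true
      # namespaces sits directly under params, next to the strategy-specific block.
      namespaces:
        exclude:
        - "kube-system"   # illustrative: leave system Pods alone
```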

Priority filtering

All strategies support a priority threshold: only Pods whose priority is lower than the threshold are eligible for eviction. Set it either as a number with thresholdPriority or by name with thresholdPriorityClassName.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      thresholdPriority: 10000

Using a priority class name instead:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      thresholdPriorityClassName: "priorityclass1"

Note: thresholdPriority and thresholdPriorityClassName cannot be configured together. If the specified priority class does not exist, Descheduler does not create it and fails with an error.
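
Since the class has to exist before Descheduler runs, a minimal PriorityClass for the priorityclass1 name used in the example might look like the following. The value 10000 is an assumption chosen to mirror the numeric example; the description is a placeholder.

```yaml
# Hypothetical PriorityClass backing thresholdPriorityClassName: "priorityclass1".
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: priorityclass1
# Pods with a priority lower than this value remain eligible for eviction.
value: 10000
globalDefault: false
description: "Eviction threshold for the descheduler (example)."
```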

Pod Evictions

Critical system Pods (priority class system-cluster-critical or system-node-critical) are never evicted.

Pods that are not managed by a ReplicaSet, ReplicationController, Deployment, or Job (static, mirror, or standalone Pods) are never evicted, because nothing would recreate them.

DaemonSet Pods are never evicted.

Pods using LocalStorage are protected unless evictLocalStoragePods: true is set.

Pods with PVCs are evicted unless ignorePvcPods: true is set.

Under LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity, Pods are evicted from low to high priority; within the same priority, BestEffort Pods are evicted before Burstable and Guaranteed Pods.

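Pods annotated with descheduler.alpha.kubernetes.io/evict are evicted; the annotation overrides the checks that would otherwise prevent eviction, so it can be used to opt specific Pods in.

To see why a Pod was not evicted, run Descheduler with --v=4 or higher, which logs the reason each Pod is considered non-evictable.

Pods covered by a PodDisruptionBudget (PDB) are not evicted when doing so would violate the PDB, because Descheduler evicts through the eviction subresource.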
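
Two of the rules above can be sketched as manifests: a PDB that limits how many Pods may be disrupted at once, and the annotation that opts a Pod in to eviction. The names, labels, and minAvailable value are hypothetical, and the annotation value "true" is an assumption, since only the annotation key is named here.

```yaml
# Hypothetical PDB: keep at least 2 Pods labeled app=web available.
# policy/v1beta1 matches Kubernetes 1.20 and earlier; 1.21+ also offers policy/v1.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
---
# Fragment of a pod template's metadata (not a complete manifest).
# The value "true" is an assumption; only the key is given in this article.
metadata:
  annotations:
    descheduler.alpha.kubernetes.io/evict: "true"
```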

Version Compatibility

Descheduler v0.20 → Kubernetes v1.20

Descheduler v0.19 → Kubernetes v1.19

Descheduler v0.18 → Kubernetes v1.18

Descheduler v0.10 → Kubernetes v1.17

Descheduler v0.4‑v0.9 → Kubernetes v1.9+

Descheduler v0.1‑v0.3 → Kubernetes v1.7‑v1.8

Practice

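1. Download the Descheduler release that matches your cluster version (see Version Compatibility). The command below fetches v0.18.0, while the manifests in steps 4 and 5 reference the v0.10.0 image, so align the image tag with the release you actually run.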

$ wget https://github.com/kubernetes-sigs/descheduler/archive/v0.18.0.tar.gz

2. Create RBAC resources

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role
  # ClusterRole is cluster-scoped, so metadata.namespace is omitted.
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding
  # ClusterRoleBinding is cluster-scoped, so metadata.namespace is omitted.
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
- name: descheduler-sa
  kind: ServiceAccount
  namespace: kube-system
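
If the policy uses thresholdPriorityClassName, Descheduler has to look the class up, so the role probably also needs read access to priority classes. The fragment below is a sketch under that assumption; check it against the RBAC manifest shipped with the release you deploy.

```yaml
# Fragment to append under `rules:` in descheduler-cluster-role.
# Assumption: only needed when thresholdPriorityClassName is used.
- apiGroups: ["scheduling.k8s.io"]
  resources: ["priorityclasses"]
  verbs: ["get", "watch", "list"]
```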

3. Create a ConfigMap with the policy

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
        enabled: true
      "RemovePodsViolatingInterPodAntiAffinity":
        enabled: true
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              "cpu": 20
              "memory": 20
              "pods": 20
            targetThresholds:
              "cpu": 50
              "memory": 50
              "pods": 50

4. Run Descheduler as a Job

---
apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod
    spec:
      priorityClassName: system-cluster-critical
      containers:
      - name: descheduler
        image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.10.0
        volumeMounts:
        - mountPath: /policy-dir
          name: policy-volume
        command:
        - "/bin/descheduler"
        args:
        - "--policy-config-file"
        - "/policy-dir/policy.yaml"
        - "--v"
        - "3"
      restartPolicy: "Never"
      serviceAccountName: descheduler-sa
      volumes:
      - name: policy-volume
        configMap:
          name: descheduler-policy-configmap

5. Schedule periodic evictions with a CronJob

---
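# batch/v1beta1 fits the Kubernetes versions covered here (up to 1.20);
# clusters on 1.21 or later can use batch/v1 for CronJob.
apiVersion: batch/v1beta1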
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/2 * * * *"
  concurrencyPolicy: "Forbid"
  jobTemplate:
    spec:
      template:
        metadata:
          name: descheduler-pod
        spec:
          priorityClassName: system-cluster-critical
          containers:
          - name: descheduler
            image: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler:v0.10.0
            volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
            command:
            - "/bin/descheduler"
            args:
            - "--policy-config-file"
            - "/policy-dir/policy.yaml"
            - "--v"
            - "3"
          restartPolicy: "Never"
          serviceAccountName: descheduler-sa
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler-policy-configmap
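
Running every two minutes is aggressive for most clusters. As a sketch, the spec fields of the CronJob above could be tuned as follows; the interval and history limits are illustrative assumptions, not values from the upstream project.

```yaml
# Fragment of the descheduler-cronjob spec; values are illustrative assumptions.
spec:
  schedule: "*/30 * * * *"        # every 30 minutes instead of every 2
  concurrencyPolicy: "Forbid"     # never overlap two descheduler runs
  successfulJobsHistoryLimit: 3   # keep the last 3 completed Jobs for log inspection
  failedJobsHistoryLimit: 1       # keep the most recent failed Job
```

A longer interval gives evicted Pods time to be rescheduled and become ready before the next pass evaluates the cluster again.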

Reference: https://github.com/kubernetes-sigs/descheduler