Cloud Native

Kubernetes Water Level Balancing Scheduler: Design, Implementation, and Evaluation

The water-level-balanced Kubernetes scheduler plugin continuously gathers historical CPU/memory usage from Prometheus and applies a water-level algorithm during the PreFilter through Score phases, placing low-utilization Pods on high-utilization Nodes (and vice versa) to equalize node load, improve cluster stability, and raise overall resource utilization.


The default Kubernetes scheduler balances resources based on the requested values of Pods, which are often over‑estimated compared to actual usage. This leads to a large gap between the allocated request and the real utilization (water‑level) of Nodes, especially in large or long‑running clusters. The imbalance causes some Nodes to be overloaded while others remain under‑utilized.

To address this, a water‑level‑balanced scheduler is introduced. It continuously monitors historical resource usage of Nodes and Pods, then during scheduling it applies a water‑level algorithm: Pods with low water‑level are placed on high‑water‑level Nodes, and vice‑versa, aiming to equalize the water‑level across the whole cluster, improve stability, and increase resource utilization.

Scheduler workflow: The scheduler watches the API server (backed by etcd) for Pods that have no nodeName set, selects the most suitable Node, and binds the Pod by setting its nodeName. Since Kubernetes 1.16 the Scheduling Framework has provided extensible plug-in points. The main phases are:

PreFilter – preprocess Pod information and abort if necessary.

Filter – reject Nodes that do not satisfy constraints.

PreScore – compute auxiliary data for scoring.

Score – assign a numeric score to each feasible Node (including optional NormalizeScore).

Bind – finally bind the Pod to the chosen Node.

Permit, PreBind, PostBind – additional hooks for approval, preparation, and cleanup.

The following pseudo‑code illustrates the scheduling cycle:

allNodes = all Nodes in the cluster
for PreFilter in (plugin1, plugin2, ...):
  IsSuccess = PreFilter(state *CycleState, pod *v1.Pod)
  if IsSuccess is False:
    return // end of this scheduling cycle
feasibleNodes = [] // Nodes that pass the Filter phase
for nodeInfo in allNodes:
  IsSuccess = True
  for Filter in (plugin1, plugin2, ...):
    IsSuccess = Filter(state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo)
    if IsSuccess is False:
      break
  if IsSuccess is True:
    feasibleNodes.append(nodeInfo)
if len(feasibleNodes) == 1:
  return feasibleNodes[0]
for PreScore in (plugin1, plugin2, ...):
  PreScore(state *CycleState, pod *v1.Pod)
NodeScores = {}
for index, nodeInfo in feasibleNodes:
  for Score in (plugin1, plugin2, ...):
    score = Score(state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo)
    NodeScores[pluginName][index] = score
for NormalizeScore in (plugin1, plugin2, ...):
  nodeScoreList = NodeScores[pluginName]
  NormalizeScore(state *CycleState, pod *v1.Pod, scores NodeScoreList)
for pluginName, nodeScoreList in NodeScores:
  for nodeScore in nodeScoreList:
    nodeScore.Score = nodeScore.Score * int64(pluginWeight)
result = []
for nodeIndex, nodeName in feasibleNodes:
  nodeResult = {Name: nodeName, Score: 0}
  for pluginName, _ in NodeScores:
    nodeResult.Score += NodeScores[pluginName][nodeIndex].Score
  result.append(nodeResult)
selectedNode = selectHost(result)
return selectedNode

Plugin configuration can be done via a KubeSchedulerConfiguration YAML (or ConfigMap). Example:

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    preFilter:
      enabled:
      - name: HheWaterLevelBalance
    filter:
      enabled:
      - name: HheWaterLevelBalance
      - name: HkePodTopologySpread
    preScore:
      enabled:
      - name: HkePodTopologySpread
      - name: HheWaterLevelBalance
    score:
      enabled:
      - name: HkePodTopologySpread   # enable custom plugin
      - name: HheWaterLevelBalance
      disabled:
      - name: ImageLocality          # disable default plugin
      - name: InterPodAffinity
    postBind:
      enabled:
      - name: HheWaterLevelBalance
  pluginConfig:
  - name: HheWaterLevelBalance
    args:
      clusterCpuMinNodeWeight: 0.2
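
The args block above is what the plugin receives at start-up. Below is a minimal sketch of decoding it, assuming the pre-1.27 scheduler-framework factory signature; HheWaterLevelBalanceArgs, its field names, and the WaterLevelBalance struct are illustrative reconstructions, not the actual implementation.

package hhewaterlevelbalance

import (
  "fmt"

  "k8s.io/apimachinery/pkg/runtime"
  "k8s.io/kubernetes/pkg/scheduler/framework"
  frameworkruntime "k8s.io/kubernetes/pkg/scheduler/framework/runtime"
)

// Name is the plugin name referenced in the KubeSchedulerConfiguration above.
const Name = "HheWaterLevelBalance"

// HheWaterLevelBalanceArgs mirrors the pluginConfig.args block (illustrative).
type HheWaterLevelBalanceArgs struct {
  ClusterCpuMinNodeWeight float64 `json:"clusterCpuMinNodeWeight,omitempty"`
}

// WaterLevelBalance carries the decoded arguments and the framework handle.
type WaterLevelBalance struct {
  handle framework.Handle
  args   HheWaterLevelBalanceArgs
}

// Name implements framework.Plugin.
func (w *WaterLevelBalance) Name() string { return Name }

// New is the factory passed to app.WithPlugin; it decodes pluginConfig.args
// from the scheduling profile (pre-1.27 factory signature assumed).
func New(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
  args := HheWaterLevelBalanceArgs{ClusterCpuMinNodeWeight: 0.2} // default when args are omitted
  if obj != nil {
    if err := frameworkruntime.DecodeInto(obj, &args); err != nil {
      return nil, fmt.Errorf("decoding %s args: %w", Name, err)
    }
  }
  return &WaterLevelBalance{handle: h, args: args}, nil
}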

Research on existing plugins: The TargetLoadPacking plugin from kubernetes-sigs predicts a target CPU utilization (target_cpu = node_cpu + pod_cpu) and scores Nodes by their distance from a preset cluster_cpu target. The algorithm resembles a best-fit variant of the knapsack problem. However, the original scoring formula contains a constant "50" that can produce unintuitive results when target_cpu is low.
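
For reference, the prediction step can be sketched as follows, assuming the Pod's contribution is its CPU requests expressed as a percentage of the Node's allocatable CPU; the helper name and inputs are illustrative, not the plugin's actual code.

package hhewaterlevelbalance

import (
  v1 "k8s.io/api/core/v1"
)

// predictedCPUPercent estimates target_cpu = node_cpu + pod_cpu: the Node's
// measured CPU utilization plus the incoming Pod's CPU requests, expressed
// as a percentage of the Node's allocatable CPU (allocatableMilli).
func predictedCPUPercent(nodeCPUPercent float64, pod *v1.Pod, allocatableMilli int64) float64 {
  var requestMilli int64
  for _, c := range pod.Spec.Containers {
    if req, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
      requestMilli += req.MilliValue()
    }
  }
  if allocatableMilli <= 0 {
    return nodeCPUPercent
  }
  return nodeCPUPercent + 100*float64(requestMilli)/float64(allocatableMilli)
}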

Custom solution – crane-scheduler: It periodically pulls metrics (CPU/memory averages over 5 min, 1 h, and 1 d) from Prometheus, writes them into Node annotations, and filters out overloaded Nodes in the Filter phase. In the Score phase it combines weighted metrics to prefer low-load Nodes. Hot-spot mitigation includes penalizing Nodes that have had many Pods scheduled onto them in the most recent minute.
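
A rough sketch of such a filter-phase check, assuming load averages have already been written into Node annotations by a metrics syncer; the annotation keys and thresholds below are placeholders, not crane-scheduler's actual names.

package hhewaterlevelbalance

import (
  "strconv"

  v1 "k8s.io/api/core/v1"
)

// loadThresholds maps annotated load metrics (written to each Node by the
// Prometheus syncer) to the maximum value a schedulable Node may report.
// Keys and limits are illustrative.
var loadThresholds = map[string]float64{
  "waterlevel.hke.io/cpu_usage_avg_5m": 80,
  "waterlevel.hke.io/cpu_usage_avg_1h": 70,
  "waterlevel.hke.io/mem_usage_avg_5m": 85,
}

// nodeOverloaded reports whether a Node should be rejected in the Filter
// phase because one of its recent load averages exceeds its threshold.
func nodeOverloaded(node *v1.Node) bool {
  for key, limit := range loadThresholds {
    raw, ok := node.Annotations[key]
    if !ok {
      continue // metric not synced yet: do not block scheduling on missing data
    }
    if usage, err := strconv.ParseFloat(raw, 64); err == nil && usage > limit {
      return true
    }
  }
  return false
}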

Water-level acquisition (a code sketch follows the list):

Node water‑level: read real load from Prometheus and store in an annotation.

Pod water‑level: for Deployments/Clonesets read the annotation; for other Pods use the limit as a proxy.
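A minimal sketch of those two look-ups; the annotation keys are assumptions (and the Pod water level is read from the Pod's own annotation here for simplicity), while only the fallback to the CPU limit comes from the description above.

package hhewaterlevelbalance

import (
  "strconv"

  v1 "k8s.io/api/core/v1"
)

const nodeCPUWaterLevelAnnotation = "waterlevel.hke.io/node_cpu_usage" // hypothetical key
const podCPUWaterLevelAnnotation = "waterlevel.hke.io/pod_cpu_usage"   // hypothetical key

// nodeWaterLevel returns the Node's measured CPU water level (in percent)
// from its annotation, or false if the metric has not been synced yet.
func nodeWaterLevel(node *v1.Node) (float64, bool) {
  raw, ok := node.Annotations[nodeCPUWaterLevelAnnotation]
  if !ok {
    return 0, false
  }
  v, err := strconv.ParseFloat(raw, 64)
  return v, err == nil
}

// podWaterLevelMilli returns the Pod's CPU water level in milli-cores:
// the annotated historical usage when present (Deployments/CloneSets),
// otherwise the sum of container CPU limits as a proxy.
func podWaterLevelMilli(pod *v1.Pod) int64 {
  if raw, ok := pod.Annotations[podCPUWaterLevelAnnotation]; ok {
    if v, err := strconv.ParseInt(raw, 10, 64); err == nil {
      return v
    }
  }
  var total int64
  for _, c := range pod.Spec.Containers {
    if limit, ok := c.Resources.Limits[v1.ResourceCPU]; ok {
      total += limit.MilliValue()
    }
  }
  return total
}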

Scoring formula (simplified):

cluster_cpu = preset ideal value
if target_cpu <= cluster_cpu:
  score = (100 - cluster_cpu) * target_cpu / cluster_cpu + cluster_cpu
elif cluster_cpu < target_cpu <= 100:
  score = cluster_cpu * (100 - target_cpu) / (100 - cluster_cpu)
else:
  score = 0

Example with cluster_cpu = 20 % and five Nodes (Na‑Ne) yields scores Sa‑Se = 24, 40, 19, 13, 3 respectively, demonstrating that the algorithm pushes Pods toward Nodes whose water‑level is close to the ideal.
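
The formula translates directly into a small helper; the node water levels quoted in the closing comment are back-computed from the stated scores rather than given in the original text.

package hhewaterlevelbalance

// waterLevelScore maps a Node's predicted CPU water level (targetCPU, in %)
// to a 0-100 score that peaks when targetCPU equals the ideal clusterCPU.
func waterLevelScore(targetCPU, clusterCPU float64) int64 {
  switch {
  case targetCPU <= clusterCPU:
    return int64((100-clusterCPU)*targetCPU/clusterCPU + clusterCPU)
  case targetCPU <= 100:
    return int64(clusterCPU * (100 - targetCPU) / (100 - clusterCPU))
  default:
    return 0
  }
}

// With clusterCPU = 20, node water levels of 1%, 5%, 24%, 48% and 88% give
// waterLevelScore = 24, 40, 19, 13 and 3, matching Sa-Se above.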

Handling business peaks and troughs: The algorithm considers three time windows (15 min, 1 h, 1 d) and dynamically adjusts cluster_cpu using a weighted average of the overall cluster average and the minimum Node water-level.
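
One plausible reading of this, tying it to the clusterCpuMinNodeWeight argument from the configuration example, is sketched below; the exact blend the plugin uses is an assumption.

package hhewaterlevelbalance

// dynamicClusterCPU recomputes the ideal water level from the current Node
// metrics so the scheduling target follows business peaks and troughs:
// a weighted average of the idlest Node and the cluster-wide average.
func dynamicClusterCPU(nodeCPUPercents []float64, minNodeWeight float64) float64 {
  if len(nodeCPUPercents) == 0 {
    return 0
  }
  min, sum := nodeCPUPercents[0], 0.0
  for _, v := range nodeCPUPercents {
    if v < min {
      min = v
    }
    sum += v
  }
  avg := sum / float64(len(nodeCPUPercents))
  // With clusterCpuMinNodeWeight = 0.2, 20% of the weight goes to the idlest Node.
  return minNodeWeight*min + (1-minNodeWeight)*avg
}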

Hot‑spot problem mitigation involves maintaining a local cache (ScheduledPodsCache) of recent Pod bindings, detecting missing utilization data, and cleaning stale entries.
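
A minimal sketch of such a cache, covering the recent-binding bookkeeping and stale-entry cleanup; the one-minute TTL, struct layout, and method names are illustrative.

package hhewaterlevelbalance

import (
  "sync"
  "time"
)

type scheduledPod struct {
  podCPUMilli int64
  boundAt     time.Time
}

// ScheduledPodsCache tracks Pods recently bound to each Node so their load is
// accounted for before Prometheus metrics catch up, avoiding hot spots.
type ScheduledPodsCache struct {
  mu     sync.Mutex
  ttl    time.Duration
  byNode map[string][]scheduledPod
}

func NewScheduledPodsCache(ttl time.Duration) *ScheduledPodsCache {
  return &ScheduledPodsCache{ttl: ttl, byNode: map[string][]scheduledPod{}}
}

// Add records a Pod that was just bound to nodeName (e.g. from PostBind).
func (c *ScheduledPodsCache) Add(nodeName string, podCPUMilli int64) {
  c.mu.Lock()
  defer c.mu.Unlock()
  c.byNode[nodeName] = append(c.byNode[nodeName], scheduledPod{podCPUMilli, time.Now()})
}

// PendingMilli sums recently bound, not-yet-visible CPU on nodeName and
// cleans up stale entries while doing so.
func (c *ScheduledPodsCache) PendingMilli(nodeName string) int64 {
  c.mu.Lock()
  defer c.mu.Unlock()
  var total int64
  fresh := c.byNode[nodeName][:0]
  for _, p := range c.byNode[nodeName] {
    if time.Since(p.boundAt) < c.ttl {
      total += p.podCPUMilli
      fresh = append(fresh, p)
    }
  }
  c.byNode[nodeName] = fresh
  return total
}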

Implementation details include registering the custom plugin in the scheduler binary:

import (
  "math/rand"
  "os"
  "time"

  "k8s.io/component-base/logs"
  "k8s.io/kubernetes/cmd/kube-scheduler/app"

  "pkg/hhewaterlevelbalance"
  "pkg/plugin2"
)

func main() {
  rand.Seed(time.Now().UnixNano())
  // Register the custom plugins with the standard kube-scheduler command.
  command := app.NewSchedulerCommand(
    app.WithPlugin(hhewaterlevelbalance.Name, hhewaterlevelbalance.New),
    app.WithPlugin(plugin2.Name, plugin2.New),
  )
  logs.InitLogs()
  defer logs.FlushLogs()
  if err := command.Execute(); err != nil {
    os.Exit(1)
  }
}

After deployment, monitoring graphs show that before enabling the plugin the Node water‑level deviation exceeds 50 %, while after a period of operation the deviation drops to around 15 %.

Conclusion:

Node and Pod water‑level acquisition currently relies on annotations; future work should unify this via the Kubernetes Metrics Server.

Time‑series variables could enable tidal mixing to improve utilization during low‑traffic periods.

Water‑level balancing markedly improves cluster stability and, together with autoscaling and request/limit prediction, helps reduce costs.

Tags: Performance, Kubernetes, plugin, Go, scheduler, WaterLevelBalancing
Written by HelloTech, the official Hello technology account, sharing tech insights and developments.