
Unified Scheduling Access, Algorithm Enhancements, and Performance Optimizations in Ctrip's Cloud Container K8s Platform

This article details Ctrip Cloud Container's practical experience in building a unified, policy‑driven scheduling framework for Kubernetes, covering algorithm parameterization, affinity configuration, extended resource‑balancing, load‑aware scoring, and performance tuning that together raise scheduling throughput by over five times in large‑scale clusters.

Ctrip Technology

Apr 9, 2020


The authors, members of Ctrip's Cloud Container team, describe how their Kubernetes container service, which runs over 100,000 containers and grows rapidly, evolved from a custom Mesos scheduler to a heavily modified K8s fork to meet unified elastic scheduling goals.

Unified Scheduling Access Optimization – The team abstracts common scheduling requirements into Policy objects stored in ConfigMaps or CRDs. A PolicyTemplate parameterizes a policy, and a Binding (or a Pod annotation) associates Pods with a specific policy. This enables dynamic, per-Pod algorithm selection without restarting the scheduler.

Example policy definition:

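type Policy struct {
  metav1.TypeMeta
  // Predicates are the filter functions a node must pass.
  Predicates []PredicatePolicy
  // Priorities are the weighted scoring functions used to rank nodes.
  Priorities []PriorityPolicy
  // ExtenderConfigs describe external scheduler extenders to call.
  ExtenderConfigs []ExtenderConfig
  // HardPodAffinitySymmetricWeight is the weight of the implicit preferred
  // rule that corresponds to each required pod-affinity rule.
  HardPodAffinitySymmetricWeight int32
  // AlwaysCheckAllPredicates keeps evaluating predicates after one fails.
  AlwaysCheckAllPredicates bool
}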

Pods can specify the desired policy via an annotation such as:

sched.cloud.ctrip.com/policy=xxx

Algorithm Parameterization (2.1) – The native K8s scheduler loads policies only at start-up. By adding a PolicyCacheProvider that watches ConfigMaps, the team enables hot updates and per-Pod algorithm set selection.
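The hot-update and per-Pod selection idea can be illustrated with a minimal sketch. It is not the team's PolicyCacheProvider: the names PolicyCache, Update, and ForPod are hypothetical, the Policy type is a simplified stand-in, and the ConfigMap watch is simulated by calling Update directly.

```go
package main

import (
	"fmt"
	"sync"
)

// PolicyAnnotation is the annotation key quoted in the article.
const PolicyAnnotation = "sched.cloud.ctrip.com/policy"

// Policy is a simplified stand-in for the scheduler Policy type.
type Policy struct {
	Name       string
	Predicates []string
	Priorities []string
}

// PolicyCache is a hypothetical in-memory cache that a ConfigMap
// watcher would keep current while the scheduler is running.
type PolicyCache struct {
	mu       sync.RWMutex
	policies map[string]Policy
	fallback Policy
}

func NewPolicyCache(fallback Policy) *PolicyCache {
	return &PolicyCache{policies: map[string]Policy{}, fallback: fallback}
}

// Update would be called from the watch handler on add/update events.
func (c *PolicyCache) Update(p Policy) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.policies[p.Name] = p
}

// ForPod returns the policy named by the Pod's annotation,
// or the fallback policy when none is set or loaded.
func (c *PolicyCache) ForPod(annotations map[string]string) Policy {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if p, ok := c.policies[annotations[PolicyAnnotation]]; ok {
		return p
	}
	return c.fallback
}

func main() {
	cache := NewPolicyCache(Policy{Name: "default"})
	pod := map[string]string{PolicyAnnotation: "gpu"}
	fmt.Println(cache.ForPod(pod).Name) // "gpu" is not loaded yet
	cache.Update(Policy{Name: "gpu", Priorities: []string{"BalancedResourceAllocation"}})
	fmt.Println(cache.ForPod(pod).Name) // picked up without a restart
}
```

Because lookups go through the cache on every scheduling cycle, a policy change takes effect for the next Pod without restarting the scheduler process.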

Affinity Parameterization (2.2) – They encapsulate NodeAffinity and PodAffinity into NodeSchedulerConfig and PodSchedulerConfig CRDs, allowing business teams to set affinity rules via simple annotations while platform operators retain control.

apiVersion: sched.cloud.ctrip.com/v1
kind: NodeSchedulerConfig
metadata:
  name: resource-pool-example
spec:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.ctrip.com/pool
          operator: In
          values:
          - {{.PrefferredPool}}
      - matchExpressions:
        - key: cloud.ctrip.com/pool
          operator: In
          values:
          - {{.DefaultPool}}
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 3
      preference:
        matchExpressions:
        - key: cloud.ctrip.com/pool
          operator: In
          values:
          - {{.PrefferredPool}}
    - weight: 2
      preference:
        matchExpressions:
        - key: cloud.ctrip.com/pool
          operator: In
          values:
          - {{.DefaultPool}}...

Binding annotation example:

scb.sched.cloud.ctrip.com/nsc=resource-pool-example
scb.sched.cloud.ctrip.com/nsc-args='{"DefaultPool":"B","PrefferredPool":"A"}'
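One plausible way to expand such a template from the nsc-args annotation is Go's text/template with JSON-decoded arguments. The sketch below is an assumption about the mechanism, not the team's code: Render and nscTemplate are hypothetical names, and the template is a shortened fragment. The placeholder spelling PrefferredPool follows the source.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// nscTemplate is a shortened fragment of the NodeSchedulerConfig above.
const nscTemplate = `values:
- {{.PrefferredPool}}
- {{.DefaultPool}}
`

// Render fills a policy template with the JSON arguments taken
// from the nsc-args annotation value.
func Render(tmpl, argsJSON string) (string, error) {
	var args map[string]string
	if err := json.Unmarshal([]byte(argsJSON), &args); err != nil {
		return "", fmt.Errorf("parse args: %w", err)
	}
	// missingkey=error rejects placeholders that have no argument.
	t, err := template.New("nsc").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, args); err != nil {
		return "", fmt.Errorf("render: %w", err)
	}
	return buf.String(), nil
}

func main() {
	out, err := Render(nscTemplate, `{"DefaultPool":"B","PrefferredPool":"A"}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Failing on a missing argument, rather than rendering an empty value, keeps a mistyped annotation from silently producing an affinity rule that matches no pool.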

Algorithm Optimizations

3.1 Extended Resource‑Balancing Algorithm – The native balanced_allocation only considers CPU, memory, and volume. The team extends it to include GPU and other dimensions with configurable weights.
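The source does not give the extended formula. As an illustration only, the sketch below generalizes a variance-based balance score to any number of resource dimensions with per-resource weights. BalanceScore, the 0 to 10 score range, and the example weights are assumptions.

```go
package main

import "fmt"

const maxScore = 10.0

// BalanceScore returns a score in [0, maxScore]. Higher means the node's
// utilization fractions (requested/allocatable per resource) are closer
// to each other. Resources without an entry in weights are ignored.
func BalanceScore(fractions, weights map[string]float64) float64 {
	var weightSum, mean float64
	for res, f := range fractions {
		if f >= 1 {
			return 0 // a fully requested resource makes the node a poor fit
		}
		weightSum += weights[res]
		mean += weights[res] * f
	}
	if weightSum == 0 {
		return 0
	}
	mean /= weightSum

	// Weighted variance of the fractions around their weighted mean.
	var variance float64
	for res, f := range fractions {
		d := f - mean
		variance += weights[res] * d * d
	}
	variance /= weightSum
	return (1 - variance) * maxScore
}

func main() {
	weights := map[string]float64{"cpu": 1, "memory": 1, "gpu": 2}
	even := map[string]float64{"cpu": 0.5, "memory": 0.5, "gpu": 0.5}
	skewed := map[string]float64{"cpu": 0.9, "memory": 0.1, "gpu": 0.5}
	fmt.Println(BalanceScore(even, weights) > BalanceScore(skewed, weights))
}
```

Raising the weight of a scarce dimension such as GPU makes imbalance in that dimension cost more score than the same imbalance in CPU or memory.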

3.2 Water‑Level‑Aware Stacking/Dispersal – By monitoring overall cluster resource utilization, the scheduler switches between most_requested (stacking) and least_requested (dispersal) strategies to better utilize fragmented resources, achieving up to 98% allocation under high load.
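A minimal sketch of the switching logic follows. The direction of the switch (stack when the cluster is busy, disperse when it is not) is one reading of the description, and the threshold value, ChooseStrategy, and Score are hypothetical.

```go
package main

import "fmt"

// Strategy names mirror the two priorities named in the article.
type Strategy string

const (
	Stack    Strategy = "most_requested"
	Disperse Strategy = "least_requested"
)

// ChooseStrategy stacks Pods once cluster utilization reaches the
// threshold (to fill fragmented capacity) and disperses them otherwise.
func ChooseStrategy(clusterUtilization, threshold float64) Strategy {
	if clusterUtilization >= threshold {
		return Stack
	}
	return Disperse
}

// Score rates a node from 0 to 10 for one resource under a strategy.
// requested includes the incoming Pod's request.
func Score(s Strategy, requested, capacity int64) int64 {
	if capacity <= 0 || requested > capacity {
		return 0
	}
	if s == Stack {
		return requested * 10 / capacity
	}
	return (capacity - requested) * 10 / capacity
}

func main() {
	const threshold = 0.8 // assumed water level, not from the source
	for _, util := range []float64{0.35, 0.92} {
		s := ChooseStrategy(util, threshold)
		fmt.Println(util, s, Score(s, 8000, 10000))
	}
}
```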

3.3 Runtime Load‑Aware Scoring – Introduces most_available which scores nodes using VirtualAvailable (cached runtime resources) and VirtualRequest (estimated actual usage), reducing hotspot formation during burst scheduling.
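The source does not spell out the scoring formula, so the following is only a sketch of the idea under stated assumptions: each node carries a cached VirtualAvailable value, each placement deducts the Pod's VirtualRequest from that cache immediately, and the score is the remaining headroom relative to capacity. Node, MostAvailableScore, and Place are hypothetical names.

```go
package main

import "fmt"

// Node tracks one resource dimension (for example CPU cores).
type Node struct {
	Name             string
	Capacity         float64
	VirtualAvailable float64 // cached headroom derived from runtime metrics
}

// MostAvailableScore rates a node from 0 to 10 by the headroom that
// would remain after placing a Pod with the given VirtualRequest.
func MostAvailableScore(n Node, virtualRequest float64) float64 {
	left := n.VirtualAvailable - virtualRequest
	if n.Capacity <= 0 || left < 0 {
		return 0
	}
	return left / n.Capacity * 10
}

// Place picks the highest-scoring node that can hold the request and
// deducts the request from its cache, so the next Pod in a burst sees
// the reduced headroom before fresh metrics arrive.
func Place(nodes []Node, virtualRequest float64) string {
	best, bestScore := -1, -1.0
	for i, n := range nodes {
		if n.VirtualAvailable < virtualRequest {
			continue // not enough estimated headroom
		}
		if s := MostAvailableScore(n, virtualRequest); s > bestScore {
			best, bestScore = i, s
		}
	}
	if best < 0 {
		return ""
	}
	nodes[best].VirtualAvailable -= virtualRequest
	return nodes[best].Name
}

func main() {
	nodes := []Node{
		{Name: "a", Capacity: 16, VirtualAvailable: 8},
		{Name: "b", Capacity: 16, VirtualAvailable: 6},
	}
	for i := 0; i < 4; i++ {
		fmt.Println(Place(nodes, 3))
	}
}
```

Updating the cache at placement time is what spreads a burst: without it, every Pod in the burst would see the same stale headroom and land on the same node.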

Performance Optimizations – The team rewrote the upstream performance test harness, added Prometheus instrumentation, and refactored the inter‑pod affinity/anti‑affinity scoring to a two‑phase approach, dramatically reducing computational complexity. The optimized path lowers the algorithmic cost from O(Np·Nc·…) to a near‑linear factor, yielding a 5× throughput increase in a 300‑node, 15k‑Pod test cluster. Several patches were upstreamed (e.g., PR #79474, #84264, #80018, #79465, #79774).
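A two-phase structure of this kind can be sketched as follows. This is a simplified illustration, not the upstreamed patches: selectors are reduced to a single label equality, only one affinity term is handled, and Pod, Term, BuildTopologyScores, and ScoreNodes are hypothetical names.

```go
package main

import "fmt"

type Pod struct {
	Labels map[string]string
	Node   string
}

// Term is a simplified affinity term with a single-label selector.
type Term struct {
	MatchLabel, MatchValue string
	TopologyKey            string
	Weight                 int64 // use a negative weight for anti-affinity
}

// BuildTopologyScores is phase 1: a single pass over existing Pods
// accumulates the term's weight per topology value (for example per zone).
func BuildTopologyScores(term Term, pods []Pod, nodeLabels map[string]map[string]string) map[string]int64 {
	scores := map[string]int64{}
	for _, p := range pods {
		if p.Labels[term.MatchLabel] != term.MatchValue {
			continue
		}
		if v, ok := nodeLabels[p.Node][term.TopologyKey]; ok {
			scores[v] += term.Weight
		}
	}
	return scores
}

// ScoreNodes is phase 2: each candidate node is scored with one map
// lookup instead of a rescan of all existing Pods.
func ScoreNodes(term Term, topo map[string]int64, nodeLabels map[string]map[string]string) map[string]int64 {
	out := make(map[string]int64, len(nodeLabels))
	for name, labels := range nodeLabels {
		if v, ok := labels[term.TopologyKey]; ok {
			out[name] = topo[v]
		} else {
			out[name] = 0
		}
	}
	return out
}

func main() {
	nodeLabels := map[string]map[string]string{
		"n1": {"zone": "a"}, "n2": {"zone": "a"}, "n3": {"zone": "b"},
	}
	pods := []Pod{
		{Labels: map[string]string{"app": "web"}, Node: "n1"},
		{Labels: map[string]string{"app": "web"}, Node: "n3"},
	}
	term := Term{MatchLabel: "app", MatchValue: "web", TopologyKey: "zone", Weight: 2}
	topo := BuildTopologyScores(term, pods, nodeLabels)
	fmt.Println(ScoreNodes(term, topo, nodeLabels))
}
```

In this sketch the work is one pass over existing Pods plus one pass over candidate nodes, rather than a nested loop over both.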

Future Plans – Continue expanding the scheduler with more resource dimensions (disk I/O, network I/O, NUMA, HT), explore global‑optimal and re‑scheduling techniques, improve infrastructure elasticity, and enhance mixed‑workload placement and service‑quality evaluation.

Recommended reading links are provided at the end of the original article.
