How Koordinator Boosts CPU Utilization and Cuts Costs in Large‑Scale Mixed Workloads

Koordinator, an open‑source cloud‑native mixed‑workload scheduler born from Alibaba’s internal container orchestration experience, enables Xiaohongshu to reclaim idle resources, improve CPU utilization beyond 45%, reduce resource costs by millions of core‑hours, and seamlessly integrate Kubernetes with YARN for batch and AI workloads.

Alibaba Cloud Native

Nov 24, 2023

Background

As Xiaohongshu’s business expands rapidly, online and offline services demand ever more compute. Meanwhile, many online clusters show low average daily CPU utilization because of tidal traffic patterns, business‑exclusive resource pools, and over‑provisioning for stability.

Since 2022 the Xiaohongshu container team has applied large‑scale mixed‑workload techniques to raise cluster efficiency and lower resource costs.

Technical Evolution – Four Stages

Stage 1: Reuse of Idle Resources

Early clusters contained many low‑utilization nodes because business‑exclusive pools fragmented resources. By deploying virtual‑kubelet to aggregate idle nodes from a metadata cluster and exposing them to transcoding workloads, the platform can allocate spare CPU to offline jobs while ensuring that offline pods are evicted promptly whenever online services need the capacity back.

Stage 2: Whole‑Machine Time‑Slice Reuse

During off‑peak hours, the platform scales down online services using HPA, frees whole machines, and runs offline pods (transcoding, training, etc.) on the vacated capacity. An early‑exit strategy for offline pods, combined with scheduler‑level pre‑emption, guarantees that online services can be fully restored before peak traffic.

Stage 3: Continuous Mixed‑Workload

To reduce resource fragmentation, business workloads are migrated from exclusive pools into a shared public pool. A resource‑view abstraction and fine‑grained scheduling policies (dynamic over‑commitment, load‑aware scheduling, hotspot eviction) raise overall CPU allocation, and the idle capacity left by low night‑time utilization remains available to offline jobs.

Stage 4: Fine‑Grained QoS and Interference Management

Both the scheduler side and the node side implement QoS guarantees: per‑pod QoS levels (latency‑sensitive, mid, batch), CPU burst enable/disable, priority weighting, NUMA awareness, cache and memory bandwidth caps, and OOM priority tuning. The node side also provides QoS‑aware eviction, interference detection, and resource reclamation.

Architecture Design and Implementation

The overall resource‑scheduling architecture consists of a unified scheduler that receives workloads from various publishing platforms and dispatches them as pods. Scheduler‑side features include:

Offline co‑scheduling

Secondary scheduling for hotspot eviction and fragment consolidation

Load‑aware scheduling based on CPU watermarks

Simulated resource view for offline jobs
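As a rough illustration of load‑aware scheduling based on CPU watermarks, the sketch below filters out nodes whose projected utilization would exceed a watermark and ranks the rest from least to most loaded. The `NodeLoad` type, the `filter_and_rank` function, and the 0.65 watermark are hypothetical; the article does not give the actual thresholds or scoring rules.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    name: str
    cpu_capacity_cores: float
    cpu_used_cores: float  # measured usage, not the sum of pod requests


def filter_and_rank(nodes, pod_cpu_cores, watermark=0.65):
    """Return candidate node names, least loaded first.

    A node is kept only if its projected utilization after placing the pod
    stays at or below the watermark. The default watermark is illustrative.
    """
    candidates = []
    for node in nodes:
        projected = (node.cpu_used_cores + pod_cpu_cores) / node.cpu_capacity_cores
        if projected <= watermark:
            candidates.append((projected, node.name))
    return [name for _, name in sorted(candidates)]


nodes = [
    NodeLoad("node-a", 64, 20),  # projected 24/64 = 0.375
    NodeLoad("node-b", 64, 40),  # projected 44/64 = 0.6875, above the watermark
    NodeLoad("node-c", 32, 6),   # projected 10/32 = 0.3125
]
ranked = filter_and_rank(nodes, pod_cpu_cores=4)
print(ranked)
```

Scoring on measured usage rather than on requested CPU is what distinguishes this from default request‑based placement: a node can be fully allocated on paper and still sit well below the watermark.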

Node‑side capabilities include:

QoS enforcement (core binding, memory bandwidth, cache allocation)

CPU suppression (BVT) and memory eviction policies

Batch resource reporting and kernel‑level metric collection (PSI, sched info)

Interference detection using CPI, PSI, and business metrics
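To make the PSI part of metric collection more concrete, here is a minimal sketch that parses the text format Linux exposes under `/proc/pressure/<resource>` and flags pressure from the 10‑second average. The `parse_psi` and `is_under_pressure` helpers and the 10% threshold are assumptions for illustration; the article does not describe how the node agent actually consumes PSI.

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/<resource> file.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value) for key, value in (f.split("=") for f in fields)
        }
    return result


def is_under_pressure(psi, threshold_pct=10.0):
    """Flag pressure when the 10-second 'some' average exceeds the threshold.

    The threshold is a placeholder, not a value taken from the article.
    """
    return psi["some"]["avg10"] > threshold_pct


sample = (
    "some avg10=12.50 avg60=4.10 avg300=1.02 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"], is_under_pressure(psi))
```

On a real node the input would come from reading `/proc/pressure/cpu`, `/proc/pressure/memory`, or `/proc/pressure/io`; the sample string keeps the sketch runnable on any machine.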

Offline Scheduling Resource View

Offline‑available resources are calculated as:

OfflineAvailable = TotalNodeResources – ReservedResources – OnlineActualUsage

After smoothing noisy usage data, a relatively stable offline resource estimate is derived and visualized (green area in the original charts).
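The formula and the smoothing step can be sketched as follows. The article does not say which smoothing method is used, so the rolling maximum here is an assumption, chosen because it errs on the side of protecting online services; the function names and sample numbers are also illustrative.

```python
def offline_available(total, reserved, online_usage):
    """OfflineAvailable = TotalNodeResources - ReservedResources - OnlineActualUsage."""
    return max(0.0, total - reserved - online_usage)


def smooth_peak(samples, window=3):
    """Rolling maximum over the last `window` samples.

    Using the recent peak instead of the instantaneous value keeps the
    offline estimate from jumping up the moment online usage dips.
    """
    smoothed = []
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        smoothed.append(max(samples[start:i + 1]))
    return smoothed


total_cores, reserved_cores = 64.0, 4.0
online_usage = [20.0, 35.0, 22.0, 21.0, 19.0, 30.0]  # noisy per-interval samples

estimate = [
    offline_available(total_cores, reserved_cores, usage)
    for usage in smooth_peak(online_usage)
]
print(estimate)  # [40.0, 25.0, 25.0, 25.0, 38.0, 30.0]
```

Without smoothing, the same samples would give 40, 25, 38, 39, 41, 30 cores, a view that swings by up to 13 cores between intervals and would cause offline pods to be admitted and then evicted shortly afterwards.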

QoS Levels

Workloads are classified into three QoS tiers:

latency‑sensitive : highest guarantee for latency‑critical services (e.g., search promotion)

mid : default level tolerating some interference (e.g., gateways, Java micro‑services)

batch : lowest level for non‑latency‑critical batch jobs (e.g., transcoding, Spark, Flink, training)

QoS Guarantees

CPU policies include burst enablement, priority weighting, core binding types (exclusive, share, reclaimed), NUMA preferences, cache and memory bandwidth caps. Memory policies adjust OOM priority and reclamation thresholds per QoS tier.
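One way to picture these per‑tier policies is as a lookup table keyed by QoS level. The sketch below is hypothetical throughout: the `QoSPolicy` fields, the weights, and the `oom_score_adj` values are placeholders, and only the three tier names come from the article.

```python
from dataclasses import dataclass
from enum import Enum


class QoS(Enum):
    LATENCY_SENSITIVE = "latency-sensitive"
    MID = "mid"
    BATCH = "batch"


@dataclass(frozen=True)
class QoSPolicy:
    cpu_burst: bool      # allow short bursts above the CPU limit
    cpu_weight: int      # relative CPU priority, higher wins under contention
    binding: str         # "exclusive", "share", or "reclaimed"
    oom_score_adj: int   # Linux range is -1000..1000; higher is killed first


# All values below are placeholders for illustration only.
POLICIES = {
    QoS.LATENCY_SENSITIVE: QoSPolicy(True, 100, "share", -900),
    QoS.MID: QoSPolicy(True, 50, "share", 0),
    QoS.BATCH: QoSPolicy(False, 1, "reclaimed", 900),
}


def policy_for(level: str) -> QoSPolicy:
    """Look up the policy for a QoS level given by its string name."""
    return POLICIES[QoS(level)]


batch = policy_for("batch")
print(batch.binding, batch.oom_score_adj)
```

Keeping the policy in one table means a node agent only needs a pod's QoS label to derive every CPU and memory setting, rather than carrying each knob on the pod separately.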

Core Binding Types

exclusive (not recommended): full cpuset binding, CCD awareness, NUMA binding, exclusive use – for ultra‑sensitive services.

share : cpuset binding with optional NUMA, can coexist with other workloads – for typical micro‑services.

reclaimed : no cpuset binding, kernel decides core allocation, suitable for batch jobs.
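The three binding types can be illustrated with a simplified cpuset assignment, assuming a flat list of CPU IDs. This sketch deliberately ignores NUMA and CCD topology, and the `assign_cpusets` function and pod names are hypothetical rather than part of Koordinator.

```python
def assign_cpusets(all_cpus, pods):
    """Assign CPUs per binding type.

    pods: list of (name, binding, cores_requested).
    Returns {name: list of CPU IDs, or None when no cpuset is applied}.
    """
    free = sorted(all_cpus)
    result = {}

    # Exclusive pods first: carve dedicated cores out of the pool.
    for name, binding, cores in pods:
        if binding == "exclusive":
            if cores > len(free):
                raise ValueError(f"not enough free CPUs for {name}")
            result[name], free = free[:cores], free[cores:]

    for name, binding, _ in pods:
        if binding == "share":
            # Share pods are bound to the remaining pool and coexist on it.
            result[name] = list(free)
        elif binding == "reclaimed":
            # No cpuset: the kernel scheduler decides where the pod runs.
            result[name] = None
        elif binding != "exclusive":
            raise ValueError(f"unknown binding type: {binding}")
    return result


pods = [
    ("search", "exclusive", 2),
    ("gateway", "share", 2),
    ("transcode", "reclaimed", 4),
]
assignment = assign_cpusets(range(8), pods)
print(assignment)
```

The example also shows why exclusive binding is costly: every core handed to an exclusive pod shrinks the pool that all share pods can use, whether or not the exclusive pod keeps those cores busy.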

Offline Eviction

When a node comes under memory pressure, or offline pods are starved of CPU for a prolonged period, the node side arbitrates based on offline priority, resource consumption, and runtime duration, then evicts lower‑priority offline pods.
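A minimal sketch of this arbitration might rank offline pods on the three criteria and evict until enough CPU is freed. The exact ordering is an assumption: here lower priority goes first, then larger consumers, then shorter runtimes (so less completed work is lost). The `OfflinePod` type and `pick_victims` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OfflinePod:
    name: str
    priority: int           # lower value means less important
    cpu_cores: float        # current CPU consumption
    runtime_seconds: float  # how long the pod has been running


def pick_victims(pods, cores_to_free):
    """Choose offline pods to evict until at least `cores_to_free` is released."""
    ordered = sorted(
        pods,
        key=lambda p: (p.priority, -p.cpu_cores, p.runtime_seconds),
    )
    victims, freed = [], 0.0
    for pod in ordered:
        if freed >= cores_to_free:
            break
        victims.append(pod.name)
        freed += pod.cpu_cores
    return victims


pods = [
    OfflinePod("spark-1", priority=2, cpu_cores=8, runtime_seconds=3600),
    OfflinePod("transcode-1", priority=1, cpu_cores=4, runtime_seconds=120),
    OfflinePod("transcode-2", priority=1, cpu_cores=4, runtime_seconds=900),
    OfflinePod("train-1", priority=3, cpu_cores=16, runtime_seconds=7200),
]
print(pick_victims(pods, cores_to_free=6))  # ['transcode-1', 'transcode-2']
```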

Mixed‑Workload Scenarios

Xiaohongshu runs a variety of offline jobs, including near‑real‑time transcoding, Flink streaming/batch, Spark on YARN, CV/NLP inference, and training workloads, all containerized except Spark on YARN.

By exposing a unified K8s‑based offline scheduling layer, these jobs share the same compute pool with online services, receiving differentiated QoS while maximizing overall resource efficiency.

K8s + YARN Mixed‑Workload Solution

To alleviate Spark job queuing on YARN while using idle capacity in online clusters, Xiaohongshu adopted a hybrid K8s + YARN approach, using the Koordinator community’s koord‑yarn‑operator to synchronize resource views in both directions.

Key components:

Scheduler side : koord‑yarn‑operator synchronizes offline resource totals between K8s and YARN.

Node side : copilot (NodeManager proxy) and Neptune‑agent/koordlet handle YARN task control, offline pod management, conflict resolution, and pre‑emptive eviction.

Resource‑sync formulas:

K8s → YARN: YARNOfflineTotal = OfflineAvailable – K8sAllocated

YARN → K8s: K8sOfflineTotal = OfflineAvailable – YARNAllocated

Both schedulers make independent decisions based on their local offline resource view, with arbitration logic on the node side to prevent over‑allocation.
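A small worked example shows how the two views are derived and why node‑side arbitration is still needed. The function names and numbers below are illustrative assumptions; only the two formulas come from the text.

```python
def sync_views(offline_available, k8s_allocated, yarn_allocated):
    """Compute each scheduler's offline total from the other's allocations."""
    yarn_offline_total = max(0.0, offline_available - k8s_allocated)
    k8s_offline_total = max(0.0, offline_available - yarn_allocated)
    return k8s_offline_total, yarn_offline_total


def node_overcommitted(offline_available, k8s_allocated, yarn_allocated):
    """True when combined offline allocations exceed what the node can offer."""
    return k8s_allocated + yarn_allocated > offline_available


offline_available = 40.0  # cores
k8s_allocated, yarn_allocated = 10.0, 12.0

k8s_total, yarn_total = sync_views(offline_available, k8s_allocated, yarn_allocated)
print(k8s_total, yarn_total)  # 28.0 30.0

# If both schedulers fill their own view before the next sync,
# the node ends up with 28 + 30 = 58 cores allocated against 40 available.
print(node_overcommitted(offline_available, k8s_total, yarn_total))  # True
```

Each view is correct at the moment it is computed, but because the two schedulers act independently between syncs, their combined allocations can exceed the node's offline capacity. That gap is what the node‑side arbitration closes.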

Community Contributions and Benefits

Since its open‑source launch in April 2022, Koordinator has attracted many contributors. Xiaohongshu has been an active community member, delivering runtime‑proxy, YARN‑K8s integration, and large‑scale deployments.

Results to date:

CPU utilization of mixed clusters exceeds 45% (some clusters reach 55%).

Offline mixing raises online CPU usage by 8‑15% and storage‑cluster usage by over 20%.

Provides millions of core‑hours of low‑cost compute for offline jobs.

CPU allocation rates surpass 125%, dramatically reducing resource fragmentation.

Future Roadmap

Upcoming goals focus on hybrid‑cloud unified scheduling, further resource‑efficiency gains, and stronger QoS guarantees:

Support mixed‑workload scheduling for big data and AI tasks across hybrid clouds.

Advance resource pooling, quota‑based delivery, and aggressive over‑sell techniques to push utilization higher and cut costs.

Develop QoS‑aware scheduling, interference detection, and secure container mechanisms to address deep‑mixing challenges.

Koordinator Community Plans

Future releases will prioritize:

Scheduler performance optimizations (equivalence class scheduling).

Network QoS (bandwidth guarantees, request/limit model).

Big‑data workload support (Gang scheduling, Hadoop YARN QoS adaptation).

Resource interference detection using low‑level metrics.
