How Koordinator Boosts CPU Utilization and Cuts Costs in Large‑Scale Mixed Workloads
Koordinator, an open‑source cloud‑native mixed‑workload scheduler born from Alibaba’s internal container orchestration experience, enables Xiaohongshu to reclaim idle resources, improve CPU utilization beyond 45%, reduce resource costs by millions of core‑hours, and seamlessly integrate Kubernetes with YARN for batch and AI workloads.
Background
As Xiaohongshu’s business rapidly expands, online and offline services demand ever‑greater compute resources, while many online clusters suffer low daily CPU utilization due to tidal usage patterns, exclusive resource pools, and over‑provisioning for stability.
Since 2022 the Xiaohongshu container team has applied large‑scale mixed‑workload techniques to raise cluster efficiency and lower resource costs.
Technical Evolution – Four Stages
Stage 1: Reuse of Idle Resources
Early clusters contained many low‑utilization nodes because business‑exclusive pools fragmented resources. By deploying virtual‑kubelet to aggregate idle node resources through a metadata cluster and exposing them to transcoding workloads, the platform can allocate spare CPU to offline jobs while ensuring that offline pods are evicted immediately whenever online services need the resources back.
Stage 2: Whole‑Machine Time‑Slice Reuse
During off‑peak hours, the platform scales down online services using HPA, frees whole machines, and runs offline pods (transcoding, training, etc.) on the vacated capacity. An early‑exit strategy for offline pods, combined with scheduler‑level preemption, guarantees that online services are fully restored before peak traffic.
Stage 3: Continuous Mixed‑Workload
To reduce resource fragmentation, business workloads are migrated from exclusive pools into a shared public pool. Resource‑view abstraction and fine‑grained scheduling policies (dynamic over‑commitment, load‑aware scheduling, hotspot eviction) improve the overall CPU allocation rate and address low night‑time utilization.
Stage 4: Fine‑Grained QoS and Interference Management
Both the scheduler side and the node side implement QoS guarantees: per‑pod QoS levels (latency‑sensitive, mid, batch), CPU burst enable/disable, priority weighting, NUMA awareness, cache and memory bandwidth caps, and OOM priority tuning. The node side also provides QoS‑aware eviction, interference detection, and resource reclamation.
Architecture Design and Implementation
The overall resource‑scheduling architecture consists of a unified scheduler that receives workloads from various publishing platforms and dispatches them as pods. Scheduler‑side features include:
Offline co‑scheduling
Secondary scheduling for hotspot eviction and fragment consolidation
Load‑aware scheduling based on CPU watermarks
Simulated resource view for offline jobs
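As an illustration of load‑aware scheduling based on CPU watermarks, the sketch below filters out nodes whose projected utilization would cross a watermark and then prefers the least‑loaded remaining node. The names NodeLoad and pick_node, and the 0.65 default watermark, are assumptions made for this example, not values or APIs from the article or from Koordinator.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NodeLoad:
    name: str
    cpu_capacity: float  # cores
    cpu_used: float      # cores, recent measured usage (hypothetical input)


def pick_node(nodes: Iterable[NodeLoad], pod_cpu_estimate: float,
              watermark: float = 0.65) -> Optional[str]:
    """Return the least-loaded node that stays at or below the watermark
    after the pod is placed, or None if no node qualifies."""
    best_name, best_util = None, None
    for node in nodes:
        projected = (node.cpu_used + pod_cpu_estimate) / node.cpu_capacity
        if projected > watermark:
            continue  # filter step: placing the pod would cross the watermark
        if best_util is None or projected < best_util:
            best_name, best_util = node.name, projected  # score step: lowest load wins
    return best_name


nodes = [NodeLoad("a", 32, 20), NodeLoad("b", 32, 8), NodeLoad("c", 32, 12)]
chosen = pick_node(nodes, pod_cpu_estimate=4)
```

A real scheduler plugin would combine this with request‑based fitting and other scores; the point here is only that measured load, not just requested CPU, drives placement.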
Node‑side capabilities include:
QoS enforcement (core binding, memory bandwidth, cache allocation)
CPU suppression (BVT) and memory eviction policies
Batch resource reporting and kernel‑level metric collection (PSI, sched info)
Interference detection using CPI, PSI, and business metrics
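One common way to use CPI (cycles per instruction) for interference detection is to compare the current value against a baseline built from the workload's own history. The sketch below assumes a simple mean‑plus‑k‑standard‑deviations threshold; the function names and the choice of k = 3 are illustrative assumptions, and the article does not describe the detection rule actually used.

```python
from statistics import mean, stdev
from typing import Sequence


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction over one sampling window."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def is_interfered(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Flag the current CPI sample if it exceeds mean + k * stdev of past samples.

    With fewer than two historical samples there is no baseline, so nothing is flagged.
    """
    if len(history) < 2:
        return False
    threshold = mean(history) + k * stdev(history)
    return current > threshold


baseline = [1.0, 1.1, 0.9, 1.0]
```

In practice a CPI anomaly would typically be cross‑checked against PSI and business metrics before any action is taken, since CPI alone also moves with changes in the workload's own instruction mix.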
Offline Scheduling Resource View
Offline‑available resources are calculated as:
OfflineAvailable = TotalNodeResources – ReservedResources – OnlineActualUsage
After smoothing noisy usage data, a relatively stable offline resource estimate is derived and visualized (green area in the original charts).
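The calculation can be sketched as follows. The article does not say which smoothing method is used, so the exponentially weighted moving average here is an assumption, as are the function names, the alpha value, and the choice to floor the result at zero.

```python
from typing import Iterable


def smooth_usage(samples: Iterable[float], alpha: float = 0.3) -> float:
    """Exponentially weighted moving average of online usage samples (cores).

    The smoothing method is an assumption for illustration only.
    """
    smoothed = None
    for sample in samples:
        smoothed = sample if smoothed is None else alpha * sample + (1 - alpha) * smoothed
    if smoothed is None:
        raise ValueError("at least one usage sample is required")
    return smoothed


def offline_available(total: float, reserved: float, online_usage: float) -> float:
    """OfflineAvailable = TotalNodeResources - ReservedResources - OnlineActualUsage.

    Floored at zero (an assumption) so a usage spike never yields negative capacity.
    """
    return max(0.0, total - reserved - online_usage)


# A 64-core node with 4 cores reserved and noisy online usage samples.
usage = smooth_usage([18.0, 25.0, 19.0, 22.0])
capacity_for_offline = offline_available(total=64, reserved=4, online_usage=usage)
```

Smoothing matters because the offline scheduler consumes this number as capacity: if it followed every spike in online usage, offline pods would be admitted and evicted in rapid succession.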
QoS Levels
Workloads are classified into three QoS tiers:
latency‑sensitive : highest guarantee for latency‑critical services (e.g., search promotion)
mid : default level tolerating some interference (e.g., gateways, Java micro‑services)
batch : lowest level for non‑latency‑critical batch jobs (e.g., transcoding, Spark, Flink, training)
QoS Guarantees
CPU policies include burst enablement, priority weighting, core binding types (exclusive, share, reclaimed), NUMA preferences, cache and memory bandwidth caps. Memory policies adjust OOM priority and reclamation thresholds per QoS tier.
Core Binding Types
exclusive (not recommended): full cpuset binding, CCD awareness, NUMA binding, exclusive use – for ultra‑sensitive services.
share : cpuset binding with optional NUMA, can coexist with other workloads – for typical micro‑services.
reclaimed : no cpuset binding, kernel decides core allocation, suitable for batch jobs.
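To make the relationship between QoS tiers and core binding types concrete, here is a minimal sketch. The tier and binding names come from the text above, but the mapping rule itself (batch gets reclaimed, everything else gets share unless a service explicitly opts in to exclusive) is an assumption for illustration, as are the names QoS, CoreBinding, and binding_for.

```python
from enum import Enum


class QoS(Enum):
    LATENCY_SENSITIVE = "latency-sensitive"
    MID = "mid"
    BATCH = "batch"


class CoreBinding(Enum):
    EXCLUSIVE = "exclusive"   # full cpuset binding, not shared with other workloads
    SHARE = "share"           # cpuset binding, may coexist with other workloads
    RECLAIMED = "reclaimed"   # no cpuset binding, kernel decides core allocation


def binding_for(qos: QoS, ultra_sensitive: bool = False) -> CoreBinding:
    """Pick a core binding type for a pod (hypothetical policy, not the article's)."""
    if qos is QoS.BATCH:
        return CoreBinding.RECLAIMED
    if qos is QoS.LATENCY_SENSITIVE and ultra_sensitive:
        # Exclusive binding is listed as not recommended, so it is opt-in only.
        return CoreBinding.EXCLUSIVE
    return CoreBinding.SHARE
```

Keeping exclusive as an explicit opt‑in reflects the "not recommended" note: exclusive cores cannot be lent to other workloads, which works against the utilization goals of mixing.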
Offline Eviction
When a node experiences memory pressure or prolonged CPU starvation of offline workloads, the node side arbitrates based on offline priority, resource consumption, and runtime duration, then evicts lower‑priority offline pods.
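The arbitration can be pictured as a sort followed by a greedy selection. The article names the three inputs (priority, resource consumption, runtime duration) but not how they are weighed, so the ordering below is an assumption: lowest priority first, then highest CPU consumption, then shortest runtime so that less completed work is lost. OfflinePod, eviction_order, and select_victims are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class OfflinePod:
    name: str
    priority: int     # higher value means more important
    cpu_used: float   # cores currently consumed
    runtime_s: float  # seconds since the pod started


def eviction_order(pods: Sequence[OfflinePod]) -> List[OfflinePod]:
    """Order candidates: low priority, then high consumption, then short runtime."""
    return sorted(pods, key=lambda p: (p.priority, -p.cpu_used, p.runtime_s))


def select_victims(pods: Sequence[OfflinePod], cpu_to_free: float) -> List[str]:
    """Greedily pick pods in eviction order until enough CPU would be released."""
    victims, freed = [], 0.0
    for pod in eviction_order(pods):
        if freed >= cpu_to_free:
            break
        victims.append(pod.name)
        freed += pod.cpu_used
    return victims


pods = [
    OfflinePod("train", priority=5, cpu_used=8, runtime_s=7200),
    OfflinePod("transcode-a", priority=1, cpu_used=2, runtime_s=60),
    OfflinePod("transcode-b", priority=1, cpu_used=4, runtime_s=600),
]
victims = select_victims(pods, cpu_to_free=5)
```

Here the two low‑priority transcoding pods are chosen before the training job, and the larger consumer goes first, so the node recovers the requested 5 cores without touching higher‑priority work.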
Mixed‑Workload Scenarios
Xiaohongshu runs a variety of offline jobs, including near‑real‑time transcoding, Flink streaming/batch, Spark on YARN, CV/NLP inference, and training workloads, all containerized except Spark on YARN.
By exposing a unified K8s‑based offline scheduling layer, these jobs share the same compute pool with online services, receiving differentiated QoS while maximizing overall resource efficiency.
K8s + YARN Mixed‑Workload Solution
To alleviate Spark job queuing on YARN while leveraging idle capacity in online clusters, Xiaohongshu adopted a K8s‑on‑YARN hybrid approach using the Koordinator community’s koord‑yarn‑operator for bidirectional resource view synchronization.
Key components:
Scheduler side : koord‑yarn‑operator synchronizes offline resource totals between K8s and YARN.
Node side : copilot (NodeManager proxy) and Neptune‑agent/koordlet handle YARN task control, offline pod management, conflict resolution, and pre‑emptive eviction.
Resource‑sync formulas:
K8s → YARN: YARNOfflineTotal = OfflineAvailable – K8sAllocated
YARN → K8s: K8sOfflineTotal = OfflineAvailable – YARNAllocated
Both schedulers make independent decisions based on their local offline resource view, with arbitration logic on the node side to prevent over‑allocation.
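A short worked example of the two formulas, using made‑up numbers. The function names and the zero floor are assumptions for illustration; the over_allocated check only shows why node‑side arbitration is still needed when each scheduler acts on a view that may be slightly stale.

```python
def yarn_offline_total(offline_available: float, k8s_allocated: float) -> float:
    """K8s -> YARN: capacity YARN may use, after K8s offline pods are subtracted."""
    return max(0.0, offline_available - k8s_allocated)


def k8s_offline_total(offline_available: float, yarn_allocated: float) -> float:
    """YARN -> K8s: capacity K8s may use, after YARN containers are subtracted."""
    return max(0.0, offline_available - yarn_allocated)


def over_allocated(offline_available: float, k8s_allocated: float,
                   yarn_allocated: float) -> bool:
    """True when both schedulers together hold more than the node can offer."""
    return k8s_allocated + yarn_allocated > offline_available


# Node offers 40 offline cores; K8s offline pods hold 10, YARN containers hold 25.
yarn_view = yarn_offline_total(40, k8s_allocated=10)   # 30 cores visible to YARN
k8s_view = k8s_offline_total(40, yarn_allocated=25)    # 15 cores visible to K8s

# If both schedulers then fill their remaining headroom before the next sync
# (YARN +5, K8s +5), the node is over-allocated and must arbitrate.
conflict = over_allocated(40, k8s_allocated=15, yarn_allocated=30)
```

Each view leaves 5 cores of headroom (30 − 25 for YARN, 15 − 10 for K8s), but it is the same 5 cores, which is exactly the race the node‑side arbitration resolves.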
Community Contributions and Benefits
Since its open‑source launch in April 2022, Koordinator has attracted many contributors. Xiaohongshu has been an active community member, delivering runtime‑proxy, YARN‑K8s integration, and large‑scale deployments.
Results to date:
CPU utilization of mixed clusters exceeds 45% (some clusters reach 55%).
Offline mixing raises online CPU usage by 8‑15% and storage‑cluster usage by over 20%.
Provides millions of core‑hours of low‑cost compute for offline jobs.
CPU allocation rates surpass 125%, dramatically reducing resource fragmentation.
Future Roadmap
Upcoming goals focus on hybrid‑cloud unified scheduling, further resource‑efficiency gains, and stronger QoS guarantees:
Support mixed‑workload scheduling for big data and AI tasks across hybrid clouds.
Advance resource pooling, quota‑based delivery, and aggressive over‑sell techniques to push utilization higher and cut costs.
Develop QoS‑aware scheduling, interference detection, and secure container mechanisms to address deep‑mixing challenges.
Koordinator Community Plans
Future releases will prioritize:
Scheduler performance optimizations (equivalence class scheduling).
Network QoS (bandwidth guarantees, request/limit model).
Big‑data workload support (Gang scheduling, Hadoop YARN QoS adaptation).
Resource interference detection using low‑level metrics.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.