Boost Cluster Efficiency with Koordinator’s K8s‑YARN Co‑Location Solution
Koordinator extends its open‑source container scheduler to co‑locate Kubernetes Pods and Hadoop YARN tasks, so that over‑provisioned batch resources can be shared without modifying YARN. In Xiaohongshu’s production clusters, the solution has raised average CPU utilization by 8‑10 % while keeping the offline task eviction rate below 1 %.
Background
Koordinator is an open‑source project from Alibaba that originally focused on container scheduling within the Kubernetes ecosystem. While many workloads have moved to K8s, a large number of big‑data jobs still run on Apache Hadoop YARN, and cloud providers continue to offer YARN‑based services such as E‑MapReduce.
Motivation and Community Effort
To extend Koordinator’s offline co‑location capabilities, developers from Alibaba Cloud, Xiaohongshu, and Ant Financial launched a joint Hadoop YARN‑K8s co‑location project. The solution enables over‑provisioned batch resources to be shared with YARN, and it is already deployed in Xiaohongshu’s production environment.
Design Principles
YARN remains the submission entry for offline jobs.
The solution builds on the open‑source Hadoop YARN without invasive modifications.
Co‑located resources can be consumed by both K8s Pods and YARN tasks on the same node.
QoS policies are managed by Koordlet and are compatible with YARN task runtime.
Architecture
The ResourceManager (RM) and NodeManager (NM) remain the core YARN components; in the co‑located environment, the NM runs as a container. Koordinator adds a koord‑yarn‑operator that synchronizes batch resource quotas to the YARN RM. Resource isolation is enforced through cgroup paths under the besteffort QoS class.
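The quota synchronization can be illustrated with a minimal sketch of the unit conversion such a step might involve. It assumes that a node's batch resources are published as `kubernetes.io/batch-cpu` in milli-cores and `kubernetes.io/batch-memory` in bytes, and that the YARN side expects whole vcores and memory in MiB. The names `YarnNodeResource` and `batch_to_yarn_resource` are hypothetical and are not taken from koord‑yarn‑operator.

```python
from dataclasses import dataclass

MILLI_PER_CORE = 1000          # batch CPU is assumed to be expressed in milli-cores
BYTES_PER_MIB = 1024 * 1024    # YARN memory is assumed to be expressed in MiB


@dataclass(frozen=True)
class YarnNodeResource:
    """Capacity that would be reported to the YARN RM for one node (hypothetical type)."""
    vcores: int
    memory_mib: int


def batch_to_yarn_resource(batch_cpu_milli: int, batch_memory_bytes: int) -> YarnNodeResource:
    """Convert batch allocatable amounts into YARN units, rounding down.

    Rounding down keeps YARN from being offered more than the node's
    batch allocatable actually contains.
    """
    if batch_cpu_milli < 0 or batch_memory_bytes < 0:
        raise ValueError("batch resources must be non-negative")
    return YarnNodeResource(
        vcores=batch_cpu_milli // MILLI_PER_CORE,
        memory_mib=batch_memory_bytes // BYTES_PER_MIB,
    )


if __name__ == "__main__":
    # Example: 15.5 cores and 32 GiB of batch resources on a node.
    print(batch_to_yarn_resource(15500, 32 * 1024**3))
```

A real operator would also have to account for batch resources already requested by K8s Pods on the same node; that accounting is left out of this sketch.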
A sidecar module, koord‑yarn‑copilot, collects task metadata and resource metrics and carries out eviction actions. QoS strategies remain in Koordlet and are exposed to the copilot through a plugin interface, which preserves extensibility for future resource frameworks.
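The division of labor between QoS strategy and framework‑specific sidecar might look roughly like the following sketch. Everything in it is an assumption made for illustration: the `FrameworkCopilot` interface, the `TaskInfo` fields, and the victim selection rule (lowest priority first, then highest CPU usage) are hypothetical and do not describe Koordlet's actual plugin API or eviction policy.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class TaskInfo:
    """Metadata and metrics a copilot-like sidecar might report for one task."""
    container_id: str
    priority: int            # assumption: a lower value means a less important task
    cpu_used_cores: float


class FrameworkCopilot(Protocol):
    """Hypothetical plugin interface implemented once per resource framework."""

    def list_tasks(self) -> List[TaskInfo]: ...

    def evict(self, container_id: str) -> None: ...


def select_victims(tasks: List[TaskInfo], cores_to_release: float) -> List[TaskInfo]:
    """Choose tasks to evict until enough CPU would be released.

    Ordering rule (an assumption): lowest priority first, and among equal
    priorities the task using the most CPU first.
    """
    victims: List[TaskInfo] = []
    released = 0.0
    for task in sorted(tasks, key=lambda t: (t.priority, -t.cpu_used_cores)):
        if released >= cores_to_release:
            break
        victims.append(task)
        released += task.cpu_used_cores
    return victims


def enforce_cpu_eviction(copilot: FrameworkCopilot, cores_to_release: float) -> List[str]:
    """Strategy side: decide what to evict, then delegate the action to the copilot."""
    victims = select_victims(copilot.list_tasks(), cores_to_release)
    for task in victims:
        copilot.evict(task.container_id)
    return [task.container_id for task in victims]


class FakeYarnCopilot:
    """In-memory stand-in for a YARN copilot, used only to exercise the sketch."""

    def __init__(self, tasks: List[TaskInfo]) -> None:
        self._tasks = list(tasks)
        self.evicted: List[str] = []

    def list_tasks(self) -> List[TaskInfo]:
        return list(self._tasks)

    def evict(self, container_id: str) -> None:
        self.evicted.append(container_id)
        self._tasks = [t for t in self._tasks if t.container_id != container_id]


if __name__ == "__main__":
    copilot = FakeYarnCopilot([
        TaskInfo("container_a", priority=1, cpu_used_cores=2.0),
        TaskInfo("container_b", priority=0, cpu_used_cores=1.0),
        TaskInfo("container_c", priority=0, cpu_used_cores=3.0),
    ])
    print(enforce_cpu_eviction(copilot, cores_to_release=3.5))
```

Because the strategy only depends on the two methods of the interface, another resource framework could be supported by supplying a different copilot implementation without touching the selection logic.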
Production Experience at Xiaohongshu
Facing heavy Spark workloads that congested offline clusters, Xiaohongshu leveraged the co‑location solution to keep the YARN submission interface unchanged while moving tasks onto idle online resources. Key techniques included RemoteShuffleService to mitigate local‑disk bottlenecks and fine‑grained priority and QoS policies for different job types.
Results: the deployment covers tens of thousands of online nodes that provide hundreds of thousands of CPU cores, the offline task eviction rate stays below 1 %, and average CPU utilization has risen by 8‑10 % (with some nodes exceeding 45 %). The benefits continue to grow as more workloads are added.
Getting Started
The K8s‑YARN co‑location features are near completion; the Koordinator team is preparing the final release. Interested contributors can join the discussion at the community forum and follow the design documentation for detailed implementation steps.
Alibaba Cloud Native
