Big Data · 22 min read

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster Resource Scheduling

This article introduces Amiya, Bilibili's self‑developed overcommit component that dynamically increases the Yarn memory and vCore capacity of its offline big‑data clusters. It details Amiya's architecture and the key implementation of its overcommit, eviction, and mixed‑deployment strategies, and evaluates the impact on resource utilization.

High Availability Architecture

In the past year Bilibili's offline platform faced two major challenges: rapid expansion of the offline cluster leading to high pending tasks, and the need to improve resource utilization without adding physical machines. To address these, the team pursued internal overcommit on individual nodes and external mixed‑deployment with the cloud platform.

Amiya was built to close the resource gap. Deployed on more than 5,000 NodeManagers, it has provided roughly 683 TB of additional memory and 137k vCores to Yarn. The component consists of several modules: AmiyaContext, StateStoreManager, CheckPointManager, NodeResourceManager, OperatorManager, InspectManager, and AuditManager, which handle environment detection, metric collection, state checkpointing, core overcommit logic, interaction with Yarn/K8s, inspection, and audit logging, respectively.

The overcommit logic follows the principle that users request more resources than they actually use. NodeResourceManager reads CPU and memory usage percentages (both overall and per‑NodeManager), compares them against configurable thresholds (OverCommitThreshold, DropOffThreshold), and decides whether to increase (OverCommit), decrease (DropOff), or keep (Keep) resources. The desired change is multiplied by a change‑ratio and added to the current Yarn allocation, then subjected to three safeguards: a maximum ratio based on physical capacity, a minimum change range to filter noise, and a minimum interval between adjustments.
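The decision loop and its safeguards can be sketched as follows; the function names, concrete values, and clamping order are illustrative assumptions, not Amiya's actual code (the time‑based minimum‑interval safeguard is omitted):

```go
package main

import "fmt"

// Action is the outcome of comparing node usage against the two thresholds.
type Action int

const (
	Keep Action = iota
	OverCommit
	DropOff
)

// decide mirrors the threshold comparison described above: usage above
// DropOffThreshold forces a reduction, usage below OverCommitThreshold
// leaves headroom to overcommit, and anything in between keeps the value.
func decide(usedPct, overCommitThreshold, dropOffThreshold float64) Action {
	switch {
	case usedPct > dropOffThreshold:
		return DropOff
	case usedPct < overCommitThreshold:
		return OverCommit
	default:
		return Keep
	}
}

// nextAllocation applies the change ratio and two of the three safeguards:
// a minimum change range that filters noise, and a hard cap relative to
// physical capacity.
func nextAllocation(current, desiredDelta, changeRatio, physical, maxRatio, minChange float64) float64 {
	delta := desiredDelta * changeRatio
	if delta > -minChange && delta < minChange {
		return current // change too small: treat as noise
	}
	next := current + delta
	if limit := physical * maxRatio; next > limit {
		next = limit // never advertise more than maxRatio x physical capacity
	}
	return next
}

func main() {
	fmt.Println(decide(0.45, 0.60, 0.85) == OverCommit) // low usage: overcommit
	fmt.Println(nextAllocation(200, 40, 0.5, 256, 1.5, 4))
}
```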

Resource‑limit optimization revealed that a uniform CPU/memory ratio caused imbalance across machine types. On 48‑core nodes (256 GB, 48 CPU), post‑overcommit CPU utilization stabilized around 70%, while 96‑core nodes (256 GB, 96 CPU) only reached ~30%. By raising the memory overcommit factor to 1.5× on 256 GB machines and granting extra memory to the 96‑core nodes, CPU utilization rose to ~70% while memory usage remained high, achieving a more balanced ratio of 3–4 GB per vCore.
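The quoted figures can be checked with a short calculation; the 256 GB capacity and 1.5× memory factor come from the paragraph above, while the advertised vCore counts are illustrative:

```go
package main

import "fmt"

// gbPerVCore computes memory-per-vCore after applying a memory
// overcommit factor to the node's physical memory.
func gbPerVCore(physicalGB, memFactor float64, vCores int) float64 {
	return physicalGB * memFactor / float64(vCores)
}

func main() {
	// A 256 GB machine exposed to Yarn with a 1.5x memory factor
	// lands in the 3-4 GB-per-vCore band at these vCore counts:
	fmt.Printf("%.1f GB/vCore at 96 vCores\n", gbPerVCore(256, 1.5, 96))
	fmt.Printf("%.1f GB/vCore at 128 vCores\n", gbPerVCore(256, 1.5, 128))
}
```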

Eviction optimization is split into three levels: Container eviction (triggered after a DropOff caused by memory pressure, with priority‑aware skipping and an "ExtremeKill" fallback), Application eviction (targeting large‑disk jobs when SSD usage exceeds a threshold), and Node eviction using K8s‑style taints (OOMTaint, HighLoadTaint, HighDiskTaint, LowResourceTaint, NeedToStopTaint). These mechanisms keep the cluster stable while maintaining high overcommit rates.
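The priority‑aware container‑eviction level could be sketched like this; the container type, the priority ordering, and the function shape are assumptions made for illustration, not Amiya's implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// container is a minimal stand-in for a Yarn container.
type container struct {
	ID       string
	Priority int // higher value = more important (assumption for this sketch)
	MemoryMB int
}

// pickVictims walks containers from lowest to highest priority, skipping
// protected high-priority work, until enough memory is reclaimed. When
// extremeKill is set (the "ExtremeKill" fallback), priority is ignored.
func pickVictims(cs []container, needMB, protectedPriority int, extremeKill bool) []string {
	sorted := append([]container(nil), cs...) // don't mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Priority < sorted[j].Priority })

	var victims []string
	reclaimed := 0
	for _, c := range sorted {
		if reclaimed >= needMB {
			break
		}
		if !extremeKill && c.Priority >= protectedPriority {
			continue // priority-aware skipping
		}
		victims = append(victims, c.ID)
		reclaimed += c.MemoryMB
	}
	return victims
}

func main() {
	cs := []container{{"c1", 5, 1024}, {"c2", 1, 2048}, {"c3", 9, 4096}}
	fmt.Println(pickVictims(cs, 2048, 9, false)) // lowest-priority containers go first
	fmt.Println(pickVictims(cs, 3000, 2, true))  // ExtremeKill ignores priority
}
```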

In mixed‑deployment scenarios Amiya runs as a sidecar inside the NodeManager pod on Yarn‑on‑K8s. The pod’s real resource limits are passed to Amiya via a Unix domain socket; Amiya then computes the expected overcommit amount based on cgroup metrics and reports the adjusted resources back to the NodeManager, enabling dynamic per‑pod overcommit.
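A minimal sketch of that exchange over a Unix domain socket, assuming a simple JSON message shape; the field names, socket path, and fixed overcommit factor are all hypothetical, and in the real component the factor would be derived from cgroup metrics:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// podLimits carries the pod's real resource limits over the socket;
// the JSON shape is an assumption for this sketch.
type podLimits struct {
	MemoryMB int `json:"memory_mb"`
	VCores   int `json:"vcores"`
}

// scale applies an overcommit factor to a single resource dimension.
func scale(v int, factor float64) int {
	return int(float64(v) * factor)
}

// serveOnce handles one request: read the pod's real limits, reply with the
// overcommitted resources that would be reported back to the NodeManager.
func serveOnce(ln net.Listener, factor float64) error {
	conn, err := ln.Accept()
	if err != nil {
		return err
	}
	defer conn.Close()
	var in podLimits
	if err := json.NewDecoder(conn).Decode(&in); err != nil {
		return err
	}
	out := podLimits{MemoryMB: scale(in.MemoryMB, factor), VCores: scale(in.VCores, factor)}
	return json.NewEncoder(conn).Encode(out)
}

func main() {
	sock := filepath.Join(os.TempDir(), "amiya-demo.sock") // hypothetical path
	os.Remove(sock)
	ln, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	done := make(chan struct{})
	go func() { // plays the NodeManager side of the exchange
		defer close(done)
		conn, err := net.Dial("unix", sock)
		if err != nil {
			panic(err)
		}
		defer conn.Close()
		json.NewEncoder(conn).Encode(podLimits{MemoryMB: 8192, VCores: 4})
		var out podLimits
		json.NewDecoder(conn).Decode(&out)
		fmt.Println(out.MemoryMB, out.VCores) // the overcommitted view
	}()

	if err := serveOnce(ln, 1.5); err != nil {
		panic(err)
	}
	<-done
}
```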

The impact is significant: Amiya adds about 683 TB of memory and 137k vCores to the offline cluster, the equivalent of adding roughly 900–1,400 nodes. It raises average per‑node memory usage by 15.6% and CPU usage by 18.6% (up to 22% on the main machine configurations), while keeping eviction rates between 0.56% and 2.73%.

In summary, Amiya now reliably supports both offline and mixed‑cluster Yarn overcommit, and future work includes kernel‑level OOM handling, smarter low‑priority eviction, and a master‑worker architecture to enable global resource profiling and more flexible overcommit policies.

Big Data · Resource Optimization · Resource Scheduling · YARN · Cluster Management · Overcommit
Written by High Availability Architecture

Official account for High Availability Architecture.