Bilibili's YARN Scheduling Optimization Practice: From Heartbeat-Driven to Global Scheduling
Bilibili transformed its YARN CapacityScheduler from a heartbeat-driven design into a multi-threaded global scheduler by separating lock handling from core resource changes, adopting Weighted Round-Robin with DRF, adding batch node selection, fixing Proposal inconsistencies, and tuning GC and logging, reducing application allocation time by about 38% on clusters of up to 8,000 nodes.
This article introduces Bilibili's (B站) practical implementation and optimization of its YARN scheduling system. Bilibili's YARN is built on community version 2.8.4 with the CapacityScheduler, supporting offline, real-time, and AI-training workloads. With business growth, the overall scale has reached approximately 8,000 nodes, with the largest single cluster exceeding 4,000 nodes and handling 200,000 to 300,000 Applications per day.
Performance Evaluation Metrics:
At the Application level, allocation time is defined as the sum of all Container allocation times for an App. The optimization goal is to minimize the ratio of an App's allocation time to its runtime.
At the cluster level, the article defines a "validSchedule" metric based on three variables: pending resources, cluster resource utilization (clusterUsedR), and queue resource utilization (queueUsedR). If resources are pending while cluster utilization is below a threshold (e.g., 90%) and queue utilization is below a threshold (e.g., 95%), the scheduler itself, rather than a lack of capacity, is failing to meet the queue's resource requirements.
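The validSchedule rule above can be expressed as a small predicate. This is an illustrative sketch; the function name and threshold values are assumptions, not Bilibili's actual implementation.

```python
CLUSTER_THRESHOLD = 0.90  # cluster-level utilization threshold (example value)
QUEUE_THRESHOLD = 0.95    # queue-level utilization threshold (example value)

def is_valid_schedule(pending_resources: float,
                      cluster_used_ratio: float,
                      queue_used_ratio: float) -> bool:
    """Return False when the scheduler is falling behind: resources are
    pending while neither the cluster nor the queue is near saturation."""
    if pending_resources <= 0:
        return True  # nothing to schedule, so scheduling keeps up by definition
    if cluster_used_ratio >= CLUSTER_THRESHOLD:
        return True  # cluster is nearly full; pending demand is expected
    if queue_used_ratio >= QUEUE_THRESHOLD:
        return True  # queue has hit its own capacity limit
    return False     # pending demand with spare capacity: scheduler is the bottleneck
```

A False result in this sketch corresponds to the "scheduling performance cannot meet the queue's resource requirements" case described above.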
Core Scheduling Flow Optimization:
The original scheduling process has two issues: the scheduler lock is held during the entire node/queue/App selection process, and multiple expensive object-sorting operations are involved. The optimization separates the lock from the core resource-change operations and streamlines the sorting logic.
WRR Ordering Policy:
For App-level scheduling, Bilibili replaced the pure Strict Priority (SP) policy with a hybrid: priority-5 Apps still use strict SP with FIFO within the priority, while priorities 0-4 use Weighted Round Robin (WRR) with DRF (Dominant Resource Fairness) as the comparator. An App allocation waiting factor was introduced to prevent starvation.
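A minimal sketch of this hybrid ordering policy follows. It simplifies the cross-priority WRR weighting into a single sorted pass; all class names, fields, and the wait-factor value are illustrative assumptions, not the actual CapacityScheduler code.

```python
from dataclasses import dataclass, field
from itertools import count

_submit_seq = count()  # submission order, used for FIFO within priority 5

@dataclass
class App:
    priority: int          # 0-4 are ordered by DRF; 5 is strict priority
    used_mem: float        # memory currently allocated to the app
    used_vcores: float     # vcores currently allocated to the app
    wait_rounds: int = 0   # rounds skipped; feeds the anti-starvation factor
    seq: int = field(default_factory=lambda: next(_submit_seq))

def dominant_share(app: App, cluster_mem: float, cluster_vcores: float) -> float:
    """DRF: an app's share is the max of its per-resource shares."""
    return max(app.used_mem / cluster_mem, app.used_vcores / cluster_vcores)

def order_apps(apps, cluster_mem, cluster_vcores, wait_factor=0.01):
    """Priority-5 apps first, FIFO; the rest ascending by DRF dominant share,
    discounted by how long each app has waited (anti-starvation)."""
    p5 = sorted((a for a in apps if a.priority == 5), key=lambda a: a.seq)
    rest = sorted((a for a in apps if a.priority < 5),
                  key=lambda a: dominant_share(a, cluster_mem, cluster_vcores)
                              - wait_factor * a.wait_rounds)
    return p5 + rest
```

The waiting factor lowers a starved app's effective share each round it is skipped, so it eventually sorts to the front even if its dominant share is high.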
Global Scheduling Implementation:
When the cluster exceeded 4,000 nodes, the default heartbeat-driven scheduler could no longer fully utilize it. Bilibili adopted Global Scheduling in two versions: v1 focuses on multi-threaded scheduling, while v2 adds batch node selection (PlacementSet).
The architecture separates "Proposal generation" and "Proposal consumption" into different threads. AsyncScheduleThread handles concurrent Proposal generation, while resourceCommitterService handles single-threaded consumption.
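The producer/consumer split described above can be sketched with a bounded queue: several generator threads emit candidate Proposals concurrently, and a single committer thread applies them so resource bookkeeping stays serialized. Class and function names here are illustrative stand-ins for AsyncScheduleThread and resourceCommitterService, not YARN's actual types.

```python
import queue
import threading

class Proposal:
    def __init__(self, app_id, node_id, amount):
        self.app_id, self.node_id, self.amount = app_id, node_id, amount

# Bounded backlog between producers and the committer (1,000 in the tuning notes)
proposals = queue.Queue(maxsize=1000)

def async_schedule_thread(requests):
    """Producer: scan nodes/apps and emit candidate Proposals without
    holding a global scheduler lock."""
    for app_id, node_id, amount in requests:
        proposals.put(Proposal(app_id, node_id, amount))

def resource_committer_service(committed, expected):
    """Single consumer: validate each Proposal against authoritative state
    and commit it (validation checks elided in this sketch)."""
    for _ in range(expected):
        p = proposals.get()
        committed.append((p.app_id, p.node_id, p.amount))
        proposals.task_done()

requests = [("app-1", "node-a", 4), ("app-2", "node-b", 8)]
committed = []
producer = threading.Thread(target=async_schedule_thread, args=(requests,))
consumer = threading.Thread(target=resource_committer_service,
                            args=(committed, len(requests)))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Because only one thread commits, concurrent Proposal generation can be scaled out without races on the cluster's resource accounting.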
Problems and Solutions:
1. Batch ERROR log issue: resolved by adding a check in tryCommit.accept() that rejects Proposals whose application has already completed.
2. Resource calculation inconsistency: Proposal generation and consumption used different comparison methods (greaterThanOrEqual vs. fitsIn), causing many Proposal failures. Fixed by aligning both stages on the same resource calculation logic.
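The two fixes above can be sketched together as a commit-time validation step. This is an illustrative Python analogue, not the actual CapacityScheduler code; the key point of fix 2 is that a fitsIn-style check requires every resource dimension to fit, whereas a dominant-resource greaterThanOrEqual comparison can report success even when one dimension does not.

```python
def fits_in(requested: dict, available: dict) -> bool:
    """Per-dimension check in the spirit of YARN's Resources.fitsIn:
    a request fits only if EVERY dimension fits."""
    return all(requested[k] <= available[k] for k in requested)

def try_commit(proposal, running_apps, node_available):
    """Commit-time validation for a Proposal (sketch).

    proposal: (app_id, demand dict); running_apps: set of live app ids;
    node_available: resources still free on the target node."""
    app_id, demand = proposal
    if app_id not in running_apps:
        return False  # fix 1: reject Proposals for already-completed apps
    if not fits_in(demand, node_available):
        return False  # fix 2: same per-dimension check as Proposal generation
    return True
```

Using the identical comparison in both the generation and consumption threads removes the class of Proposals that pass generation but fail at commit.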
Key Optimizations:
1. Log level adjustment from INFO to DEBUG for high-frequency logs
2. Backlogs length set to 1,000 with 4 concurrent threads after tuning
3. Reserve Limit feature to cap a single App's reserved resources
4. Event Queue printing logic moved to a separate thread
5. GC optimization: Switched from G1GC to ParNew + CMS with reconfigured generation ratios
6. Added comprehensive monitoring metrics for Proposal status
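Item 5 above maps to ResourceManager JVM options of roughly the following shape. This is a hypothetical fragment illustrating the switch from G1 to ParNew + CMS with retuned generation ratios; the heap size and every ratio here are assumptions, not Bilibili's published values.

```
-Xms64g -Xmx64g
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:NewRatio=2
-XX:SurvivorRatio=8
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
```

The rationale for such a configuration is that scheduler allocations produce a high rate of short-lived objects, which a larger young generation with ParNew absorbs, while CMS keeps old-generation pauses short on a large heap.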
Results:
Under test conditions of 10,000 nodes running 100 Jobs with 10,000 Containers each, Heartbeat scheduling averaged 3min 15s while Global Scheduling averaged 2min, representing a 38% improvement. The validSchedule metric also showed significant improvement.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.