Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN
Shilong Fei from Xiaomi Data Platform presents an in‑depth exploration of elastic scheduling for Hadoop YARN, covering background, design of resource pools, auto‑scaling architecture, challenges such as job stability and user transparency, achieved cost reductions, and future plans for further optimization.
Speaker: Shilong Fei, Xiaomi Group
Editor: Liu Zhaolei, Zaozhuang College
Platform: DataFunTalk
1. Background
The offline cluster shows low average utilization, with a pronounced daily peak between 0:00 and 12:00 followed by a trough. As Xiaomi's business expands, demand for compute resources keeps growing, making cost-effective resource scheduling essential.
2. Goals
Eliminate idle resources during low‑load periods.
Utilize cheaper compute instances.
Maintain job stability while achieving the above.
3. Elastic Scheduling Overview
Elastic scheduling is built on Hadoop YARN and dynamically expands or shrinks the cluster based on one or more resource pools.
3.1 Hadoop YARN Architecture
YARN follows a master‑worker model: ResourceManager manages resources and scheduling, while each NodeManager runs tasks on a compute node.
3.2 Elastic Resource Pools
Three types of resources are used:
Public‑cloud on‑demand instances (OD)
Public‑cloud spot instances (Spot)
Internal online‑business machines
The ideal strategy is to use idle internal machines first, cover any remaining shortfall with spot instances, and keep on-demand instances as a safety net.
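The priority order above can be sketched as a simple greedy fill across ordered pools. This is an illustrative sketch, not Xiaomi's actual implementation; pool names and capacities are made-up examples.

```python
# Greedy allocation across resource pools in priority order:
# internal idle machines first, then spot, then on-demand.
def allocate(demand, pools):
    """pools: ordered list of (name, available_units); returns (plan, unmet)."""
    plan = {}
    for name, available in pools:
        take = min(demand, available)
        if take > 0:
            plan[name] = take
            demand -= take
        if demand == 0:
            break
    return plan, demand  # any remaining demand is unmet

# Example: 100 units of demand against the three pool types.
plan, unmet = allocate(100, [("internal", 60), ("spot", 30), ("on_demand", 50)])
```

With these example capacities, internal machines absorb 60 units, spot absorbs 30, and only the last 10 fall through to on-demand instances, which matches the "safety net" role described above.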
3.3 Elastic Scheduling Definition
Elastic scheduling dynamically adjusts YARN cluster size according to predefined elasticity strategies and resource‑pool policies.
4. Overall Architecture
The right side shows a standard YARN cluster (ResourceManager + NodeManagers). The left side adds the elastic‑scheduling components.
4.1 AutoScaling Module
AutoScaling monitors cluster metrics, decides when to request or release nodes, and writes nodes to be removed into an exclude file. ResourceManager then marks those nodes as Decommissioning. Once the nodes are fully decommissioned, AutoScaling releases them, ensuring running tasks are not interrupted.
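The release flow above can be sketched as a small polling loop. Function names and the state string are illustrative stand-ins for the exclude-file write, the ResourceManager node-state query, and the cloud provider's release API; this is not the actual AutoScaling code.

```python
import time

def drain_and_release(nodes, write_exclude_file, get_node_state,
                      release_instance, poll_interval=5):
    """Gracefully remove `nodes`: running tasks finish before instance release."""
    write_exclude_file(nodes)              # RM picks this up and marks the
                                           # nodes as DECOMMISSIONING
    pending = set(nodes)
    while pending:
        done = {n for n in pending if get_node_state(n) == "DECOMMISSIONED"}
        for node in done:
            release_instance(node)         # safe: no containers run here anymore
        pending -= done
        if pending:
            time.sleep(poll_interval)
```

The key ordering is that the instance is only returned to the provider after the ResourceManager reports the node fully decommissioned, which is what keeps running tasks uninterrupted.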
Core Components
ClusterMonitor: Monitors ResourceManager status.
ScalingStrategy: Generates scaling decisions (actions and resource amounts).
SpotManager: Receives scaling results and invokes the appropriate Scaler.
Scaler: Executes scaling actions for different resource pools (e.g., AWS, Alibaba Cloud, K8s).
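The four components above form a simple pipeline each evaluation cycle. The wiring below is a hypothetical sketch; class and method names are illustrative, not the actual Xiaomi implementation.

```python
# ClusterMonitor -> ScalingStrategy -> SpotManager -> Scaler pipeline.
class AutoScaling:
    def __init__(self, monitor, strategy, spot_manager):
        self.monitor = monitor            # ClusterMonitor: reads RM status
        self.strategy = strategy          # ScalingStrategy: action + amount
        self.spot_manager = spot_manager  # SpotManager: dispatches to a Scaler

    def tick(self):
        metrics = self.monitor.collect()
        decision = self.strategy.decide(metrics)
        if decision is not None:
            self.spot_manager.execute(decision)
```

Keeping the strategy separate from the SpotManager/Scaler layer is what lets one decision engine drive scalers for different pools (AWS, Alibaba Cloud, K8s).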
5. Elastic Strategies
Two strategies are used:
Static historical load: Uses past load curves to predict required resources at each time point. Simple and stable but cannot handle sudden spikes.
Dynamic demand: Considers pending resources, logical load, and node utilization to decide scaling actions. More flexible but more complex and less proven for stability.
Currently the production cluster uses the static strategy; the dynamic one is under testing.
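The dynamic-demand strategy can be sketched as a threshold rule over pending resources and node utilization. The thresholds below are made-up examples for illustration, not the production values.

```python
def decide(pending_vcores, avg_utilization,
           scale_out_pending=100, scale_in_util=0.30):
    """Return a (action, amount) scaling decision from current cluster metrics."""
    if pending_vcores > scale_out_pending:
        # Backlog of unscheduled work: size the request to the backlog.
        return ("SCALE_OUT", pending_vcores)
    if avg_utilization < scale_in_util and pending_vcores == 0:
        # Idle cluster with no waiting work: hand nodes back.
        return ("SCALE_IN", None)
    return ("HOLD", None)
```

A rule like this reacts to sudden spikes the static historical curve cannot anticipate, at the cost of more moving parts to validate, which is why the text notes it is still under testing.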
6. Challenges
6.1 Job Stability
Introducing elastic nodes can affect long‑running jobs (e.g., Flink, Spark) because nodes may be reclaimed. To isolate resources, YARN Label is used: default nodes get one label, elastic nodes another. Jobs can then request specific labels.
A label expression such as ||spot||od lets a job accept resources from any of the default, spot, or on-demand partitions, restoring scheduling flexibility without giving up the isolation that labels provide.
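Evaluating an OR-style expression like ||spot||od reduces to checking a node's label against the expression's terms, where a leading empty term stands for the default partition. Note that stock YARN takes a single label per request, so OR semantics like this is an extension; the parsing below is an illustrative sketch, not the actual scheduler code.

```python
def matches(expression, node_label):
    """True if node_label ("" = default partition) satisfies the OR-expression."""
    accepted = expression.split("||")   # "||spot||od" -> ["", "spot", "od"]
    return node_label in accepted

assert matches("||spot||od", "")        # default-partition node is accepted
assert matches("||spot||od", "spot")    # spot node is accepted
assert not matches("||spot||od", "gpu") # unrelated partition is rejected
```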
6.2 Node Smooth Decommission
When a node is to be removed, it is first marked Decommissioning and allowed to finish all running tasks before actual shutdown, greatly improving job stability.
6.3 Spark Specific Optimizations
To avoid data loss when nodes disappear, Spark Remote Shuffle Service stores shuffle data externally. Additionally, Spark jobs are made aware of node status so failed tasks on decommissioned nodes are not counted as true failures.
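The failure-accounting change described above amounts to filtering task failures by node status before they count against a job's failure limit. The sketch below uses illustrative names, not Spark's actual internals.

```python
def count_real_failures(task_failures, decommissioning_nodes):
    """task_failures: list of (task_id, node); return failures that should
    count toward the job's failure limit, excluding those caused by nodes
    being reclaimed by elastic scheduling."""
    return [(task, node) for task, node in task_failures
            if node not in decommissioning_nodes]

# A failure on a draining spot node is ignored; one on a stable node counts.
real = count_real_failures([("t1", "spot-3"), ("t2", "core-1")], {"spot-3"})
```

Without this distinction, routine spot reclamation could push long jobs over their retry limits even though nothing is wrong with the job itself.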
7. User Transparency
Elastic scheduling is invisible to users. Queues are automatically split into default (e.g., 60 units) and elastic (e.g., 40 units) resources based on historical usage, and jobs are automatically labeled or excluded without manual configuration.
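One way to derive such a split from history is to guarantee the load that is always present and serve the rest elastically. The rule below is an assumption for illustration only; the text does not specify the actual splitting formula, and the floor fraction is a made-up parameter chosen to reproduce the 60/40 example.

```python
def split_queue(total, hourly_usage, floor_fraction=0.6):
    """Return (default_units, elastic_units) for a queue of `total` units,
    given a history of hourly usage samples."""
    steady = min(hourly_usage)                       # load present at all hours
    default = max(int(steady), int(total * floor_fraction))
    default = min(default, total)                    # never exceed the queue
    return default, total - default

# Example: a 100-unit queue whose usage dips to 40 at the trough.
split_queue(100, [55, 70, 90, 40])
```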
8. Results (Elastic Outcomes)
After applying elastic scheduling, the usable resource line (green) rises above the static baseline (blue), indicating that idle resources have been reclaimed. Using on‑demand instances saves ~12% of cost, while spot instances can achieve up to 20% savings. The current plan is to shift more workload to spot instances to reach the 20% target.
9. Future Plans
Planned improvements include tighter integration of AutoScaling with ResourceManager, more precise elasticity strategies, better node‑down selection, leveraging idle online resources for batch workloads, and addressing cross‑region bandwidth challenges when moving to public cloud.
10. Contact
Email: [email protected]
Application title: Name-Xiaomi-BigData-Distributed-Scheduling-Application
Thank you for reading.