
Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

Shilong Fei from Xiaomi Data Platform presents an in‑depth exploration of elastic scheduling for Hadoop YARN, covering background, design of resource pools, auto‑scaling architecture, challenges such as job stability and user transparency, achieved cost reductions, and future plans for further optimization.

DataFunSummit
Speaker: Shilong Fei, Xiaomi Group
Editor: Liu Zhaolei, Zaozhuang College
Platform: DataFunTalk

1. Background

The offline cluster shows low average utilization, with a clear daily peak between midnight and noon (0:00–12:00) followed by a trough. As Xiaomi’s business expands, demand for compute resources grows, making cost‑effective resource scheduling essential.

2. Goals

Eliminate idle resources during low‑load periods.

Utilize cheaper compute instances.

Maintain job stability while achieving the above.

3. Elastic Scheduling Overview

Elastic scheduling is built on Hadoop YARN and dynamically expands or shrinks the cluster based on one or more resource pools.

3.1 Hadoop YARN Architecture

YARN follows a master‑worker model: ResourceManager manages resources and scheduling, while each NodeManager runs tasks on a compute node.

3.2 Elastic Resource Pools

Three types of resources are used:

Public‑cloud on‑demand instances (OD)

Public‑cloud spot instances (Spot)

Internal online‑business machines

The ideal strategy is to use all idle internal machines first, supplement any shortage with spot instances, and keep on‑demand instances as a safety net.
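This priority-fill policy can be sketched as a small allocation routine. The pool names and capacities below are illustrative, not Xiaomi's actual configuration:

```python
# Fill demand from pools in priority order: internal idle machines first,
# then spot, with on-demand as the safety net.

def allocate(demand, pools):
    """pools: list of (name, available) in priority order.
    Returns (plan, shortfall), where plan maps pool name -> amount taken."""
    plan = {}
    remaining = demand
    for name, available in pools:
        take = min(remaining, available)
        if take > 0:
            plan[name] = take
            remaining -= take
        if remaining == 0:
            break
    return plan, remaining

plan, shortfall = allocate(100, [("internal", 60), ("spot", 30), ("od", 50)])
# internal covers 60 units, spot 30, and on-demand the final 10
```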

3.3 Elastic Scheduling Definition

Elastic scheduling dynamically adjusts YARN cluster size according to predefined elasticity strategies and resource‑pool policies.

4. Overall Architecture

In the overall architecture diagram, the right side shows a standard YARN cluster (ResourceManager + NodeManagers), while the left side adds the elastic‑scheduling components.

4.1 AutoScaling Module

AutoScaling monitors cluster metrics, decides when to request or release nodes, and writes nodes to be removed into an exclude file. ResourceManager then marks those nodes as Decommissioning. Once the nodes are fully decommissioned, AutoScaling releases them, ensuring running tasks are not interrupted.
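The handoff can be modeled as a small state machine: a node marked for removal enters Decommissioning, keeps its running containers, and only becomes Decommissioned (safe to release) once they all finish. The state names mirror YARN's; the Node class itself is an illustrative toy, not Xiaomi's implementation:

```python
class Node:
    def __init__(self, name, running_containers):
        self.name = name
        self.running = running_containers
        self.state = "RUNNING"

    def mark_decommissioning(self):
        # AutoScaling wrote this node into the exclude file; the RM
        # flips it to DECOMMISSIONING but lets existing work finish.
        self.state = "DECOMMISSIONING"

    def container_finished(self):
        self.running -= 1
        if self.state == "DECOMMISSIONING" and self.running == 0:
            # Now safe for AutoScaling to actually release the machine.
            self.state = "DECOMMISSIONED"

node = Node("elastic-node-1", running_containers=2)
node.mark_decommissioning()
node.container_finished()   # still DECOMMISSIONING: one container left
node.container_finished()   # all work drained; node fully decommissioned
```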

Core Components

ClusterMonitor: Monitors ResourceManager status.

ScalingStrategy: Generates scaling decisions (actions and resource amounts).

SpotManager: Receives scaling results and invokes the appropriate Scaler.

Scaler: Executes scaling actions for different resource pools (e.g., AWS, Alibaba Cloud, K8s).
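A minimal sketch of how these four components fit together, assuming made-up interfaces (the actual Xiaomi class signatures are not public):

```python
class ClusterMonitor:
    """Polls ResourceManager metrics; here a fixed snapshot for illustration."""
    def __init__(self, metrics):
        self.metrics = metrics          # e.g. {"pending_mb": ..., "idle_nodes": ...}
    def snapshot(self):
        return self.metrics

class ScalingStrategy:
    """Turns a metrics snapshot into a (action, node count) decision."""
    def decide(self, snapshot):
        if snapshot["pending_mb"] > 0:                  # unmet demand -> grow
            return ("scale_out", snapshot["pending_mb"] // 8192 + 1)
        if snapshot["idle_nodes"] > 0:                  # idle capacity -> shrink
            return ("scale_in", snapshot["idle_nodes"])
        return ("noop", 0)

class Scaler:
    """Per-pool executor (AWS, Alibaba Cloud, K8s, ...); here just records calls."""
    def __init__(self, pool):
        self.pool = pool
        self.calls = []
    def apply(self, action, count):
        self.calls.append((self.pool, action, count))

class SpotManager:
    """Routes a scaling decision to the appropriate Scaler."""
    def __init__(self, scalers):
        self.scalers = scalers          # priority-ordered list of Scalers
    def dispatch(self, action, count):
        self.scalers[0].apply(action, count)   # simplest policy: first pool

monitor = ClusterMonitor({"pending_mb": 20000, "idle_nodes": 0})
action, count = ScalingStrategy().decide(monitor.snapshot())
scaler = Scaler("aws")
SpotManager([scaler]).dispatch(action, count)
```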

5. Elastic Strategies

Two strategies are used:

Static historical load: Uses past load curves to predict required resources at each time point. Simple and stable but cannot handle sudden spikes.

Dynamic demand: Considers pending resources, logical load, and node utilization to decide scaling actions. More flexible but more complex and less proven for stability.

Currently the production cluster uses the static strategy; the dynamic one is under testing.
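The two strategies can be contrasted in a few lines. The hourly curve, node memory size, and utilization threshold below are made-up illustrative numbers:

```python
# Static historical load: a precomputed hour -> node-count curve.
HISTORICAL_NODES = {0: 120, 6: 100, 12: 40, 18: 60}

def static_target(hour):
    """Look up the curve bucket covering `hour` (simple and stable,
    but blind to sudden spikes)."""
    key = max(h for h in HISTORICAL_NODES if h <= hour)
    return HISTORICAL_NODES[key]

# Dynamic demand: react to pending resources and node utilization.
def dynamic_target(current_nodes, pending_mb, avg_utilization,
                   mb_per_node=65536, low_util=0.3):
    if pending_mb > 0:                        # demand outstrips supply: grow
        return current_nodes + -(-pending_mb // mb_per_node)   # ceil division
    if avg_utilization < low_util:            # cluster mostly idle: shrink
        return max(1, int(current_nodes * avg_utilization / low_util))
    return current_nodes

static_target(14)                 # mid-day trough bucket
dynamic_target(50, 100000, 0.9)  # pending work forces scale-out
```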

6. Challenges

6.1 Job Stability

Introducing elastic nodes can affect long‑running jobs (e.g., Flink, Spark) because nodes may be reclaimed. To isolate resources, YARN Label is used: default nodes get one label, elastic nodes another. Jobs can then request specific labels.

Label expressions such as ||spot||od, where the empty first token denotes the default (unlabeled) partition, allow a job to accept default, spot, or on‑demand resources, solving the isolation problem.
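The "||"-separated expression is an extension over vanilla YARN, which accepts only a single label per request. A sketch of how such an expression could be matched against a node's label (the matching logic is an assumption for illustration):

```python
def accepted_partitions(expression):
    """Split '||spot||od' into {'', 'spot', 'od'};
    the empty token stands for the default partition."""
    return set(expression.split("||"))

def node_matches(node_label, expression):
    return node_label in accepted_partitions(expression)

node_matches("", "||spot||od")     # default node: accepted
node_matches("spot", "||spot||od") # spot node: accepted
node_matches("gpu", "||spot||od")  # unrelated label: rejected
```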

6.2 Node Smooth Decommission

When a node is to be removed, it is first marked Decommissioning and allowed to finish all running tasks before actual shutdown, greatly improving job stability.

6.3 Spark‑Specific Optimizations

To avoid data loss when nodes disappear, Spark Remote Shuffle Service stores shuffle data externally. Additionally, Spark jobs are made aware of node status so failed tasks on decommissioned nodes are not counted as true failures.

7. User Transparency

Elastic scheduling is invisible to users. Queues are automatically split into default (e.g., 60 units) and elastic (e.g., 40 units) resources based on historical usage, and jobs are automatically labeled or excluded without manual configuration.
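The automatic split can be sketched as follows. The 60/40 figures echo the example above; the splitting rule itself (guaranteed portion derived from historical usage) is an assumption:

```python
def split_queue(total_capacity, historical_guaranteed_ratio):
    """Keep the historically guaranteed portion on default nodes,
    move the remainder to the elastic partition."""
    default = round(total_capacity * historical_guaranteed_ratio)
    elastic = total_capacity - default
    return {"default": default, "elastic": elastic}

split_queue(100, 0.6)   # {"default": 60, "elastic": 40}
```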

8. Results (Elastic Outcomes)

After applying elastic scheduling, the usable resource line (green) rises above the static baseline (blue), indicating that idle resources have been reclaimed. Using on‑demand instances saves ~12% of cost, while spot instances can achieve up to 20% savings. The current plan is to shift more workload to spot instances to reach the 20% target.

9. Future Plans

Planned improvements include tighter integration of AutoScaling with ResourceManager, more precise elasticity strategies, better node‑down selection, leveraging idle online resources for batch workloads, and addressing cross‑region bandwidth challenges when moving to public cloud.

10. Contact

Email: [email protected]
Application title: Name‑Xiaomi‑BigData‑Distributed‑Scheduling‑Application

Thank you for reading.

Tags: Big Data, Resource Management, Cost Optimization, Auto Scaling, YARN, Hadoop, Elastic Scheduling
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
