
Exploring and Implementing Elastic Scheduling for Xiaomi Hadoop YARN

This talk presents Xiaomi's design and deployment of an elastic scheduling system for Hadoop YARN, covering background analysis, resource‑pool strategy, auto‑scaling architecture, stability challenges, label‑based resource isolation, Spark shuffle handling, cost‑saving results and future plans.

DataFunTalk

Speaker Shi Longfei from Xiaomi Data Platform introduces the motivation behind elastic scheduling for their offline Hadoop YARN clusters, noting low average utilization and the need for cost‑effective scaling as business workloads grow.

The proposed solution defines three resource pools: internal online machines (zero cost), public‑cloud on‑demand instances, and cheaper spot instances. An optimal strategy uses internal resources first, supplements gaps with spot instances, and falls back to on‑demand instances when spot capacity is unavailable.
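The priority order across the three pools can be sketched as a simple fill-in-order allocation. This is an illustrative sketch only; the `Pool` class, field names, and cost figures are assumptions, not Xiaomi's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str           # "internal", "spot", or "od" (on-demand)
    available: int      # containers this pool can currently supply
    cost_per_unit: float

def plan_allocation(demand: int, internal: Pool, spot: Pool, od: Pool) -> dict:
    """Fill demand from internal resources first, then spot, then on-demand."""
    plan = {}
    # Cheapest-first priority order, per the strategy described above.
    for pool in (internal, spot, od):
        take = min(demand, pool.available)
        if take > 0:
            plan[pool.name] = take
            demand -= take
        if demand == 0:
            break
    return plan
```

For example, a demand of 18 containers against 10 internal and 5 spot containers would spill the remaining 3 onto on-demand instances, matching the fallback behavior described above.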

Elastic scheduling is defined as dynamic YARN cluster expansion and contraction based on one or more resource pools and a set of elasticity policies. The architecture adds an AutoScaling module that monitors the ResourceManager, decides scaling actions via a ScalingStrategy, and executes them through SpotManager and Scaler adapters for various public clouds or Kubernetes.
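The decision step the talk attributes to the ScalingStrategy can be sketched as comparing pending versus available resources reported by the ResourceManager. The threshold values, memory units, and function signature below are assumptions for illustration, not the actual policy.

```python
def scaling_decision(pending_mb: int, available_mb: int,
                     node_mb: int = 32_768,
                     scale_in_idle_mb: int = 65_536) -> int:
    """Return a node delta: >0 scale out, <0 scale in, 0 hold steady."""
    if pending_mb > available_mb:
        # Shortfall: add enough nodes to cover it (ceiling division).
        shortfall = pending_mb - available_mb
        return -(-shortfall // node_mb)
    if available_mb - pending_mb > scale_in_idle_mb:
        # Sustained surplus: release one node per cycle for a smooth shrink.
        return -1
    return 0
```

The AutoScaling module would run a decision like this periodically and hand the resulting delta to the SpotManager or Scaler adapter for execution.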

Key challenges include maintaining job stability and providing a transparent experience for users. Stability is addressed by combining YARN labels with elastic resources to isolate default and elastic nodes, introducing label expressions (e.g., "||spot||od") to allow flexible resource selection, and implementing smooth node decommissioning that waits for running tasks to finish before releasing nodes.
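Based on the example expression quoted above, a label expression appears to list acceptable partitions separated by "||", with an empty term denoting the default (unlabeled) partition. A minimal sketch of that matching semantics, assuming this reading is correct:

```python
def eligible_labels(expression: str) -> set:
    """Expand an "or"-style label expression into the set of acceptable labels."""
    # An empty term ("") stands for the default, unlabeled partition.
    return {term.strip() for term in expression.split("||")}

def node_matches(node_label: str, expression: str) -> bool:
    """Check whether a node's label satisfies the expression."""
    return node_label in eligible_labels(expression)
```

Under this reading, "||spot||od" accepts default, spot, and on-demand nodes alike, which is what lets a job transparently spill onto elastic resources.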

Additional optimizations target Spark jobs: a Remote Shuffle Service preserves shuffle data across node removals, and task failure handling ignores node‑failure‑induced task failures to prevent job aborts.
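The failure-handling idea can be sketched as excluding node-loss failures from the count that triggers a job abort, so losing a spot node only causes a retry. The reason strings and the failure threshold below are illustrative assumptions, not Spark's actual constants.

```python
# Illustrative reason codes for failures caused by elastic node removal.
NODE_LOSS_REASONS = {"NODE_DECOMMISSIONED", "NODE_LOST", "EXECUTOR_PREEMPTED"}

def should_abort(reason: str, prior_failures: int, max_failures: int = 4) -> bool:
    """Decide whether a task failure should abort the job."""
    if reason in NODE_LOSS_REASONS:
        # Node-removal failures are retried without penalizing the job.
        return False
    return prior_failures + 1 >= max_failures
```

Combined with a Remote Shuffle Service that keeps shuffle data off the removed node, this is what allows elastic nodes to come and go without failing Spark jobs.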

Results show a reduction of idle resources, achieving 12% cost savings with on‑demand instances (25% of total capacity) and up to 20% when fully switching to spot instances. Future work focuses on tighter integration with ResourceManager, refined elasticity policies, better utilization of online resources, and addressing cross‑region bandwidth constraints.

Tags: big data, Resource Management, autoscaling, YARN, Hadoop, elastic scheduling
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
