
Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal practices of Hadoop YARN, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

DataFunSummit

Speaker: Tu Yu, Senior Software Development Engineer at Xiaomi; Editor: Yang Qi, Shaanxi University of Science and Technology; Platform: DataFunTalk.

Introduction – Big data technologies, especially Hadoop YARN, are mature across industries. The article presents Xiaomi's internal experience with Hadoop YARN.

1. Scheduling Optimization Practices

Xiaomi operates more than 20 clusters; the largest has more than 6000 nodes and 1000 queues. Three issues were identified: scheduling stalls caused by node resource updates locking queues, Global Scheduler thread crashes caused by TimSort comparison-contract violations (queue data changing while it is being sorted), and ResourceLimit miscalculations.

Solutions:

Asynchronous batch updates for node resource reports to avoid lock contention.

Replace TimSort with the legacy merge sort via the JVM flag -Djava.util.Arrays.useLegacyMergeSort=true, or deep-copy queue data before sorting so that the values being compared cannot change mid-sort.

Use Resources.componentwiseMin to compute ResourceLimit, preventing invalid resource requests.
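The last two fixes can be sketched in a few lines. This is only an illustration in Python, not YARN code: the real scheduler is Java, Resources.componentwiseMin is the only name taken from the talk, and the Resource and Queue classes and both function names below are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical two-dimensional resource vector."""
    memory_mb: int
    vcores: int


def componentwise_min(a: Resource, b: Resource) -> Resource:
    # Take the minimum per dimension, so the resulting limit never
    # exceeds either input in memory or in vcores.
    return Resource(min(a.memory_mb, b.memory_mb), min(a.vcores, b.vcores))


@dataclass
class Queue:
    """Hypothetical queue whose usage may be updated by other threads."""
    name: str
    used_memory_mb: int


def sorted_by_usage_snapshot(queues):
    # Copy the sort keys once, then sort the copy. The comparison
    # results stay consistent even if the live queues keep changing.
    snapshot = [(q.used_memory_mb, q.name) for q in queues]
    snapshot.sort()
    return [name for _, name in snapshot]
```

Sorting a snapshot trades a small copy per scheduling pass for an ordering that cannot become inconsistent while the sort is running; the ordering may be slightly stale, which a scheduler can usually tolerate.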

2. Performance Optimization

Skip user‑limit calculations when not needed, and allocate multiple containers per scheduling round to increase throughput.
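As a rough illustration of allocating several containers in one scheduling round, the sketch below greedily places pending requests on a node until a per-round cap or the node's free memory is reached. The function name, its parameters, and the greedy first-fit policy are assumptions made for this example, not Xiaomi's implementation.

```python
def assign_containers(node_free_mb, pending_requests_mb, max_per_round):
    """Hand out up to max_per_round containers that fit on one node.

    Hypothetical helper: requests are plain memory sizes in MB and are
    considered in the order given.
    """
    assigned = []
    remaining = node_free_mb
    for request_mb in pending_requests_mb:
        if len(assigned) >= max_per_round:
            break  # per-round cap reached
        if request_mb <= remaining:
            assigned.append(request_mb)
            remaining -= request_mb
    return assigned, remaining
```

With max_per_round set to 1 this degenerates to one container per round; raising the cap lets a single pass over the node fill it, which is where the throughput gain comes from.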

3. Resource Optimization Practices

Elastic Scheduling – During off-peak periods, elastic nodes are added to the cluster; jobs are filtered by runtime and priority so that only suitable ones run on elastic resources, without affecting stability. Nodes are gracefully decommissioned before shutdown.
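A job filter of this kind might look like the sketch below. The talk does not give the actual thresholds or the direction of the rule; this example assumes that short, lower-priority jobs are the ones allowed onto elastic nodes, and the Job fields and the eligible_for_elastic function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job descriptor."""
    name: str
    expected_runtime_min: int
    priority: int  # assumption: larger value means more important


def eligible_for_elastic(job, max_runtime_min, max_priority):
    # Assumed rule: only jobs that finish quickly and are not high
    # priority may run on nodes that can be reclaimed.
    return (job.expected_runtime_min <= max_runtime_min
            and job.priority <= max_priority)
```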

Node Resource Over-commit – Nodes report more memory than they physically have (physical memory plus an over-commit allowance). Stability is ensured through cgroup settings (memory.limit_in_bytes, and oom_kill_disable in memory.oom_control) and an Elastic Memory Controller that freezes the process tree or kills low-priority containers when over-use is detected.
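The over-commit arithmetic and the choice of containers to kill can be sketched as follows. This is an assumption-laden illustration: the Container fields, the function names, and the rule "lowest priority first, largest usage first within a priority" are hypothetical, and the sketch does not model the freeze step or the cgroup interaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Hypothetical running container."""
    container_id: str
    priority: int  # assumption: smaller value means less important
    used_mb: int


def reported_memory_mb(physical_mb, overuse_mb):
    # The node advertises physical memory plus an over-commit allowance.
    return physical_mb + overuse_mb


def pick_victims(containers, physical_mb):
    # Kill the least important containers until actual usage fits in
    # physical memory again.
    total = sum(c.used_mb for c in containers)
    victims = []
    for c in sorted(containers, key=lambda c: (c.priority, -c.used_mb)):
        if total <= physical_mb:
            break
        victims.append(c.container_id)
        total -= c.used_mb
    return victims
```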

Baseline Task Execution Optimization – Baseline jobs are prioritized using a fair + priority comparator, allowing high‑priority tasks to acquire resources faster while maintaining fairness.
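The talk does not spell out how fairness and priority are combined. One possible reading, shown here purely as an assumption, is a sort key that first serves applications still below their fair share, then prefers higher priority among them, and finally breaks ties by how far each application is from its share. The App fields and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    """Hypothetical application; fair_share_mb is assumed positive."""
    name: str
    priority: int  # assumption: larger value means more important
    used_mb: int
    fair_share_mb: int


def scheduling_key(app):
    usage_ratio = app.used_mb / app.fair_share_mb
    # 1) apps under their fair share come first (False sorts before True)
    # 2) then higher priority
    # 3) then the app furthest below its share
    return (usage_ratio >= 1.0, -app.priority, usage_ratio)


def schedule_order(apps):
    return [a.name for a in sorted(apps, key=scheduling_key)]
```

Under this key a high-priority baseline job jumps ahead of ordinary jobs only while it is within its fair share, so it cannot starve the rest of the queue.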

4. YARN Federation and Other Optimizations

To improve scalability, Xiaomi adopts YARN Federation, with a resource management service for rule configuration. Issues such as AM Invalid Token errors are solved by conditionally replacing the token, controlled through environment variables.

Additional optimizations include persisting finished application state to MySQL to reduce RM memory pressure, extending the YARN UI with thread-dump links, adding container hook checks, and deferring log reporting until the job finishes.

5. YARN Metadata Warehouse

The metadata warehouse aggregates data on application events, resource usage, and trends. The data is stored in Iceberg tables and processed with Flink SQL, and it feeds dashboards for capacity planning and cost analysis.

6. Future Planning

Plans involve mixed online/offline workloads, dynamic over‑commit, and unified job and resource scheduling to further boost cluster efficiency.

Conclusion – Xiaomi’s internal YARN enhancements address performance, scalability, and resource utilization, delivering cost savings and higher reliability.
