Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

This article analyzes Meituan's Hadoop YARN fair scheduler: its architecture, resource abstractions, scheduling workflow, performance bottlenecks, and fine-grained metrics. It then describes a series of optimizations (sorting improvements, job-skip reduction, parallel queue sorting, and robust rollout strategies) that deliver high-throughput, low-latency scheduling for large-scale offline, streaming, and machine-learning workloads.

Big Data Technology & Architecture

Background

YARN is Hadoop's resource management system responsible for managing compute resources and job scheduling across a cluster. Meituan's YARN is based on community version 2.7.1 and supports offline, real‑time, and machine‑learning workloads. Offline jobs mainly run Hive on MapReduce and Spark SQL; real‑time jobs run Spark Streaming and Flink; machine‑learning jobs run TensorFlow, MXNet, and Meituan's own MLX system.

YARN faces challenges in high availability, scalability, and stability, with scheduler performance a particular concern as cluster and workload sizes grow. On a 1,000-node cluster with 100 CPU cores per node, the scheduler can dispatch only 50,000 tasks per minute, leaving 50 % of CPU resources idle during peak periods.

If the cluster scaled to 5,000 nodes with no scheduler optimization, resource utilization would drop to 10 %, which highlights the need for systematic performance improvements.
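The utilization figures above follow from simple arithmetic. The worked example below assumes, which the text does not state explicitly, that each task occupies one CPU core and runs for about one minute, so that the number of tasks dispatched per minute roughly equals the number of cores kept busy.

```latex
\begin{align*}
\text{utilization} &\approx
  \frac{\text{tasks dispatched per minute} \times \text{CPU-minutes per task}}
       {\text{nodes} \times \text{CPU cores per node}} \\[4pt]
\text{1,000 nodes:}\quad &
  \frac{50{,}000 \times 1}{1{,}000 \times 100} = 0.5 = 50\,\% \\[4pt]
\text{5,000 nodes:}\quad &
  \frac{50{,}000 \times 1}{5{,}000 \times 100} = 0.1 = 10\,\%
\end{align*}
```

Under that assumption the scheduler's dispatch rate, not the hardware, caps the number of cores that can be kept busy.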

Overall Architecture

YARN Architecture

YARN schedules resources for jobs: it locates suitable resources, launches tasks, and manages job lifecycles. For the detailed design, see the official Hadoop documentation.

Resource Abstraction

YARN abstracts resources in two dimensions: CPU and memory.

class Resource {
  int cpu;   // number of CPU cores
  int memory_mb; // memory in MB
}

Job requests are expressed as List[ResourceRequest] and YARN responds with List[Container].

class ResourceRequest {
  int numContainers; // number of containers needed
  Resource capability; // resources per container
}
class Container {
  ContainerId containerId; // globally unique ID
  Resource capability; // resources of this container
  String nodeHttpAddress; // NodeManager hostname where container runs
}

YARN Scheduling Architecture

Key Components

ResourceScheduler – allocates containers.

AsyncDispatcher – single‑threaded event dispatcher.

ResourceTrackerService – processes NodeManager heartbeats.

ApplicationMasterService – processes job heartbeats.

AppMaster – interacts with YARN to request/release resources.

Scheduling Process

AppMaster sends List[ResourceRequest] via heartbeat and receives already allocated List[Container].

NodeManager heartbeat triggers the scheduler to allocate containers for that node.

YARN uses pluggable schedulers; Meituan adopts the FairScheduler.

Fair Scheduler

Job Organization

Queues are organized as a tree, and jobs (Apps) attach to the leaf queues, forming the leaf nodes of the hierarchy.

Core Scheduling Flow

Scheduler locks the FairScheduler object to avoid data‑structure conflicts.

Scheduler selects a node, traverses the queue tree from ROOT, applying the fair policy at each level, finally picking an App and allocating a suitable container.

For each queue level, the scheduler performs a pre‑check (quota), sorts child queues/apps by the fair policy, and recurses.

Example path: ROOT → ParentQueueA → LeafQueueA1 → App11, where a container is allocated to App11 on the selected node.
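To make the "sort by the fair policy" step more concrete, here is a minimal sketch of one possible ordering. It assumes a deliberately simplified policy in which the child with the lowest usage-to-weight ratio is served first; the real FairScheduler policies take more factors into account. The class names `Schedulable` and `FairOrdering` and their fields are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified stand-in for a child queue or app in the tree.
class Schedulable {
    final String name;
    final double weight;      // configured share weight, assumed to be > 0
    final long usedMemoryMb;  // memory currently in use

    Schedulable(String name, double weight, long usedMemoryMb) {
        this.name = name;
        this.weight = weight;
        this.usedMemoryMb = usedMemoryMb;
    }

    // Usage normalized by weight; a lower value means "further below its share".
    double usageToWeight() {
        return usedMemoryMb / weight;
    }
}

class FairOrdering {
    // Assumed simplified policy: smallest usage-to-weight ratio goes first.
    static final Comparator<Schedulable> BY_FAIRNESS =
            Comparator.comparingDouble(Schedulable::usageToWeight);

    // Returns a sorted copy so the caller's list is left untouched.
    static List<Schedulable> sorted(List<Schedulable> children) {
        List<Schedulable> copy = new ArrayList<>(children);
        copy.sort(BY_FAIRNESS);
        return copy;
    }
}
```

With this ordering, a queue holding 1,000 MB at weight 2 is served before a queue holding 800 MB at weight 1, because its normalized usage (500) is lower.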

Pseudo‑code

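// Pseudo-code: Resource.NONE is a placeholder meaning "nothing was allocated".
class FairScheduler {
  // A single lock guards the scheduler, so only one node is scheduled at a time.
  synchronized Resource attemptScheduling(NodeId node) {
    return root.assignContainer(node);
  }
}

class Queue {
  Resource assignContainer(NodeId node) {
    if (!preCheck(node)) return Resource.NONE; // pre-check (e.g. quota) failed
    if (this.isParent) {
      sort(this.children); // order child queues by the fair policy
      for (Queue q : this.children) {
        Resource assigned = q.assignContainer(node); // recurse
        if (assigned != Resource.NONE) return assigned; // stop at first success
      }
    } else {
      sort(this.runnableApps); // order apps by the fair policy
      for (App app : this.runnableApps) {
        Resource assigned = app.assignContainer(node);
        if (assigned != Resource.NONE) return assigned; // stop at first success
      }
    }
    return Resource.NONE; // nothing under this queue could use the node
  }
}

class App {
  Resource assignContainer(NodeId node) {
    // allocation logic: return the allocated Resource, or Resource.NONE
  }
}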

Performance Evaluation

The fair scheduler exhibits performance issues under large‑scale pressure. Traditional online metrics (QPS, TP99) are unsuitable because scheduler RPC latency is decoupled from actual scheduling latency. Instead, we define:

Valid Schedule (validSchedule): indicates whether the scheduler meets resource demand.

validSchedulePerMin: 1 if the scheduler meets demand in a given minute, otherwise 0.

CPS (Containers per Second): number of containers scheduled per second.

Production thresholds: validSchedulePerMin > 0.9 and validSchedulePerDay > 0.99.
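The sketch below shows how such indicators could be aggregated. It rests on an assumption the text does not spell out: that a figure such as validSchedulePerDay is the share of minutes in the period whose per-minute flag is 1. The class and method names are hypothetical, and the criterion that decides each per-minute flag is taken as given input.

```java
// Hypothetical helper for aggregating the scheduling indicators.
class ScheduleMetrics {

    // Share of minutes flagged valid; each flag is expected to be 1 or 0.
    // Assumption: a per-day (or per-hour) figure is this share over the period.
    static double validFraction(int[] perMinuteFlags) {
        if (perMinuteFlags.length == 0) {
            return 0.0;
        }
        int valid = 0;
        for (int flag : perMinuteFlags) {
            if (flag != 0) {
                valid++;
            }
        }
        return (double) valid / perMinuteFlags.length;
    }

    // CPS: containers scheduled per second over a measured interval.
    static double containersPerSecond(long containersScheduled, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return containersScheduled * 1000.0 / elapsedMillis;
    }
}
```

Under this reading, a day with 10 invalid minutes out of 1,440 yields a daily figure of roughly 0.993, which would clear a 0.99 threshold.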

Fine‑Grained Monitoring Metrics

We instrument the FairScheduler to record, for each minute, the time spent in critical functions: parent-queue pre-check and sorting, child-queue pre-check and sorting, resource allocation, and skipping of idle jobs.
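One way to collect this kind of per-phase timing is sketched below. It is not Meituan's instrumentation; the `PhaseTimer` class and the phase names passed to it are hypothetical. Each timed call adds its elapsed nanoseconds to a counter for its phase, and a reporter drains the counters once per minute.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical accumulator for time spent in named scheduler phases.
class PhaseTimer {
    private final Map<String, LongAdder> nanosByPhase = new ConcurrentHashMap<>();

    // Runs the body, adds its elapsed time to the phase counter, returns its result.
    <T> T time(String phase, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            nanosByPhase.computeIfAbsent(phase, k -> new LongAdder())
                        .add(System.nanoTime() - start);
        }
    }

    // Returns the accumulated nanoseconds for a phase and resets it to zero.
    // Intended to be called by a reporter once per minute.
    long drainNanos(String phase) {
        LongAdder adder = nanosByPhase.get(phase);
        return adder == null ? 0L : adder.sumThenReset();
    }
}
```

A scheduler could wrap a call as `timer.time("parentQueueSort", () -> sortChildren(queue))`, where `sortChildren` stands for whatever function is being measured. Comparing the drained totals per minute shows which phase dominates.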

Key Optimization Points

Optimizing Sorting Comparison Function

Of the roughly 50 seconds per minute that the scheduler spent on scheduling, about 30 seconds went to sorting. The original comparison function recomputed resourceUsage on every call by recursively summing all descendant jobs, which caused massive redundant work.

We switched to an incremental approach: whenever a container is allocated or released, we add or subtract its resources from the resourceUsage of each ancestor queue. The comparison function then reads a stored value in O(1) instead of re-summing every descendant job.
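The incremental bookkeeping can be sketched as follows. This is a minimal illustration under assumed names (`UsageQueue`, `onContainerAllocated`, `onContainerReleased`), not the actual FairScheduler code: each allocation or release walks from the affected queue up to the root and adjusts the stored counters, so reading a queue's usage never requires a subtree traversal.

```java
// Hypothetical queue node that keeps its resource usage up to date incrementally.
class UsageQueue {
    final String name;
    final UsageQueue parent; // null for the root queue
    long usedMemoryMb;       // stored usage, read in O(1) by the comparator
    int usedCpu;

    UsageQueue(String name, UsageQueue parent) {
        this.name = name;
        this.parent = parent;
    }

    // Called when a container is allocated under this queue.
    void onContainerAllocated(long memoryMb, int cpu) {
        adjust(memoryMb, cpu);
    }

    // Called when a container under this queue is released.
    void onContainerReleased(long memoryMb, int cpu) {
        adjust(-memoryMb, -cpu);
    }

    // Applies the delta to this queue and every ancestor up to the root.
    private void adjust(long memoryDeltaMb, int cpuDelta) {
        for (UsageQueue q = this; q != null; q = q.parent) {
            q.usedMemoryMb += memoryDeltaMb;
            q.usedCpu += cpuDelta;
        }
    }
}
```

The update costs one step per level of the queue tree, which is small compared with summing every job under a queue each time two queues are compared. In a real scheduler these updates would also need to run under the scheduler's lock or another form of synchronization.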

Optimizing Job‑Skip Time

After the sorting optimization, skipping jobs that have no resource demand still cost about 20 seconds. Filtering out zero-demand jobs before sorting reduced that time to a negligible level.
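A minimal sketch of the filter-then-sort idea follows. The `PendingApp` type, its fields, and the usage-based ordering are assumptions made for illustration; the point is only that apps with nothing to request are dropped before the sort, so they are neither compared nor visited later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical view of a runnable app as seen by the scheduler.
class PendingApp {
    final String id;
    final int pendingContainers; // containers still requested; 0 means no demand
    final long usedMemoryMb;     // stand-in sort key for the fair policy

    PendingApp(String id, int pendingContainers, long usedMemoryMb) {
        this.id = id;
        this.pendingContainers = pendingContainers;
        this.usedMemoryMb = usedMemoryMb;
    }
}

class DemandFilter {
    // Keeps only apps with outstanding demand, then orders them by usage.
    static List<PendingApp> schedulableInOrder(List<PendingApp> runnableApps) {
        List<PendingApp> withDemand = new ArrayList<>();
        for (PendingApp app : runnableApps) {
            if (app.pendingContainers > 0) {
                withDemand.add(app);
            }
        }
        withDemand.sort(Comparator.comparingLong((PendingApp a) -> a.usedMemoryMb));
        return withDemand;
    }
}
```

Because long-running jobs often hold their containers without requesting more, the filtered list can be much shorter than the full list of runnable apps.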

Parallel Queue Sorting

Sorting each queue for every container allocation scales linearly with the number of queues and jobs. We decoupled sorting from the scheduling thread, using a thread pool where each thread sorts a single queue after deep‑cloning the necessary metadata.

Result: sorting 2,000 queues takes at most 5 ms, which allows 200 or more sorting passes per second without impacting the workload.
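The decoupling described above can be sketched with a standard thread pool. This is an illustration under simplifying assumptions, not the production implementation: each queue is represented only by the usage values of its children, the class name `ParallelQueueSorter` is hypothetical, and a plain copy of the list stands in for the deep clone of queue metadata.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sorter that orders each queue's children on a worker thread.
class ParallelQueueSorter {

    // Input: queue name -> usage of each child. Output: same keys, sorted copies.
    static Map<String, List<Long>> sortAll(Map<String, List<Long>> childUsageByQueue,
                                           int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<List<Long>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, List<Long>> e : childUsageByQueue.entrySet()) {
                // Snapshot taken on the calling thread; workers never see live data.
                List<Long> snapshot = new ArrayList<>(e.getValue());
                pending.put(e.getKey(), pool.submit(() -> {
                    snapshot.sort(Comparator.naturalOrder());
                    return snapshot;
                }));
            }
            Map<String, List<Long>> sorted = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<Long>>> e : pending.entrySet()) {
                sorted.put(e.getKey(), e.getValue().get()); // wait for each task
            }
            return sorted;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sorting queues", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("queue sort failed", ex);
        } finally {
            pool.shutdown();
        }
    }
}
```

Sorting a snapshot means the scheduling thread may act on an ordering that is slightly stale, which is the trade-off for taking the sort off the critical path. A long-lived pool would normally be reused rather than created per call as in this sketch.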

Stable Rollout Strategy

Online Rollback

Instead of restarting services, we expose optimization parameters via configuration files. Updating these parameters at runtime changes scheduler behavior without a restart, reducing downtime during peak hours.
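A runtime switch of this kind might look like the sketch below. The property key, the `SchedulerSwitches` class, and the way the file is reloaded are all assumptions; the source only states that parameters in configuration files can be updated at runtime. The flag is held in an atomic variable so that a reload thread can flip it while the scheduling thread reads it.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

// Hypothetical holder for optimization switches that can be changed at runtime.
class SchedulerSwitches {
    // Hypothetical property key; not a real YARN configuration name.
    static final String INCREMENTAL_USAGE_KEY = "scheduler.opt.incremental-usage.enabled";

    private final AtomicBoolean incrementalUsage = new AtomicBoolean(false);

    // Called whenever the configuration file has been re-read.
    void reload(Properties props) {
        String value = props.getProperty(INCREMENTAL_USAGE_KEY, "false");
        incrementalUsage.set(Boolean.parseBoolean(value));
    }

    // Chooses between the legacy and optimized code paths on every call.
    long resourceUsage(LongSupplier legacyPath, LongSupplier optimizedPath) {
        return incrementalUsage.get() ? optimizedPath.getAsLong() : legacyPath.getAsLong();
    }
}
```

Keeping the legacy path in place behind the switch is what makes rollback a configuration change rather than a redeployment.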

Data Auto‑Verification

We periodically compare the legacy on‑the‑fly resourceUsage (oldResourceUsage) with the new incremental calculation (newResourceUsage). Discrepancies trigger alerts and automatic correction using the trusted old values.
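The verification step can be sketched as a small comparison routine. The names `UsageVerifier` and `AlertSink` are hypothetical, and how the check is scheduled and how the alert is delivered are left open; the sketch only shows the rule stated above, that a mismatch raises an alert and the legacy value wins.

```java
// Hypothetical checker that reconciles the incremental counter with the legacy value.
class UsageVerifier {

    // Minimal alert hook; a real system would page or log through its own channel.
    interface AlertSink {
        void alert(String message);
    }

    // Returns the value the incremental counter should hold after verification.
    static long verify(String queueName,
                       long oldResourceUsage,
                       long newResourceUsage,
                       AlertSink alerts) {
        if (oldResourceUsage != newResourceUsage) {
            alerts.alert("resourceUsage mismatch on " + queueName
                    + ": old=" + oldResourceUsage + ", new=" + newResourceUsage);
            return oldResourceUsage; // correct using the trusted legacy value
        }
        return newResourceUsage;
    }
}
```

Because the legacy calculation is expensive, such a check is only practical when run periodically rather than on every scheduling decision.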

Conclusion and Future Outlook

This article has outlined the end-to-end performance optimization workflow for Meituan's YARN fair scheduler, from defining macro metrics to fine-grained instrumentation, algorithmic improvements, and safe production rollout. The main lessons are:

Define high‑level performance indicators.

Identify fine‑grained metrics to locate bottlenecks.

Use efficient stress-testing tools.

Apply optimizations: reduce algorithmic complexity, eliminate redundant work, parallelize.

Continuously anticipate business growth and iterate optimizations.

Deploy changes cautiously with defensive mechanisms.

Future directions include adopting Hadoop 3.0's Global Scheduling and YARN Federation to scale beyond 10,000 nodes.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
