Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

Meituan tuned its customized Hadoop YARN Fair Scheduler by pre-computing resource usage, filtering out zero-demand jobs, and parallelizing queue sorting. These changes cut sorting time from 30 s to 5 s per minute, raised throughput to 50,000 containers per second (CPS), made every optimization switchable at runtime for live roll-back, and let the scheduler serve clusters of roughly 10,000 nodes. Scaling to hundreds of thousands of nodes is left to future work.

Meituan Technology Team

YARN is the resource-management layer of Hadoop, responsible for allocating compute resources and scheduling jobs across a cluster. Meituan maintains its own YARN branch, based on the community 2.7.1 release, and uses it to run offline, real-time, and machine-learning workloads.

Typical workloads include:

Offline jobs: Hive on MapReduce and Spark‑SQL data‑warehouse tasks.

Real‑time jobs: Spark‑Streaming and Flink streaming tasks.

Machine‑learning jobs: TensorFlow, MXNet, and Meituan’s own large‑scale ML system (MLX).

The cluster faces severe scalability problems. Consider 1,000 nodes, each providing 100 CPU cores, and assume each task occupies one core for about a minute. If demand exceeds 100,000 cores but the scheduler can dispatch only 50,000 tasks per minute, utilization is 50,000 / 100,000 = 50 %. If the cluster grows to 5,000 nodes and the scheduler is not improved, utilization drops to 50,000 / 500,000 = 10 %.
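The utilization figures above can be reproduced with a few lines of arithmetic. The sketch below assumes, for illustration only, that every task occupies one CPU core for roughly one minute and that demand is never the bottleneck; the class and method names are hypothetical.

```java
// Hypothetical helper; names are illustrative and not part of YARN.
final class UtilizationEstimate {
    /**
     * Fraction of cluster cores kept busy when the scheduler can start at most
     * tasksPerMinute one-core, one-minute tasks (assumption) and demand is ample.
     */
    static double utilization(long nodes, long coresPerNode, long tasksPerMinute) {
        long totalCores = nodes * coresPerNode;
        return Math.min(1.0, (double) tasksPerMinute / totalCores);
    }

    public static void main(String[] args) {
        System.out.println(utilization(1_000, 100, 50_000)); // 1,000-node cluster
        System.out.println(utilization(5_000, 100, 50_000)); // 5,000-node cluster
    }
}
```

With a fixed dispatch rate, utilization falls in inverse proportion to cluster size, which is why scheduler throughput has to grow with the cluster.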

Overall Architecture

YARN Architecture

YARN finds suitable resources for jobs, launches tasks, and manages job lifecycles. Detailed design can be found in the official Hadoop documentation.

Resource Abstraction

YARN abstracts resources in two dimensions: CPU and memory. The following Java‑like classes illustrate the abstraction:

class Resource {
    int cpu;       // number of CPU cores
    int memoryMb;  // memory in MB
}

Job requests are expressed as List[ResourceRequest]:

class ResourceRequest {
    int numContainers;    // number of containers needed
    Resource capability;  // resources per container
}

The scheduler replies with List[Container]:

class Container {
    ContainerId containerId;  // globally unique ID
    Resource capability;      // container resources
    String nodeHttpAddress;   // NodeManager hostname where the container runs
}
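To make the abstraction concrete, here is a minimal sketch of how a job's total demand follows from its list of requests. `Res`, `Req`, and `DemandCalc` are simplified, hypothetical stand-ins for the classes above, not the real YARN types.

```java
import java.util.List;

// Simplified stand-ins for the classes above (hypothetical names, not the real YARN types).
final class Res {
    final int cpu;
    final int memoryMb;

    Res(int cpu, int memoryMb) {
        this.cpu = cpu;
        this.memoryMb = memoryMb;
    }
}

final class Req {
    final int numContainers;
    final Res capability;

    Req(int numContainers, Res capability) {
        this.numContainers = numContainers;
        this.capability = capability;
    }
}

final class DemandCalc {
    /** Sums numContainers x per-container capability over all requests of one job. */
    static Res totalDemand(List<Req> requests) {
        int cpu = 0;
        int memoryMb = 0;
        for (Req r : requests) {
            cpu += r.numContainers * r.capability.cpu;
            memoryMb += r.numContainers * r.capability.memoryMb;
        }
        return new Res(cpu, memoryMb);
    }
}
```

For example, ten containers of 1 core / 2,048 MB plus two containers of 4 cores / 8,192 MB add up to 18 cores and 36,864 MB.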

YARN Scheduling Architecture

The scheduler consists of several components (ResourceScheduler, AsyncDispatcher, ResourceTrackerService, ApplicationMasterService, AppMaster). The scheduling flow is:

AppMaster sends a List[ResourceRequest] to YARN and receives already allocated List[Container] from the scheduler.

NodeManager heartbeats trigger the scheduler to allocate containers for that node.
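The two heartbeat paths can be sketched as a toy model. Everything here (`ToyScheduler`, its method names, the "slot" notion) is a hypothetical simplification meant only to show that requests are recorded on one path and satisfied on the other.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the two heartbeat paths; all names are hypothetical.
final class ToyScheduler {
    private final Deque<String> pending = new ArrayDeque<>();          // one entry per requested container
    private final Map<String, List<String>> ready = new HashMap<>();   // allocated, not yet picked up

    /** AppMaster heartbeat: record new requests, return containers allocated so far. */
    synchronized List<String> allocate(String appId, int newRequests) {
        for (int i = 0; i < newRequests; i++) {
            pending.add(appId);
        }
        List<String> granted = ready.remove(appId);
        return granted == null ? new ArrayList<String>() : granted;
    }

    /** NodeManager heartbeat: place pending containers on the reporting node. */
    synchronized void nodeUpdate(String nodeId, int freeSlots) {
        while (freeSlots > 0 && !pending.isEmpty()) {
            String appId = pending.poll();
            ready.computeIfAbsent(appId, k -> new ArrayList<>()).add(appId + "@" + nodeId);
            freeSlots--;
        }
    }
}
```

In this model the AppMaster call never waits for an allocation; it only collects what earlier node heartbeats produced.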

YARN uses pluggable schedulers; Meituan adopts the FairScheduler.

Fair Scheduler

Jobs (Apps) are leaves of a tree‑shaped queue hierarchy. The scheduler locks the FairScheduler object to avoid concurrent modifications, then selects a node, traverses the queue tree, and finally picks an App to allocate a container.

Key steps for each queue during scheduling:

Pre‑check: verify that the queue’s resource usage does not exceed its quota.

Sort child queues/apps according to the fairness policy.

Recursively schedule child queues/apps.

Example path: ROOT → ParentQueueA → LeafQueueA1 → App11, where a container is allocated to App11 on the selected node.

Performance Evaluation

Traditional online-service metrics such as QPS and TP99 do not suit a scheduler. The AppMaster's request RPC only collects containers that were already allocated, so RPC latency says little about how long scheduling actually took. Instead, Meituan defines two business-level metrics:

validSchedule: a per-minute binary indicator (1 = the scheduler met demand, 0 = it did not). A minute counts as successful if cluster usage exceeds 90 % or there is no valid pending demand.

CPS (Containers per Second): the number of containers the scheduler can dispatch per second. CPS is used in stress tests to determine the upper bound of scheduling capacity.

Formulas used in the evaluation:

validPending = min(queuePending, QueueMaxQuota)
if (usage / total > 90% || validPending == 0) {
    validSchedulePerMin = 1;
} else {
    validSchedulePerMin = 0;
}

Aggregated daily metric: validSchedulePerDay = ΣvalidSchedulePerMin / 1440. Current production targets are validSchedulePerMin > 0.9 and validSchedulePerDay > 0.99.
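A minimal Java sketch of the two formulas follows. The class and parameter names are hypothetical, and the daily value is computed as the mean of the supplied per-minute samples (1,440 of them for a full day).

```java
// Hypothetical helper mirroring the formulas above; not Meituan's monitoring code.
final class ValidSchedule {
    /** Returns 1 if the minute counts as validly scheduled, otherwise 0. */
    static int perMinute(long usage, long total, long queuePending, long queueMaxQuota) {
        long validPending = Math.min(queuePending, queueMaxQuota);
        boolean clusterBusy = (double) usage / total > 0.90;
        return (clusterBusy || validPending == 0) ? 1 : 0;
    }

    /** Mean of the per-minute indicators; pass 1,440 samples for one day. */
    static double perDay(int[] perMinuteValues) {
        int sum = 0;
        for (int v : perMinuteValues) {
            sum += v;
        }
        return (double) sum / perMinuteValues.length;
    }
}
```

For instance, a day with 10 failed minutes out of 1,440 yields 1430 / 1440, or roughly 0.993.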

Key Optimization Points

Optimizing the Sorting Comparison Function

The sorting step consumed ~30 s per minute, dominating total scheduling time. The original comparator repeatedly recomputed resourceUsage by recursively summing all descendant apps, leading to O(N²) complexity.

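Optimization: maintain resourceUsage incrementally. Whenever a container is allocated or released, the change is applied to the app and its ancestor queues, so the comparator reads a cached value in O(1) time instead of recomputing it.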

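// Simplified scheduling loop; the sort below is the step being optimized.
class FairScheduler {
    // Returns the resources of the container allocated on this node, or null if none.
    synchronized Resource attemptScheduling(NodeId node) {
        return root.assignContainer(node);   // root of the queue tree
    }
}

class Queue {
    Resource assignContainer(NodeId node) {
        if (!preCheck(node)) return null;    // pre-check failed, e.g. queue over quota
        sort(this.children);                 // order children by the fairness policy
        if (this.isParent) {
            for (Queue q : this.children) {  // recurse into child queues
                Resource assigned = q.assignContainer(node);
                if (assigned != null) return assigned;
            }
        } else {
            for (App app : this.runnableApps) {
                Resource assigned = app.assignContainer(node);
                if (assigned != null) return assigned;
            }
        }
        return null;                         // nothing could be placed on this node
    }
}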

Result: sorting time dropped from 30 s to 5 s per minute.
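The pre-computation idea can be sketched as follows. This is an illustrative model under stated assumptions, not Meituan's code: `CachedQueue` is a hypothetical class, usage is reduced to a single core count, and the comparator orders by raw usage only, whereas a real fair-share policy also weighs factors such as queue weights and minimum shares.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical queue node that caches the usage of its whole subtree.
final class CachedQueue {
    final String name;
    final CachedQueue parent;
    final List<CachedQueue> children = new ArrayList<>();
    private long usedCores;   // cached sum over this node and all descendants

    /** Simplified ordering: least-used first, reading only the cached value. */
    static final Comparator<CachedQueue> LEAST_USED_FIRST =
            Comparator.comparingLong(CachedQueue::getUsage);

    CachedQueue(String name, CachedQueue parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    /** Call on container allocation (positive delta) or release (negative delta). */
    void addUsage(long deltaCores) {
        // Walk the ancestor chain once; cost is the tree depth, not the subtree size.
        for (CachedQueue q = this; q != null; q = q.parent) {
            q.usedCores += deltaCores;
        }
    }

    /** O(1) read used by the comparator; no recursion over descendants. */
    long getUsage() {
        return usedCores;
    }
}
```

The trade-off is a small amount of extra work per allocation or release in exchange for constant-time comparisons during every sort.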

Skipping Jobs Without Resource Demand

After the previous optimization, the time spent skipping jobs that have no resource demand grew to about 20 s per minute and became the new bottleneck. Filtering out queues and apps with zero demand before sorting made this cost negligible.
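A minimal sketch of filtering before sorting is shown below. `SchedEntry` and `DemandFilter` are hypothetical names, and ordering by raw usage stands in for the real fairness comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a queue or app that can be scheduled.
final class SchedEntry {
    final String name;
    final long demand;   // outstanding resource demand
    final long usage;    // resources currently in use

    SchedEntry(String name, long demand, long usage) {
        this.name = name;
        this.demand = demand;
        this.usage = usage;
    }
}

final class DemandFilter {
    /** Drops entries with no demand, then sorts only the remaining candidates. */
    static List<SchedEntry> candidates(List<SchedEntry> all) {
        List<SchedEntry> withDemand = new ArrayList<>();
        for (SchedEntry e : all) {
            if (e.demand > 0) {
                withDemand.add(e);
            }
        }
        withDemand.sort(Comparator.comparingLong((SchedEntry e) -> e.usage));
        return withDemand;
    }
}
```

Besides removing the skip loop from the dispatch path, the filter also shrinks the input of the sort itself.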

Parallel Queue Sorting

Because sorting is performed for every container allocation, its cost grows with the number of queues and apps. The new design moves sorting off the dispatch thread and into a thread pool: each pool thread sorts one queue's children, working on a cloned copy of the metadata it needs.
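One way to picture this design is the sketch below. It is an assumption-laden illustration rather than Meituan's implementation: `SortableQueue` and `ParallelSorter` are hypothetical, the fairness policy is reduced to ordering by usage, and the caller waits for all sorts to finish, whereas a production scheduler would run them in a background loop.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical queue type; a sketch of the idea only.
final class SortableQueue {
    final String name;
    volatile long usage;                                              // written by the dispatch thread
    final List<SortableQueue> children = new ArrayList<>();
    volatile List<SortableQueue> sortedChildren = new ArrayList<>();  // last published order

    SortableQueue(String name, long usage) {
        this.name = name;
        this.usage = usage;
    }
}

final class ParallelSorter {
    /** Sorts one parent's children against a private snapshot of their usage. */
    static void sortOne(SortableQueue parent) {
        List<SortableQueue> copy = new ArrayList<>(parent.children);
        // Clone the metadata so values cannot change while the sort is running.
        Map<SortableQueue, Long> usageSnapshot = new HashMap<>();
        for (SortableQueue c : copy) {
            usageSnapshot.put(c, c.usage);
        }
        copy.sort(Comparator.comparingLong((SortableQueue c) -> usageSnapshot.get(c)));
        parent.sortedChildren = copy;   // publish; the dispatch thread reads this list
    }

    /** Submits one sort task per parent queue and waits for all of them. */
    static void sortAll(List<SortableQueue> parents, Executor pool) {
        List<CompletableFuture<Void>> tasks = new ArrayList<>();
        for (SortableQueue parent : parents) {
            tasks.add(CompletableFuture.runAsync(() -> sortOne(parent), pool));
        }
        CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
    }
}
```

Sorting against a snapshot matters for correctness as well as speed: a comparator whose inputs change during a sort can produce an inconsistent ordering. The price is that the dispatch thread may act on an order that is slightly stale.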

Performance gains:

Sorting 2,000 queues once takes about 2 to 5 ms.

In a stress test with 10,000 apps, 12,000 nodes, 2,000 queues, and a container runtime of 40 s, CPS reached 50,000, enough to fully saturate the cluster.

Stability and Roll‑back Strategies

Rolling back a release in production is slow and causes downtime. Each optimization is therefore guarded by a configuration parameter, so the scheduler can be switched back to the original logic live, without restarting services.

To avoid inconsistent reads of configuration during a scheduling cycle, the scheduler copies all relevant parameters at the start of each cycle.
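The per-cycle copy can be sketched with an immutable settings object that is read once at the start of each cycle. The switch names below are hypothetical, since the source does not list the actual parameters, and the `midCycle` hook exists only to simulate a reload arriving while a cycle is running.

```java
// Hypothetical switches; the real parameter names are not given in the source.
final class SchedulerSwitches {
    final boolean skipZeroDemand;
    final boolean parallelSort;

    SchedulerSwitches(boolean skipZeroDemand, boolean parallelSort) {
        this.skipZeroDemand = skipZeroDemand;
        this.parallelSort = parallelSort;
    }
}

final class SwitchableScheduler {
    // Immutable object behind a volatile reference: a reload swaps it atomically.
    private volatile SchedulerSwitches current = new SchedulerSwitches(true, true);

    /** Called by an admin or refresh thread to change behavior live. */
    void reload(SchedulerSwitches next) {
        current = next;
    }

    /** Runs one cycle and reports which code paths it took. */
    String runCycle(Runnable midCycle) {
        final SchedulerSwitches s = current;   // single read: one consistent snapshot
        String filterStep = s.skipZeroDemand ? "filter" : "no-filter";
        midCycle.run();                        // configuration may change here
        String sortStep = s.parallelSort ? "parallel-sort" : "inline-sort";
        return filterStep + "," + sortStep;
    }
}
```

A reload that lands mid-cycle takes effect only from the next cycle, so no cycle ever mixes old and new settings.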

Future Outlook

Current optimizations support clusters up to ~10,000 nodes. For larger scales (100k‑1M nodes), Meituan is following community developments such as Hadoop 3.0’s Global Scheduling and YARN Federation, which enable horizontal scaling of schedulers across multiple clusters.

Authors

Shilong and Tingwen, R&D engineers in Meituan’s User Platform Data & Algorithm Department.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
