How LinkedIn Scaled Hadoop to 11,000 Nodes and Solved YARN Delays
LinkedIn’s engineers detail how they repeatedly doubled their Hadoop cluster to over 11,000 nodes, tackled YARN scheduling delays caused by workload imbalances, and created the DynoYARN simulation tool to predict performance impacts of massive scaling.
LinkedIn’s Hadoop Scaling Journey
Although Hadoop is no longer headline news, LinkedIn continues to rely on it for large‑scale data analytics and machine‑learning workloads.
In a recent blog post, engineers Keqiu Hu, Jonathan Hung, Haibo Chen and Sriram Rao explain that LinkedIn doubles the size of its Hadoop cluster each year to keep pace with exponential data growth.
The company had previously addressed HDFS NameNode limitations, but YARN—the resource scheduler—still struggled when the primary communication service cluster was merged with a data‑obfuscation cluster, causing job delays of several hours even with ample capacity.
Initial investigations suspected a bug in YARN’s partition‑handling logic, yet no code defect was found. The delay was traced to YARN’s difficulty allocating containers when different queues exhibited disparate workload characteristics, leading to temporary deadlocks.
Engineers observed that the scheduler became stuck on workloads in queue A while resources were scarce for queue B. The merged cluster’s primary partition ran AI experiments and long‑running Spark jobs, whereas the secondary partition handled high‑speed MapReduce jobs.
Because the merge slowed container scheduling, fairness issues surfaced. Although a patch was submitted to Apache Hadoop, the root cause remained unclear until the team fixed inefficiencies in heartbeat monitoring that affected queue assignment.
After optimizing the logic so that a node’s heartbeat is considered only for applications within the same partition, the worst‑case scenario saw a nine‑fold efficiency gain.
To prevent future scaling bottlenecks, LinkedIn built DynoYARN, a tool that simulates YARN clusters of arbitrary size and workload. DynoYARN predicts that the current cluster can grow to about 11,000 nodes before latency exceeds ten minutes; expanding to 12,000 nodes would push latency toward twenty minutes, breaching SLAs.
DynoYARN can also forecast the performance impact of new applications. Recognizing YARN’s single‑threaded architecture limits scalability, a Microsoft subsidiary is developing a new cluster coordinator project.
LinkedIn also employs a Round‑Robin load‑balancing service that exposes a simple REST API to route YARN applications across multiple Hadoop clusters dynamically.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
