How Pinterest Scaled a Hadoop Upgrade Across 17k Nodes
Pinterest upgraded Monarch, its batch‑processing platform of more than 17k YARN nodes on AWS, from Hadoop 2.7.1 to 2.10.0. The upgrade proceeded in phases, cluster by cluster, to minimize downtime, and relied on extensive validation and custom patches to handle compatibility and dependency issues.
Monarch is Pinterest’s batch‑processing platform consisting of more than 30 Hadoop YARN clusters and over 17 k EC2 nodes. In early 2021 the platform still ran Hadoop 2.7.1, and the growing number of internal patches made a version upgrade necessary. Pinterest chose Hadoop 2.10.0, the latest Hadoop 2 release at the time.
Challenges
Since the platform’s inception in 2016, workloads have grown and evolved, requiring hundreds of internal changes. Most critical batch jobs run on Monarch, so the upgrade had to avoid cluster downtime or SLA impact.
Upgrade Strategy
The upgrade was split into two phases. Phase 1 upgraded the platform itself from Hadoop 2.7 to Hadoop 2.10 while allowing user jobs to continue using Hadoop 2.7 dependencies. Phase 2 upgraded user‑defined applications to Hadoop 2.10. Because of the platform’s size, the work had to proceed step by step:
Upgrade each Monarch cluster one by one.
Batch‑upgrade user applications to bind to Hadoop 2.10 instead of 2.7.
Because there was no flexible build pipeline to produce two versions of a job with separate Hadoop dependencies, Pinterest had to run extensive pre‑upgrade validation to ensure that the majority of Hadoop 2.7‑built applications would still run on the 2.10 clusters.
Hadoop 2.10 Release Preparation
Pinterest’s Hadoop 2.7 branch contained many internal patches that needed to be ported to Apache Hadoop 2.10. Significant changes between the two versions made this non‑trivial. Examples of the ported patches include:
DirectOutputFileCommitter to write task output directly to S3, avoiding copy overhead.
Endpoints for ApplicationMaster and HistoryServer to expose per‑task counters.
Range‑lookup for container logs to fetch partial logs.
Node‑Id partitioning for log aggregation to avoid S3 request‑rate limits.
Fail‑over handling for new NameNode IP addresses.
Disabling reducer preemption when the ratio of mappers already run to total mappers exceeds a configured threshold.
Disk‑usage monitoring thread in the ApplicationMaster to terminate jobs that exceed limits.
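As an illustration of one of these patches, the reducer‑preemption rule can be expressed as a simple ratio check. This is only a sketch of the idea as described above, not Pinterest's patch: the function name, parameter names, and the behavior when a job has no mappers are assumptions.

```python
def should_preempt_reducers(mappers_run: int, total_mappers: int,
                            threshold: float) -> bool:
    """Return True while reducer preemption should stay enabled.

    Hypothetical helper: preemption is switched off once the share of
    mappers that have already run exceeds `threshold`.
    """
    if total_mappers <= 0:
        # Assumption: with no mappers there is no ratio to compute,
        # so the default behavior (preemption enabled) is kept.
        return True
    ratio = mappers_run / total_mappers
    return ratio <= threshold


if __name__ == "__main__":
    print(should_preempt_reducers(10, 100, 0.8))  # early in the map phase
    print(should_preempt_reducers(90, 100, 0.8))  # most mappers already ran
```

The intent of such a rule is that once most of the map phase is done, killing reducers to free resources for the remaining mappers costs more than it saves.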
Cluster Upgrade Methods Explored
Method 1: Cross‑Cluster Routing (CCR)
CCR, described in the “Efficient Resource Management” paper, creates a new Hadoop 2.10 cluster and gradually routes workloads from the old 2.7 cluster. If problems arise, workloads can be routed back, fixed, and re‑routed. Evaluation on small production and dev clusters showed two drawbacks: the cost of building a parallel cluster for each migration (expensive for clusters with thousands of nodes) and the time‑consuming bulk workload migration.
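The gradual routing and route‑back behavior of CCR can be pictured with a small sketch. It assumes a percentage‑based rollout keyed on a stable hash of the job name plus a set of jobs pinned to the old cluster; the cluster labels, function name, and parameters are hypothetical and do not come from Pinterest's router.

```python
import zlib
from typing import AbstractSet

OLD_CLUSTER = "hadoop-2.7"   # hypothetical cluster labels
NEW_CLUSTER = "hadoop-2.10"


def pick_cluster(job_name: str, percent_to_new: int,
                 pinned_to_old: AbstractSet[str] = frozenset()) -> str:
    """Choose a target cluster for a job.

    Jobs in `pinned_to_old` are always routed back to the old cluster,
    which models the "route back, fix, re-route" loop. Other jobs are
    bucketed by a stable hash so the same job always gets the same answer
    for a given rollout percentage.
    """
    if job_name in pinned_to_old:
        return OLD_CLUSTER
    bucket = zlib.crc32(job_name.encode("utf-8")) % 100
    return NEW_CLUSTER if bucket < percent_to_new else OLD_CLUSTER


if __name__ == "__main__":
    for name in ["daily_report", "ads_join", "feature_backfill"]:
        print(name, pick_cluster(name, 25, pinned_to_old={"ads_join"}))
```

Raising `percent_to_new` moves more workloads over, and adding a job to `pinned_to_old` sends it back without touching the rest.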
Method 2: Rolling Upgrade
Rolling upgrade of Worker nodes was considered, but it risked affecting all workloads on the cluster and made rollback expensive.
Method 3: In‑place Upgrade
This method was inspired by how instance types are upgraded with canaries: hosts of the new instance type are added to an auto‑scaling group as canaries and evaluated against the base group, then the canary group is expanded and the base group is shrunk. Applied to Hadoop, the steps are:
Add Hadoop 2.10 Worker nodes (HDFS DataNode and YARN NodeManager) to the Hadoop 2.7 cluster.
Identify and fix any issues that arise.
Gradually increase the number of 2.10 workers while decreasing 2.7 workers until all workers are 2.10.
Upgrade all master services (NameNode, JournalNodes, ResourceManager, HistoryServer, etc.) in a similar replacement fashion.
Extensive testing on production Monarch clusters showed a seamless upgrade experience with only minor issues.
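The worker replacement steps above can be sketched as a loop that grows the 2.10 group in batches and shrinks the 2.7 group only after a health check passes. This is a simplified model under stated assumptions: the batch size, the `canary_healthy` callback, and the decision to stop at the first failed check are illustrative and not taken from Pinterest's tooling.

```python
from typing import Callable, List, Tuple


def plan_worker_replacement(
    total_workers: int,
    batch_size: int,
    canary_healthy: Callable[[int], bool],
) -> List[Tuple[int, int]]:
    """Return (old_workers, new_workers) after each completed step.

    Each step first adds a batch of new-version workers, then asks
    `canary_healthy(new_count)` whether the cluster still looks good.
    Old workers are removed only after a passing check; a failing check
    backs out the batch and stops so the issue can be investigated.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    old, new = total_workers, 0
    history = [(old, new)]
    while old > 0:
        step = min(batch_size, old)
        new += step                      # add new-version workers first
        if not canary_healthy(new):
            new -= step                  # back out the unhealthy batch
            break
        old -= step                      # then shrink the old group
        history.append((old, new))
    return history


if __name__ == "__main__":
    print(plan_worker_replacement(10, 4, lambda new_count: True))
```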
Final Upgrade Plan
Because most jobs were built with Hadoop 2.7 jars placed in the distributed cache, they could inadvertently load 2.7 jars on 2.10 nodes, causing dependency conflicts. After evaluating the three methods, Pinterest selected Method 3 (in‑place upgrade) for its speed and ability to resolve most dependency problems quickly. When a job could not be fixed promptly, CCR was used to route it back to a Hadoop 2.7 cluster for later remediation.
Issues Encountered and Fixes
Incompatible Behavior
Restarting a Hadoop 2.10 NodeManager killed its active containers because the newly introduced setting yarn.nodemanager.recovery.supervised defaults to false. Setting it to true keeps containers from being cleaned up during NM restarts.
The ApplicationPriority field was missing from some protobuf responses, which caused jobs to hang; the fix was to ignore the field when it is absent.
HADOOP‑13680: fs.s3a.readahead.range accepted “32M” format from Hadoop 2.8 onward, which Hadoop 2.7 could not parse. A compatibility fix was added to Hadoop 2.7.
Hadoop 2.10 introduced extra spaces between the values of io.serializations, which caused ClassNotFoundException errors; the fix stripped spaces from the config values.
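Two of these fixes amount to more tolerant config parsing: accepting size values with a unit suffix such as "32M", and trimming whitespace around entries in a comma‑separated class list. The sketch below shows both ideas in isolation. It is not Hadoop's code; the function names are hypothetical, and the supported suffixes are an assumption chosen for illustration.

```python
from typing import List

# Binary multipliers for the suffixes this sketch understands (assumption).
_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_size(value: str) -> int:
    """Parse a byte count that is either plain digits or digits + suffix."""
    text = value.strip().lower()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def split_trimmed(value: str) -> List[str]:
    """Split a comma-separated list and drop surrounding whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    print(parse_size("32M"))
    print(split_trimmed("com.example.A, com.example.B ,com.example.C"))
```

Without the trimming step, a class name such as " com.example.B" (with a leading space) would be passed to the class loader verbatim, which is the kind of lookup failure described above.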
Dependency Issues
Hadoop 2.7 jars in the distributed cache caused runtime conflicts; a fix prevented these jars from being added.
woodstox-core-5.0.3.jar, which ships with Hadoop 2.10, was incompatible with the wstx-asl-3.2.7.jar used by user applications; the woodstox-core jar was shaded to resolve the conflict.
The internal library S3DoubleWrite was no longer needed and was removed.
Some Hadoop 2.7 libraries were bundled in user Bazel jars, leading to runtime conflicts; the solution was to decouple user jars from Hadoop jars.
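The first fix in this list, keeping Hadoop jars out of the distributed cache, can be pictured as a filter over the list of cache entries. The article does not say how Pinterest identified the jars, so the rule used here (file name starts with "hadoop-" and ends with ".jar") and the function name are assumptions for illustration only.

```python
import posixpath
from typing import Iterable, List


def drop_hadoop_jars(cache_entries: Iterable[str]) -> List[str]:
    """Return the cache entries with Hadoop jars removed.

    Assumption: a Hadoop jar is any entry whose file name starts with
    "hadoop-" and ends with ".jar". Other entries keep their order.
    """
    kept = []
    for entry in cache_entries:
        name = posixpath.basename(entry)
        if name.startswith("hadoop-") and name.endswith(".jar"):
            continue  # rely on the jar installed on the cluster node
        kept.append(entry)
    return kept


if __name__ == "__main__":
    entries = [
        "s3://bucket/libs/hadoop-common-2.7.1.jar",   # hypothetical paths
        "s3://bucket/libs/guava-11.0.2.jar",
        "s3://bucket/libs/my-job.jar",
    ]
    print(drop_hadoop_jars(entries))
```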
Other Problems
Rolling the NameNode back to Hadoop 2.7 failed because the Hadoop 2.10 DataNodes did not send block reports; block reports were triggered manually and a fix for HDFS‑12749 was applied.
Hadoop streaming jobs bundled with Hadoop 2.7 jars could not find the expected jars on Hadoop 2.10 nodes. The fix removed version strings from the bundled Hadoop jar so the cluster‑provided jar is used at runtime.
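The streaming fix relies on jar names that carry no version, so a job refers to the same name regardless of which Hadoop release the node runs. A minimal sketch of that renaming rule follows; the regular expression and function name are assumptions, and the example file name is illustrative.

```python
import re

# Matches a trailing "-<digits>.<digits>[.<digits>...]" right before ".jar".
_VERSION_SUFFIX = re.compile(r"-\d+(?:\.\d+)+(?=\.jar$)")


def strip_version(jar_name: str) -> str:
    """Remove a numeric version suffix from a jar file name, if present."""
    return _VERSION_SUFFIX.sub("", jar_name)


if __name__ == "__main__":
    print(strip_version("hadoop-streaming-2.7.1.jar"))
```

Names without a version suffix pass through unchanged, so the rule is safe to apply repeatedly.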
Upgrading User Programs to Hadoop 2.10
To upgrade user applications, Pinterest ensured that Hadoop 2.7 jars were not shipped with user jars, allowing the cluster‑provided Hadoop jars to be used. The build environment was switched to Hadoop 2.10.
Decoupling User Applications from Hadoop Jars
Most data pipelines use Bazel‑built fat jars that contain Hadoop 2.7 client libraries. Because the classloader prefers classes from the fat jar, jobs running on Hadoop 2.10 clusters still used Hadoop 2.7 classes. The solution was to stop embedding Hadoop jars in the fat jar and rely on the Hadoop jars installed on the cluster.
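One way to check whether a fat jar still embeds Hadoop classes is to list its entries and look for the org/apache/hadoop/ package path. The sketch below uses Python's standard zipfile module, since a jar is a zip archive; the function name is hypothetical and this is not the check Pinterest used.

```python
import io
import zipfile
from typing import BinaryIO, List, Union


def embedded_hadoop_classes(jar: Union[str, BinaryIO]) -> List[str]:
    """Return the sorted Hadoop class entries found inside a jar.

    `jar` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("org/apache/hadoop/") and name.endswith(".class")
        )


if __name__ == "__main__":
    # Build a tiny in-memory jar to demonstrate the check.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as demo:
        demo.writestr("org/apache/hadoop/fs/Path.class", b"")
        demo.writestr("com/example/MyJob.class", b"")
    buffer.seek(0)
    print(embedded_hadoop_classes(buffer))
```

An empty result suggests the jar depends on the Hadoop classes provided by the cluster rather than carrying its own copies.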
Updating Bazel Targets
After decoupling, Hadoop Bazel targets were upgraded from 2.7 to 2.10. New dependency conflicts were identified through build tests and resolved by upgrading the conflicting libraries to compatible versions.
Conclusion
Upgrading more than 17 k nodes from one Hadoop version to another without causing major application breakage was a significant challenge. Pinterest achieved the upgrade with high quality, reasonable speed, and cost‑effectiveness, and the lessons shared are intended to benefit the broader community.
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.