How Vivo Upgraded a Million‑Node YARN Cluster: Architecture, Scheduler Switch, and Performance Optimizations
This article details Vivo's end‑to‑end upgrade of a YARN 2.6.0 cluster to a modern version for a million‑node, hundred‑thousand‑tasks‑per‑day platform, covering architectural evolution, scheduler migration, compatibility fixes, performance tuning, and service‑continuity strategies.
Background
Vivo’s early Hadoop YARN 2.6.0 deployment could no longer meet the resource‑management efficiency and scalability required by a cluster of tens of thousands of nodes handling nearly one million daily tasks. The version lacked Node Labels, Timeline Server, and YARN Federation, and most compute engines except Spark 3 could not read erasure‑coded cold data.
YARN Architecture Evolution
Since 2020 the team systematically refactored the platform, moving from a monolithic cluster to a fine‑grained, elastic resource‑management system. The evolution enabled better isolation, fair scheduling, and integration of cold‑storage access, laying a foundation for AI, multi‑cloud, and serverless workloads.
Upgrade Process and Challenges
The upgrade required replacing the Fair Scheduler with the Capacity Scheduler, upgrading ResourceManager and NodeManager components, and aligning all compute‑engine Hadoop client versions. A phased rollout—ResourceManager → NodeManager → compute engines → History Server—minimized compatibility risks.
3.1 Scheduler Switch
Because the older Fair Scheduler was heavily tuned, the team performed extensive compatibility work to adopt the Capacity Scheduler. Key actions:
Full‑path queue names : YARN 3.1 dropped support for hierarchical queue names. By back‑porting YARN‑9879 (merged in 3.3) and applying 35 PRs, full‑path names were re‑enabled and leaf‑queue uniqueness constraints relaxed.
Plan mode : The team replicated Cloudera Manager’s queue‑plan feature in an internal tool, preserving dynamic minimum/maximum capacity adjustments.
Custom queue limits : Implemented YARN‑9930 to enforce max running apps per queue, and YARN‑10531 to disable user‑limit‑factor by default, improving elasticity.
3.2 Performance Optimizations
To reduce the CPU cost of sorting queues and applications during container allocation, a caching mechanism was added:
Two new configs—
yarn.resourcemanager.capacity.child-queue-cache-refresh-interval-msand
yarn.resourcemanager.capacity.child-app-cache-refresh-interval-ms—control cache lifetimes.
With a 10 ms cache, sorting frequency dropped from ~500 k per minute to ~18 k, cutting CPU load by >96 %.
Additional optimizations eliminated invalid allocations caused by tasks with no pending resources and by parent‑queue resource exhaustion. The DominantResourceCalculator bugs (YARN‑11083, YARN‑10903) were fixed by using Resources.fitsIn or Resources.componentwiseMin for precise multidimensional checks.
3.3 Service Continuity
During the upgrade, ResourceManager and NodeManager operated in a mixed Hadoop 2/Hadoop 3 state. To guarantee seamless failover:
A conversion tool translated fair‑scheduler.xml to capacity‑scheduler.xml, preserving queue hierarchy, weights, and limits.
NodeManager downgrade issues caused by new container‑state fields were solved by enhancing the low‑version recovery logic to ignore unknown fields safely.
Token serialization incompatibility (Java byte stream vs. Protobuf) was addressed with a dynamic switch: the ResourceManager starts with the legacy format, then flips to Protobuf after all NodeManagers are upgraded.
Engine‑Side Upgrade
Compatibility across compute engines was achieved by:
Standardising Hadoop client classpaths ( yarn.application.classpath, mapreduce.application.classpath) so every container uses the same Hadoop 3 libraries.
Forcing tasks to use submission‑node configuration rather than NodeManager‑local hdfs-site.xml, preventing accidental replication factor changes.
Upgrading the Hive client to a version compatible with Hadoop 3 (via HIVE‑16081) and distributing the patched JAR via an internal Maven repository.
Gradual Migration of JAR Tasks
A three‑step gray‑scale process moved JAR‑type jobs from Hadoop 2 to Hadoop 3:
Tag all existing jobs with a hadoop2 label that injects VIVO_HADOOP_VERSION=2 at runtime.
Gradually remove the label; jobs without the label default to Hadoop 3. Failures trigger automatic re‑labeling to retry on Hadoop 2.
Fix incompatibilities discovered in step 2 and repeat the gray‑scale until all jobs run on Hadoop 3.
Summary and Outlook
The massive YARN upgrade eliminated technical debt, introduced fine‑grained scheduling, and unlocked lower cost, higher elasticity, and stronger governance. Future work includes YARN Federation, GPU scheduling, and deeper resource isolation to build a truly demand‑driven, intelligent data infrastructure.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
