Big Data 12 min read

JD's Large‑Scale Hadoop Cluster Resource Management and Scheduling Architecture

This article describes how JD built a multi‑regional, ten‑thousand‑node Hadoop ecosystem, unified resource management with YARN, introduced a three‑level Router scheduling layer, optimized performance, and integrated deep‑learning frameworks to achieve high availability, cost efficiency, and scalable big‑data processing.

JD Tech
JD Tech
JD Tech
JD's Large‑Scale Hadoop Cluster Resource Management and Scheduling Architecture

As JD's e‑commerce business rapidly expanded, the original Hadoop clusters could no longer meet the soaring storage and compute demands, prompting the need for a unified, massive‑scale Hadoop platform that can serve multiple parallel frameworks.

Hadoop, a decade‑old big‑data platform, uses inexpensive commodity machines to form a distributed storage (HDFS) and compute (MapReduce) system; since Hadoop 2.0 the compute layer has been replaced by YARN, which provides resource management and job scheduling.

The core YARN components are ResourceManager, NodeManager, ApplicationMaster, Container, and Client, each responsible for cluster‑wide resource allocation, node‑level execution, application coordination, resource encapsulation, and job submission.

JD introduced a distributed resource‑management and scheduling system that aggregates scattered clusters into a single super‑cluster, allowing all parallel frameworks (Presto, Alluxio, TensorFlow, Caffe, etc.) to share resources via YARN, thereby improving utilization and reducing waste.

By running Docker containers on YARN, JD isolates execution environments, supports GPU and other hardware scheduling, and provides unified logging tools, enabling each user to customize their own runtime without affecting others.

The platform also migrated many legacy clusters to YARN, built deep‑learning on‑YARN solutions, and achieved cross‑region data and job migration through a multi‑datacenter HDFS and scheduling extension.

A new Router component adds a third‑level scheduling abstraction, turning the original two‑level YARN scheduling into a three‑level hierarchy (Router → Sub‑cluster → YARN), allowing dynamic cross‑sub‑cluster resource borrowing and logical queue management.

Router works with a State&Policy Store that persistently records sub‑cluster status, active ResourceManager addresses, and scheduling policies; high‑availability is ensured by deploying multiple Router instances with automatic failover.

The job submission flow is: Client → Router (selects logical queue) → Router forwards to the appropriate ResourceManager → ApplicationMaster requests containers via AMRMProxy → containers are launched on the target sub‑cluster.

To support ten‑thousand‑node clusters, JD developed a queue‑mirroring multi‑path allocation strategy, added multi‑dimensional scheduling rules (memory, load, utilization), and performed RM code optimizations such as lock reduction and simplified resource calculations; MapReduce shuffle services were also tuned.

Benchmark and analysis tools used include HiBench, Hadoop's built‑in benchmarks, GCeasy, FastThread, Linux perf, NMON, and Google performance tools.

Future directions focus on cost reduction by integrating JD's private cloud (Archimedes) for elastic scaling during peak events, and productizing the platform's middleware and services for external customers.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SchedulingResource ManagementYARNHadoopJD.com
JD Tech
Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.