How JD’s Advertising Platform Optimizes Load Balancing for Heterogeneous Clusters
Exploring the evolution of JD’s advertising online model system, this article examines the challenges of heterogeneous hardware load balancing, outlines static and dynamic strategies—including DNS, Nginx, LVS, Ribbon, and Dubbo—and presents a multi‑objective framework that improves service availability and resource utilization, achieving up to 20%+ efficiency gains.
Load balancing is a perennial topic in distributed service architectures and is essential for improving resource utilization and service stability. This article traces the evolution of the load‑balancing strategies in JD’s advertising online model system, focusing on load balancing across heterogeneous hardware clusters under an optimal compute‑scheduling objective.
1. Background
1.1 Current Status
Production environments rely heavily on distributed service clusters.
Containerized deployment of heterogeneous nodes leads to inevitable performance imbalance.
Hardware component failure rates are unavoidable; upper‑layer applications must consider fault tolerance.
Traffic spikes during promotions are hard to predict, requiring a balance between stability and resource cost.
1.2 Problems
Load imbalance within the cluster results in low overall resource utilization.
Single‑node overload can trigger cluster‑wide scaling.
Occasional hardware anomalies (CPU, NIC, memory) affect overall service availability.
Unpredictable traffic during large promotions threatens service stability.
1.3 Requirements
Design a reasonable load‑balancing (LB) strategy to improve cluster resource utilization and service stability, effectively handling complex, variable traffic during major promotions.
2. Theoretical Foundations
2.1 General Load‑Balancing Problem
Load balancing is a key technology for improving system resource utilization and parallel computing performance, classified into static and dynamic types. Static LB pre‑assigns load before execution; dynamic LB (DLB) determines load division at runtime.
Given a set of compute‑communication tasks and a topology‑connected set of machines, the goal is to map tasks to machines to minimize total execution time.
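The mapping problem above can be sketched with a classic greedy heuristic, longest processing time first: repeatedly assign the largest remaining task to the currently least-loaded machine. The task costs and machine count below are illustrative, not from the article.

```python
import heapq

def greedy_schedule(task_costs, n_machines):
    """LPT heuristic: assign each task (largest first) to the currently
    least-loaded machine, approximating the minimal makespan."""
    # min-heap of (current_load, machine_id)
    heap = [(0.0, m) for m in range(n_machines)]
    heapq.heapify(heap)
    assignment = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, machine = heapq.heappop(heap)
        assignment[task] = machine
        heapq.heappush(heap, (load + cost, machine))
    return assignment

tasks = {"t1": 5.0, "t2": 3.0, "t3": 3.0, "t4": 2.0, "t5": 2.0}
print(greedy_schedule(tasks, 2))
```

This is a static strategy: it assumes task costs are known up front, which is exactly the assumption that dynamic strategies relax.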
2.2 Load‑Balancing Strategy Summary
2.2.1 Distributed Strategies
Basic neighbor exchange methods (diffusion, dimension exchange, gradient) achieve global balance through iterative local exchanges.
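The diffusion method can be sketched in a few lines: each node repeatedly moves a fraction of the load difference toward each neighbor, so load equalizes without any global coordinator. The ring topology, initial loads, and diffusion coefficient below are illustrative.

```python
def diffusion_step(loads, neighbors, alpha=0.25):
    """One round of the diffusion method: each node sheds a fraction alpha
    of the load difference toward each of its neighbors."""
    new = list(loads)
    for i in range(len(loads)):
        for j in neighbors[i]:
            new[i] -= alpha * (loads[i] - loads[j])
    return new

# ring of 4 nodes with hypothetical initial loads
loads = [10.0, 2.0, 4.0, 4.0]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for _ in range(50):
    loads = diffusion_step(loads, ring)
print(loads)  # converges toward the cluster average (5.0 per node)
```

Because every pairwise exchange is symmetric, total load is conserved at every step; only its distribution changes.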
2.2.2 Centralized Strategies
A designated processor collects global load information and makes balancing decisions.
2.2.3 Hybrid/Hierarchical Strategies
Hierarchical trees built on topology enable multi‑level balancing: nodes are grouped into domains, each domain balances internally, and higher levels balance among domains, culminating in global balance at the root.
3. Algorithm Hierarchy
3.1 System‑Level Load Balancing
DNS Load Balancing: Resolves a domain to multiple IPs; DNS servers select an IP based on a policy to distribute traffic.
Nginx Load Balancing: Acts as a reverse proxy, distributing requests to multiple back‑ends to improve response speed and reliability.
LVS/F5 + Nginx: Combines layer‑4 (LVS/F5) and layer‑7 (Nginx) balancing for high‑throughput scenarios.
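As an illustration (not JD’s actual configuration), a layer‑7 upstream in Nginx sitting behind an L4 VIP might look like the following; addresses and weights are hypothetical:

```nginx
# Layer-7 balancing behind an L4 VIP (LVS/F5); backends are illustrative.
upstream model_service {
    least_conn;                       # dynamic: prefer the least-busy backend
    server 10.0.0.11:8080 weight=3;   # heterogeneous nodes can carry weights
    server 10.0.0.12:8080 weight=1;
    server 10.0.0.13:8080 backup;     # used only when the primaries fail
}

server {
    listen 80;
    location / {
        proxy_pass http://model_service;
    }
}
```

The `weight` directive is the simplest static answer to heterogeneous hardware; the rest of the article is about what to do when static weights are not enough.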
3.2 Application‑Level Load Balancing
Ribbon: Client‑side HTTP/TCP load balancer offering strategies such as random, round‑robin, and least‑active.
Dubbo: Distributed service framework providing load‑balancing strategies (random, round‑robin, least‑active) via configuration or code.
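For example, Dubbo exposes the strategy through the `loadbalance` attribute, settable per service or per method (the interface and method names below are illustrative):

```xml
<!-- Per-reference strategy; the interface name is illustrative. -->
<dubbo:reference id="modelService"
                 interface="com.example.ModelService"
                 loadbalance="leastactive" />

<!-- Or per method, overriding the service-level default. -->
<dubbo:service interface="com.example.ModelService" loadbalance="roundrobin">
    <dubbo:method name="predict" loadbalance="random" />
</dubbo:service>
```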
4. Practice – Evolution of JD’s Model System LB Strategy
4.1 Common Strategies
Static: Round Robin, Random – simple, fast, assume homogeneous nodes.
Dynamic: Least Connections, Locality‑Aware – adjust based on real‑time node feedback.
4.2 Evolution Steps
Adaptation to Service Characteristics : For online feature services, use consistent‑hash based on user PIN to maintain cache hit rate.
Introduce Availability Target : Monitor per‑node success/failure rates, adjust traffic when node availability falls below cluster average, trigger degradation and recovery.
Add Heterogeneous Hardware Utilization : Incorporate CPU/GPU usage metrics as secondary objectives, adjusting traffic to maximize resource utilization while meeting availability.
Unified LB Framework : Modularize LB logic across model system modules, enabling end‑to‑end optimal compute scheduling.
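The PIN-based routing in the first step is classic consistent hashing, which can be sketched as a hash ring with virtual nodes (the node names, virtual-node count, and hash choice below are illustrative, not JD's implementation):

```python
import bisect
import hashlib

class ConsistentHash:
    """Consistent-hash ring with virtual nodes: the same user PIN always
    routes to the same node, and adding/removing a node only remaps
    roughly 1/N of the keys, preserving feature-cache hit rates."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for v in range(vnodes):
                bisect.insort(self.ring, (self._hash(f"{node}#{v}"), node))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_pin):
        # first virtual node clockwise from the key's hash
        i = bisect.bisect(self.keys, self._hash(user_pin)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHash(["node-a", "node-b", "node-c"])
print(ring.route("user_12345"))  # the same PIN always maps to the same node
```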
4.3 Dual‑Objective Feedback Mechanism
The strategy uses service availability as a primary goal and CPU/GPU utilization as a secondary goal. Nodes are divided into a “refuse list” (reduce traffic) and an “accept list” (increase traffic) based on their deviation from target metrics.
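A simplified sketch of this dual-objective split, with availability as the hard primary objective (the thresholds and exact decision rule below are illustrative assumptions, not the production logic):

```python
def classify_nodes(stats, avail_target, cpu_target):
    """Split nodes into a refuse list (shed traffic) and an accept list
    (take more traffic). Availability is the primary objective: a node
    below the availability target is refused no matter how idle it is."""
    refuse, accept = [], []
    for node, (availability, cpu_util) in stats.items():
        if availability < avail_target or cpu_util > cpu_target:
            refuse.append(node)
        else:
            accept.append(node)
    return refuse, accept

stats = {
    "n1": (0.999, 0.45),  # healthy and under target load -> accept
    "n2": (0.990, 0.30),  # below availability target     -> refuse
    "n3": (0.999, 0.85),  # healthy but over CPU target   -> refuse
}
print(classify_nodes(stats, avail_target=0.995, cpu_target=0.70))
```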
4.4 Availability‑Driven Protection
Nodes with success rates below cluster average are filtered; the system periodically updates average success rates and triggers degradation when trends worsen, restoring when conditions improve.
4.5 Progressive CPU Utilization Convergence
Nodes start in the refuse list with a zero adjustment ratio. The cluster‑wide average CPU load (load_ref) is computed; each node’s deviation from it (diff) drives an iterative update of its traffic‑adjustment ratio, gradually converging toward balanced utilization.
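One feedback round might be sketched as follows, assuming a linear update rule (the article does not publish the exact formulas; the step size and clamping are illustrative):

```python
def update_ratios(cpu_loads, ratios, step=0.05):
    """One feedback round: compare each node's CPU load with the cluster
    average (load_ref) and nudge its traffic-adjustment ratio toward it."""
    load_ref = sum(cpu_loads.values()) / len(cpu_loads)
    for node, load in cpu_loads.items():
        diff = load - load_ref
        # overloaded node -> lower ratio (shed traffic); underloaded -> raise it
        ratios[node] = max(0.0, min(1.0, ratios[node] - step * diff))
    return ratios

ratios = {"n1": 0.5, "n2": 0.5, "n3": 0.5}
loads = {"n1": 0.9, "n2": 0.5, "n3": 0.4}
print(update_ratios(loads, ratios))
```

Iterating this small step each round is what makes the convergence progressive rather than an abrupt rebalance.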
4.6 Convergence Domain & Weight Decay
Introduce a tolerance range [‑Δ_ref, Δ_ref] around the target load, allowing convergence to a region rather than a single point, and periodically decay weights to reduce impact on consistent‑hash distribution.
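This can be sketched by extending the per-node update with a tolerance band and a decay term (Δ_ref, the step size, and the decay factor below are illustrative assumptions):

```python
def adjust_with_tolerance(load, load_ref, ratio, delta_ref=0.05,
                          step=0.05, decay=0.9):
    """Converge to the region [load_ref - delta_ref, load_ref + delta_ref]
    rather than a single point, and decay the adjustment ratio toward
    neutral to limit drift from the base consistent-hash distribution."""
    diff = load - load_ref
    if abs(diff) <= delta_ref:
        return ratio * decay      # inside the band: decay toward 0 (neutral)
    return ratio - step * diff    # outside the band: keep converging

print(adjust_with_tolerance(0.62, 0.60, 0.10))  # in band -> ratio decays
print(adjust_with_tolerance(0.80, 0.60, 0.10))  # overloaded -> ratio reduced
```

Without the band, the ratio would oscillate around the target; without the decay, accumulated adjustments would permanently distort the consistent-hash mapping.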
4.7 Summary of Benefits
Balances service availability and performance metrics.
Gradual convergence with stable dynamic weighting.
Supports heterogeneous hardware (CPU, GPU) utilization.
Unified framework bridges internal and external services, achieving 10‑20%+ resource efficiency improvements during major promotions.
5. Experience & Lessons
5.1 Exception Handling in Weight Adjustment
Normalize weights to prevent divergence.
Exclude abnormal node data to ensure the system never degrades below its initial state.
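The normalization safeguard can be sketched as a clamp-and-renormalize step, assuming a per-node weight floor (the floor value is hypothetical):

```python
def normalize_weights(weights, floor=0.01):
    """Clamp each node weight to a safety floor, then renormalize so the
    weights sum to 1. This keeps iterative adjustments from diverging or
    starving a node entirely."""
    clamped = {n: max(w, floor) for n, w in weights.items()}
    total = sum(clamped.values())
    return {n: w / total for n, w in clamped.items()}

# n3's weight went negative through repeated adjustment; the floor rescues it
print(normalize_weights({"n1": 2.0, "n2": 1.0, "n3": -0.5}))
```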
5.2 Proactive Throttling under Performance Limits
Effective balancing assumes correlation between traffic changes and objectives; when nodes hit limits, correlation may break, requiring proactive throttling.
References
Wang G, Zhang L, Xu W. What Can We Learn from Four Years of Data Center Hardware Failures. IEEE Dependable Systems and Networks, 2017.
Yang Jixiang, Tan Guozhen, Wang Rongsheng. A Survey of Dynamic Load-Balancing Strategies for Parallel and Distributed Computing. Acta Electronica Sinica, 2010.
Mirrokni V, Thorup M, Zadimoghaddam M. Consistent Hashing with Bounded Loads, 2016.
https://developer.aliyun.com/article/1325514
JD Cloud Developers
JD Cloud Developers (the developer platform of JD Technology Group) offers technical sharing and communication for developers in AI, cloud computing, IoT, and related fields. It publishes JD product technical information, industry content, and tech event news, embracing technology and partnering with developers to envision the future.