Evolution of Load Balancing Strategies in JD Advertising Online Model System
This article examines the progression of load‑balancing techniques used in JD's advertising online model system, analyzing current challenges, outlining requirements, reviewing static and dynamic strategies, and presenting a multi‑objective, hierarchical approach that improves service availability, resource utilization, and overall system stability.
Load balancing is a core concern in distributed service architectures and is essential for improving resource utilization and service stability in online clusters. This article traces the evolution of load‑balancing strategies in JD's advertising online model system, with a focus on optimal compute scheduling for heterogeneous hardware clusters.
Background
Complex business systems depend heavily on distributed service clusters.
Containerized deployment of heterogeneous nodes leads to performance imbalance.
Hardware component failures are unavoidable, so the design must be fault‑tolerant.
Traffic spikes during promotions demand a balance between stability and resource cost.
Problems
Load imbalance results in low overall resource utilization.
Overload on a single node can trigger expansion of the whole cluster.
Node hardware failures affect overall service availability.
Unpredictable traffic changes cause stability issues.
Requirements
Design a reasonable load‑balancing (LB) strategy to improve resource utilization and service stability, effectively handling complex, variable traffic during large promotions.
Theoretical Foundations
Load balancing can be static (the assignment is determined in advance) or dynamic (the assignment is adjusted from load measured at runtime). In both cases the goal is to map tasks to machines so that execution time is minimized.
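One common way to write this goal down is as a makespan minimization. The symbols below are illustrative assumptions and do not come from the original system: $n$ tasks with costs $c_i$, $m$ machines with relative speeds $s_j$, and an assignment $a(i)$ that maps task $i$ to a machine.

```latex
\[
  \min_{a}\; \max_{1 \le j \le m}\; \frac{1}{s_j} \sum_{i \,:\, a(i) = j} c_i
\]
```

Under this reading, a static strategy fixes $a$ before execution from estimated $c_i$ and $s_j$, while a dynamic strategy keeps revising $a$ as measured values change.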
Load‑Balancing Strategy Summary
Distributed Strategies: Neighbor‑exchange methods such as diffusion, the dimension exchange method (DEM), and the gradient method (GM), in which each processor balances load only with its neighbors.
Centralized Strategies: A designated processor collects global load information and makes the balancing decisions.
Hybrid/Hierarchical Strategies: Hierarchical trees are used to perform multi‑level balancing across groups of processors.
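To make the neighbor‑exchange idea concrete, here is a minimal sketch of the diffusion method in Python. It is a textbook‑style illustration, not the production implementation; the function names, the ring topology, and the `alpha` exchange factor are all assumptions chosen for the example.

```python
def diffusion_step(loads, neighbors, alpha=0.25):
    """One synchronous diffusion round.

    Each node moves a fraction `alpha` of the load difference between
    itself and every neighbor. With symmetric neighbor lists the total
    load is conserved.
    """
    new_loads = list(loads)
    for i, nbrs in neighbors.items():
        for j in nbrs:
            new_loads[i] += alpha * (loads[j] - loads[i])
    return new_loads


def diffuse(loads, neighbors, alpha=0.25, rounds=50):
    """Repeat diffusion rounds; loads drift toward the cluster average."""
    for _ in range(rounds):
        loads = diffusion_step(loads, neighbors, alpha)
    return loads


# Hypothetical topology: four nodes in a ring, all load starts on node 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
balanced = diffuse([100.0, 0.0, 0.0, 0.0], ring)
print(balanced)
```

No node needs a global view here: each one reads only its neighbors' loads, which is what separates the distributed strategies from the centralized one.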
Algorithm Levels
System‑level LB: DNS load balancing, Nginx reverse‑proxy load balancing, LVS/F5 combined with Nginx.
Application‑level LB: Ribbon (client‑side) and Dubbo (service‑side) with strategies such as random, round‑robin, least‑connections, and locality‑aware.
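Two of the application‑level strategies named above, round‑robin and least‑connections, can be sketched in a few lines of Python. The class and method names are hypothetical and are not the Ribbon or Dubbo APIs; the sketch only shows the selection rule each strategy applies.

```python
import itertools


class RoundRobin:
    """Hands out nodes in a fixed rotating order."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Picks the node with the fewest in-flight requests."""

    def __init__(self, nodes):
        self.active = {node: 0 for node in nodes}

    def pick(self):
        # Ties go to the node that was registered first.
        node = min(self.active, key=self.active.get)
        self.active[node] += 1
        return node

    def release(self, node):
        # Call when the request to `node` has finished.
        self.active[node] -= 1


rr = RoundRobin(["a", "b", "c"])
rr_order = [rr.pick() for _ in range(4)]

lc = LeastConnections(["a", "b", "c"])
first, second = lc.pick(), lc.pick()
lc.release(first)
third = lc.pick()
```

Round‑robin ignores node state entirely, whereas least‑connections reacts to how long requests take, which matters more when nodes have different hardware.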
Evolution Steps
Step 1 – Business‑Specific Adaptation: Use consistent hashing on the user PIN so that requests from the same user reach the same node, which maintains the cache hit rate.
Step 2 – Availability Target: Introduce real‑time node availability metrics. Nodes whose availability falls below the cluster average receive a smaller share of traffic, and when availability drops across the whole cluster, degradation protection can be triggered.
Step 3 – Heterogeneous Hardware Utilization: Add CPU/GPU utilization as secondary objectives, using a two‑level feedback loop so that resource usage across nodes gradually converges.
Step 4 – Unified LB Framework: Modularize the LB logic so that internal and external services share one framework, eliminating isolated compute islands.
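Steps 1 to 3 can be illustrated with a small Python sketch: a consistent‑hash ring keyed by user PIN, plus one round of weight feedback that treats availability as the primary signal and CPU utilization as the secondary one. This is an assumption‑laden reading of the steps, not JD's implementation. `HashRing`, `adjust_weights`, and the parameters `vnodes_per_unit`, `step`, `floor`, and `avail_tol` are all hypothetical, and cluster‑wide degradation protection is not modeled.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; a node's weight sets its number of virtual nodes."""

    def __init__(self, weights, vnodes_per_unit=100):
        points = []
        for node, weight in weights.items():
            for i in range(max(1, round(weight * vnodes_per_unit))):
                points.append((_hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    def lookup(self, pin: str) -> str:
        """Return the node owning the first ring point after the PIN's hash."""
        idx = bisect.bisect_right(self._hashes, _hash(pin)) % len(self._hashes)
        return self._nodes[idx]


def adjust_weights(weights, availability, cpu_util,
                   step=0.1, floor=0.1, avail_tol=0.001):
    """One feedback round over per-node weights.

    Level 1: nodes below the mean availability lose weight.
    Level 2: healthy nodes are nudged against their CPU deviation from
    the mean, so busy nodes shed traffic and idle nodes gain it.
    """
    mean_avail = sum(availability.values()) / len(availability)
    mean_cpu = sum(cpu_util.values()) / len(cpu_util)
    adjusted = {}
    for node, weight in weights.items():
        if availability[node] < mean_avail - avail_tol:
            weight *= 1 - step
        else:
            weight *= 1 + step * (mean_cpu - cpu_util[node])
        adjusted[node] = max(floor, weight)
    return adjusted


weights = {"a": 1.0, "b": 1.0, "c": 1.0}
ring = HashRing(weights)
owner = ring.lookup("user-pin-42")  # the same PIN always maps to the same node

new_weights = adjust_weights(
    weights,
    availability={"a": 0.999, "b": 0.999, "c": 0.90},
    cpu_util={"a": 0.8, "b": 0.4, "c": 0.6},
)
ring = HashRing(new_weights)  # rebuild the ring with the adjusted weights
```

Tying the weight to the virtual‑node count means a weight change moves only a proportional slice of PINs, so most users keep their cached node between feedback rounds.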
Effect Demonstration
During the 618 promotion in 2022, the model‑estimation service cluster achieved an improvement of more than 10% in machine resource utilization. Deployments in subsequent promotions yielded gains of 15%–20% and halved the variance of CPU load.
References
Wang G, Zhang L, Xu W. What Can We Learn from Four Years of Data Center Hardware Failures. IEEE DSN 2017.
Yang JX et al. Survey of Dynamic Load Balancing Strategies for Parallel and Distributed Computing. J. Electronics 2010.
Mirrokni V, Thorup M, Zadimoghaddam M. Consistent Hashing with Bounded Loads. 2016.
https://developer.aliyun.com/article/1325514.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.