Load Balancing Strategies for Heterogeneous Hardware Clusters in JD Advertising Online Model System
This article examines the evolution, theory, and practical implementation of load balancing strategies for JD Advertising's online model system, focusing on heterogeneous hardware clusters, dual‑objective optimization of service availability and resource utilization, and the resulting performance improvements in large‑scale production environments.
Load balancing is a perennial topic in distributed service architectures and is essential for improving resource utilization and service stability. This article traces the evolution of load-balancing strategies in JD Advertising's online model system and focuses on balancing load across heterogeneous hardware clusters so that compute can be scheduled optimally.
Background : Production environments rely heavily on distributed service clusters. These clusters run on containerized, heterogeneous nodes and must cope with unavoidable performance variance between nodes, a nonzero hardware error rate, and unpredictable traffic spikes during major promotions.
Problems : Uneven load lowers overall utilization, overload on a single node can trigger expansion of the whole cluster, hardware anomalies reduce availability, and traffic volatility threatens stability.
Requirements : Design a reasonable load‑balancing (LB) strategy to boost resource utilization and service stability, especially under complex promotional traffic.
Theoretical Foundations : Load balancing can be static (assignments fixed in advance) or dynamic (assignments adjusted from load measured at runtime). In both cases the goal is to map tasks to machines so that overall execution time is minimized.
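The difference between the two families can be sketched in a few lines of Python. This is an illustration only: the function names, the task costs, and the choice of round-robin and least-loaded as representatives of each family are assumptions, not part of JD's system.

```python
from itertools import cycle


def static_round_robin(tasks, machines):
    """Static: the mapping is fixed up front and ignores runtime load."""
    return {task: machine for task, machine in zip(tasks, cycle(machines))}


def dynamic_least_loaded(task_costs, machines):
    """Dynamic: each task goes to the machine with the least accumulated load."""
    load = {m: 0.0 for m in machines}
    assignment = {}
    for task, cost in task_costs.items():
        target = min(machines, key=lambda m: load[m])
        assignment[task] = target
        load[target] += cost
    return assignment, load


def makespan(assignment, task_costs):
    """Finish time of the busiest machine under a given assignment."""
    totals = {}
    for task, machine in assignment.items():
        totals[machine] = totals.get(machine, 0.0) + task_costs[task]
    return max(totals.values())


# Hypothetical workload: one expensive task and three cheap ones.
costs = {"a": 5, "b": 1, "c": 1, "d": 1}
nodes = ["m1", "m2"]
static_plan = static_round_robin(list(costs), nodes)
dynamic_plan, dynamic_load = dynamic_least_loaded(costs, nodes)
```

With this workload the static plan stacks a cheap task on top of the expensive one, while the dynamic plan routes all cheap tasks to the other machine and finishes earlier.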
Load‑Balancing Strategy Overview :
Distributed strategies: nodes exchange load with their neighbors (diffusion, the dimension exchange method (DEM), and the gradient model (GM)).
Centralized strategies: a designated processor collects global load info and makes decisions.
Hybrid/Hierarchical strategies: build a hierarchical tree to perform multi‑level balancing across domains.
Algorithm Levels :
System‑level: DNS load balancing, Nginx (layer‑7), LVS/F5 (layer‑4).
Application-level: Ribbon (client-side load balancing for HTTP/TCP calls), Dubbo (load balancing built into the RPC framework's service invocation).
Practical Evolution :
Adapt LB to service‑specific characteristics (e.g., consistent‑hash for cache‑heavy feature services).
Introduce service‑availability as a primary objective, adjusting traffic based on node availability.
Add heterogeneous hardware utilization (CPU/GPU) as a secondary objective, forming a multi‑goal hierarchical strategy.
Unify LB frameworks across internal and external modules to eliminate silos and enable optimal compute scheduling.
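The consistent-hash routing mentioned for cache-heavy feature services can be illustrated with a minimal hash ring. This is a generic textbook sketch, not JD's implementation: the class name `HashRing`, the `vnodes` parameter, the node names, and the use of SHA-256 are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring at `vnodes` positions.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def lookup(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


keys = [f"feature:{i}" for i in range(200)]
full_ring = HashRing(["n1", "n2", "n3"])
before = {k: full_ring.lookup(k) for k in keys}

# Removing n3 only remaps the keys that n3 used to own.
smaller_ring = HashRing(["n1", "n2"])
after = {k: smaller_ring.lookup(k) for k in keys}
```

Because a key keeps its node unless that node leaves the ring, per-node caches stay warm when the cluster changes, which is why this scheme suits cache-heavy services.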
Dual-Objective LB Strategy : The strategy maintains a "refuse list" and an "accept list" that separate nodes according to their service-availability and resource-utilization metrics, and it adjusts each node's traffic ratio in proportion to that node's degree of imbalance.
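A rough Python sketch of the list-based adjustment follows. The source does not define how the two lists are populated, so the sketch assumes that nodes above the cluster-average metric go on the refuse list (they shed traffic) and nodes below it go on the accept list (they absorb it). The function name `rebalance`, the `step` parameter, and the exact proportional formula are hypothetical.

```python
def rebalance(ratios, loads, step=0.5):
    """Shift traffic from overloaded nodes to underloaded ones.

    ratios: node -> share of traffic (sums to 1)
    loads:  node -> observed metric, e.g. CPU utilization
    Returns (new_ratios, refuse_list, accept_list).
    """
    avg = sum(loads.values()) / len(loads)
    # Assumed semantics: refuse = above average, accept = below average.
    refuse = {n: loads[n] - avg for n in loads if loads[n] > avg}
    accept = {n: avg - loads[n] for n in loads if loads[n] < avg}
    new_ratios = dict(ratios)
    if not refuse or not accept:
        return new_ratios, sorted(refuse), sorted(accept)

    # Each refuse-list node sheds traffic in proportion to its relative excess.
    shed_total = 0.0
    for node, excess in refuse.items():
        delta = new_ratios[node] * step * excess / loads[node]
        new_ratios[node] -= delta
        shed_total += delta

    # Accept-list nodes absorb the shed traffic in proportion to their deficit.
    deficit_total = sum(accept.values())
    for node, deficit in accept.items():
        new_ratios[node] += shed_total * deficit / deficit_total
    return new_ratios, sorted(refuse), sorted(accept)


ratios = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
loads = {"a": 0.9, "b": 0.6, "c": 0.3}
new_ratios, refuse_list, accept_list = rebalance(ratios, loads)
```

In this example node `a` gives up part of its share, node `c` receives it, and node `b`, which sits exactly at the average, is left alone. The total traffic share is preserved.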
Stages :
Stage‑1: Compute per‑node success/failure counts to obtain average availability.
Stage‑2: Aggregate to get cluster‑wide average availability as the target.
Stage‑3: If a node meets the availability target, proceed to CPU‑utilization balancing; otherwise, continue availability balancing.
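The three stages can be sketched as follows. The function name `plan_objectives`, the sample counts, the use of a plain mean of per-node values as the cluster target, and the choice to treat a node with no traffic as fully available are assumptions made for illustration.

```python
def plan_objectives(counts):
    """counts: node -> (successes, failures) observed in the last window."""
    # Stage 1: per-node availability from success/failure counts.
    availability = {
        node: (ok / (ok + failed) if (ok + failed) else 1.0)
        for node, (ok, failed) in counts.items()
    }
    # Stage 2: cluster-wide average availability becomes the target.
    target = sum(availability.values()) / len(availability)
    # Stage 3: nodes at or above the target move on to CPU balancing;
    # the rest stay in availability balancing.
    plan = {
        node: "cpu_utilization" if value >= target else "availability"
        for node, value in availability.items()
    }
    return availability, target, plan


counts = {"a": (99, 1), "b": (90, 10), "c": (100, 0)}
availability, target, plan = plan_objectives(counts)
```

Here node `b` falls below the cluster average and keeps being balanced on availability, while `a` and `c` graduate to the utilization objective.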
Active Protection : Periodically monitor success‑rate trends; trigger degradation when the rate drops and recover when it improves.
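One way to picture the trend-based trigger is a small state machine over a sliding window of success rates. The class name `ActiveProtector`, the window size, and both thresholds are hypothetical values chosen for the example, not figures from the source.

```python
from collections import deque


class ActiveProtector:
    """Degrade when the success rate trends down, recover when it trends up."""

    def __init__(self, window=3, drop=0.05, recover=0.02):
        self.history = deque(maxlen=window)
        self.drop = drop          # fall across the window that triggers degradation
        self.recover = recover    # rise across the window that lifts it
        self.degraded = False

    def observe(self, success_rate):
        self.history.append(success_rate)
        if len(self.history) < self.history.maxlen:
            return self.degraded  # not enough samples to judge a trend yet
        trend = self.history[-1] - self.history[0]
        if not self.degraded and trend <= -self.drop:
            self.degraded = True
        elif self.degraded and trend >= self.recover:
            self.degraded = False
        return self.degraded


protector = ActiveProtector()
states = [protector.observe(rate) for rate in (0.99, 0.98, 0.90, 0.91, 0.95)]
```

Using separate thresholds for degrading and recovering gives some hysteresis, so a node does not flap between the two states on small fluctuations.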
Resource-Utilization Convergence : Initialize all nodes in the refuse list, collect CPU-load feedback from each node, compute the cluster average, and iteratively update each node's flow ratio using the strategy's update equations, so that node loads converge gradually toward the average.
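The feedback loop can be simulated in a few lines. The source's update equations are not reproduced here, so the multiplicative rule below is an assumed stand-in, and the linear load model (CPU load equals traffic share divided by capacity), the capacities, and the names `update_ratios` and `simulate` are likewise hypothetical.

```python
def update_ratios(ratios, cpu, eta=0.5):
    """One feedback round: nudge each node's ratio toward the cluster-average CPU."""
    avg = sum(cpu.values()) / len(cpu)
    raw = {n: ratios[n] * (1 + eta * (avg - cpu[n]) / avg) for n in ratios}
    total = sum(raw.values())
    return {n: value / total for n, value in raw.items()}  # renormalize to 1


def simulate(capacity, rounds=30, demand=1.0, eta=0.5):
    """Toy model: a node's CPU load is its traffic share divided by its capacity."""
    ratios = {n: 1.0 / len(capacity) for n in capacity}
    spreads = []
    for _ in range(rounds):
        cpu = {n: demand * ratios[n] / capacity[n] for n in capacity}
        spreads.append(max(cpu.values()) - min(cpu.values()))
        ratios = update_ratios(ratios, cpu, eta)
    cpu = {n: demand * ratios[n] / capacity[n] for n in capacity}
    return ratios, cpu, spreads


# Hypothetical heterogeneous pair: the newer node has twice the capacity.
ratios, cpu, spreads = simulate({"old": 1.0, "new": 2.0})
```

Starting from an even split, the gap between the two nodes' CPU loads roughly halves each round in this toy model, and the ratios settle near one third and two thirds, matching the capacity difference.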
Convergence Domain & Weight Decay : Introduce a tolerance interval around the load target (the convergence domain) and decay the adjustment weights. Nodes inside the interval are left alone, which limits disruption to consistent hashing while keeping the system stable.
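A minimal sketch of both ideas, under stated assumptions: the convergence domain is modeled as a relative dead band around the target, and weight decay is modeled as a step size that shrinks geometrically with each round. The function name `adjust_ratio` and all parameter values are hypothetical.

```python
def adjust_ratio(ratio, cpu, target, round_idx,
                 tolerance=0.05, eta0=0.5, decay=0.8):
    """Return the node's next traffic ratio.

    Inside the tolerance band the ratio is left untouched, so the hash ring
    is not reshuffled for small deviations. Outside it, the correction
    shrinks with each round (assumed geometric decay of the step size).
    """
    deviation = (target - cpu) / target
    if abs(deviation) <= tolerance:
        return ratio
    eta = eta0 * decay ** round_idx
    return ratio * (1 + eta * deviation)


inside = adjust_ratio(0.2, cpu=0.52, target=0.5, round_idx=0)
early = adjust_ratio(0.2, cpu=0.75, target=0.5, round_idx=0)
late = adjust_ratio(0.2, cpu=0.75, target=0.5, round_idx=3)
```

A node 4% above target is not touched at all, while a node 50% above target is corrected strongly in round 0 and more gently by round 3, which trades a little convergence speed for less cache churn.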
Benefits :
Simultaneously balances service availability and CPU/GPU utilization.
Gradual convergence yields stable dynamic adjustments and supports convergence domains.
Results :
During the 2022 618 promotion, overall machine resource utilization improved by over 10%.
During the 2022 11.11 promotion, CPU usage variance was halved and the cache miss rate stayed within 2%, for a resource gain of more than 15%.
GPU-aware balancing extended the optimization to thousands of cores, improving CPU utilization by more than 20% ahead of the 2024 618 promotion.
JD Retail Technology