How JD’s Advertising Platform Optimizes Load Balancing for Heterogeneous Clusters
Exploring the evolution of JD’s advertising online model system, this article examines the challenges of heterogeneous hardware load balancing, outlines static and dynamic strategies—including DNS, Nginx, LVS, Ribbon, and Dubbo—and presents a multi‑objective framework that improves service availability and resource utilization, achieving up to 20%+ efficiency gains.
Load balancing is a perennial topic in distributed service architectures and is essential for improving resource utilization and service stability. This article traces the evolution of the load‑balancing strategies in JD’s advertising online model system, focusing on load balancing across heterogeneous hardware clusters under an optimal compute‑scheduling objective.
1. Background
1.1 Current Status
Production environments rely heavily on distributed service clusters.
Containerized deployment of heterogeneous nodes leads to inevitable performance imbalance.
Hardware component failure rates are unavoidable; upper‑layer applications must consider fault tolerance.
Traffic spikes during promotions are hard to predict, requiring a balance between stability and resource cost.
1.2 Problems
Load imbalance within the cluster results in low overall resource utilization.
Single‑node overload can trigger cluster‑wide scaling.
Occasional hardware anomalies (CPU, NIC, memory) affect overall service availability.
Unpredictable traffic during large promotions threatens service stability.
1.3 Requirements
Design a reasonable load‑balancing (LB) strategy to improve cluster resource utilization and service stability, effectively handling complex, variable traffic during major promotions.
2. Theoretical Foundations
2.1 General Load‑Balancing Problem
Load balancing is a key technology for improving system resource utilization and parallel computing performance, classified into static and dynamic types. Static LB pre‑assigns load before execution; dynamic LB (DLB) determines load division at runtime.
Given a set of compute‑communication tasks and a topology‑connected set of machines, the goal is to map tasks to machines to minimize total execution time.
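The mapping problem above can be sketched with a classic greedy heuristic, longest processing time first: repeatedly assign the largest remaining task to the currently least-loaded machine. The task costs and machine count below are illustrative, not from the article.

```python
import heapq

def greedy_schedule(task_costs, n_machines):
    """LPT heuristic: assign each task (largest first) to the currently
    least-loaded machine, approximating the minimal makespan."""
    # min-heap of (current_load, machine_id)
    heap = [(0.0, m) for m in range(n_machines)]
    heapq.heapify(heap)
    assignment = {}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, machine = heapq.heappop(heap)
        assignment[task] = machine
        heapq.heappush(heap, (load + cost, machine))
    return assignment

tasks = {"t1": 5.0, "t2": 3.0, "t3": 3.0, "t4": 2.0, "t5": 2.0}
print(greedy_schedule(tasks, 2))
```

This is a static strategy: it assumes task costs are known up front, which is exactly the assumption that dynamic strategies relax.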
2.2 Load‑Balancing Strategy Summary
2.2.1 Distributed Strategies
Basic neighbor exchange methods (diffusion, dimension exchange, gradient) achieve global balance through iterative local exchanges.
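The diffusion method can be sketched in a few lines: each node repeatedly moves a fraction of the load difference toward each neighbor, so load equalizes without any global coordinator. The ring topology, initial loads, and diffusion coefficient below are illustrative.

```python
def diffusion_step(loads, neighbors, alpha=0.25):
    """One round of the diffusion method: each node sheds a fraction alpha
    of the load difference toward each of its neighbors."""
    new = list(loads)
    for i in range(len(loads)):
        for j in neighbors[i]:
            new[i] -= alpha * (loads[i] - loads[j])
    return new

# ring of 4 nodes with hypothetical initial loads
loads = [10.0, 2.0, 4.0, 4.0]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for _ in range(50):
    loads = diffusion_step(loads, ring)
print(loads)  # converges toward the cluster average (5.0 per node)
```

Because every pairwise exchange is symmetric, total load is conserved at every step; only its distribution changes.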
2.2.2 Centralized Strategies
A designated processor collects global load information and makes balancing decisions.
2.2.3 Hybrid/Hierarchical Strategies
Hierarchical trees built on topology enable multi‑level balancing: nodes are grouped into domains, each domain balances internally, and higher levels balance among domains, culminating in global balance at the root.
3. Algorithm Hierarchy
3.1 System‑Level Load Balancing
DNS Load Balancing: Resolves a domain to multiple IPs; DNS servers select an IP based on a policy to distribute traffic.
Nginx Load Balancing: Acts as a reverse proxy, distributing requests to multiple back‑ends to improve response speed and reliability.
LVS/F5 + Nginx: Combines layer‑4 (LVS/F5) and layer‑7 (Nginx) balancing for high‑throughput scenarios.
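As an illustration (not JD’s actual configuration), a layer‑7 upstream in Nginx sitting behind an L4 VIP might look like the following; addresses and weights are hypothetical:

```nginx
# Layer-7 balancing behind an L4 VIP (LVS/F5); backends are illustrative.
upstream model_service {
    least_conn;                       # dynamic: prefer the least-busy backend
    server 10.0.0.11:8080 weight=3;   # heterogeneous nodes can carry weights
    server 10.0.0.12:8080 weight=1;
    server 10.0.0.13:8080 backup;     # used only when the primaries fail
}

server {
    listen 80;
    location / {
        proxy_pass http://model_service;
    }
}
```

The `weight` directive is the simplest static answer to heterogeneous hardware; the rest of the article is about what to do when static weights are not enough.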
3.2 Application‑Level Load Balancing
Ribbon: Client‑side HTTP/TCP load balancer offering strategies such as random, round‑robin, and least‑active.
Dubbo: Distributed service framework providing load‑balancing strategies (random, round‑robin, least‑active) via configuration or code.
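For example, Dubbo exposes the strategy through the `loadbalance` attribute, settable per service or per method (the interface and method names below are illustrative):

```xml
<!-- Per-reference strategy; the interface name is illustrative. -->
<dubbo:reference id="modelService"
                 interface="com.example.ModelService"
                 loadbalance="leastactive" />

<!-- Or per method, overriding the service-level default. -->
<dubbo:service interface="com.example.ModelService" loadbalance="roundrobin">
    <dubbo:method name="predict" loadbalance="random" />
</dubbo:service>
```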
4. Practice – Evolution of JD’s Model System LB Strategy
4.1 Common Strategies
Static: Round Robin, Random – simple, fast, assume homogeneous nodes.
Dynamic: Least Connections, Locality‑Aware – adjust based on real‑time node feedback.
4.2 Evolution Steps
Adaptation to Service Characteristics : For online feature services, use consistent‑hash based on user PIN to maintain cache hit rate.
Introduce Availability Target : Monitor per‑node success/failure rates, adjust traffic when node availability falls below cluster average, trigger degradation and recovery.
Add Heterogeneous Hardware Utilization : Incorporate CPU/GPU usage metrics as secondary objectives, adjusting traffic to maximize resource utilization while meeting availability.
Unified LB Framework : Modularize LB logic across model system modules, enabling end‑to‑end optimal compute scheduling.
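The PIN-based routing in the first step is classic consistent hashing, which can be sketched as a hash ring with virtual nodes (the node names, virtual-node count, and hash choice below are illustrative, not JD's implementation):

```python
import bisect
import hashlib

class ConsistentHash:
    """Consistent-hash ring with virtual nodes: the same user PIN always
    routes to the same node, and adding/removing a node only remaps
    roughly 1/N of the keys, preserving feature-cache hit rates."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for v in range(vnodes):
                bisect.insort(self.ring, (self._hash(f"{node}#{v}"), node))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, user_pin):
        # first virtual node clockwise from the key's hash
        i = bisect.bisect(self.keys, self._hash(user_pin)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHash(["node-a", "node-b", "node-c"])
print(ring.route("user_12345"))  # the same PIN always maps to the same node
```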
4.3 Dual‑Objective Feedback Mechanism
The strategy uses service availability as a primary goal and CPU/GPU utilization as a secondary goal. Nodes are divided into a “refuse list” (reduce traffic) and an “accept list” (increase traffic) based on their deviation from target metrics.
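A simplified sketch of this dual-objective split, with availability as the hard primary objective (the thresholds and exact decision rule below are illustrative assumptions, not the production logic):

```python
def classify_nodes(stats, avail_target, cpu_target):
    """Split nodes into a refuse list (shed traffic) and an accept list
    (take more traffic). Availability is the primary objective: a node
    below the availability target is refused no matter how idle it is."""
    refuse, accept = [], []
    for node, (availability, cpu_util) in stats.items():
        if availability < avail_target or cpu_util > cpu_target:
            refuse.append(node)
        else:
            accept.append(node)
    return refuse, accept

stats = {
    "n1": (0.999, 0.45),  # healthy and under target load -> accept
    "n2": (0.990, 0.30),  # below availability target     -> refuse
    "n3": (0.999, 0.85),  # healthy but over CPU target   -> refuse
}
print(classify_nodes(stats, avail_target=0.995, cpu_target=0.70))
```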
4.4 Availability‑Driven Protection
Nodes with success rates below cluster average are filtered; the system periodically updates average success rates and triggers degradation when trends worsen, restoring when conditions improve.
4.5 Progressive CPU Utilization Convergence
Nodes start in the refuse list with a zero adjustment ratio. The cluster‑wide average CPU load (load_ref) is computed; each node’s deviation from it (diff) drives an iterative update of its traffic‑adjustment ratio, gradually converging toward balanced utilization.
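One feedback round might be sketched as follows, assuming a linear update rule (the article does not publish the exact formulas; the step size and clamping are illustrative):

```python
def update_ratios(cpu_loads, ratios, step=0.05):
    """One feedback round: compare each node's CPU load with the cluster
    average (load_ref) and nudge its traffic-adjustment ratio toward it."""
    load_ref = sum(cpu_loads.values()) / len(cpu_loads)
    for node, load in cpu_loads.items():
        diff = load - load_ref
        # overloaded node -> lower ratio (shed traffic); underloaded -> raise it
        ratios[node] = max(0.0, min(1.0, ratios[node] - step * diff))
    return ratios

ratios = {"n1": 0.5, "n2": 0.5, "n3": 0.5}
loads = {"n1": 0.9, "n2": 0.5, "n3": 0.4}
print(update_ratios(loads, ratios))
```

Iterating this small step each round is what makes the convergence progressive rather than an abrupt rebalance.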
4.6 Convergence Domain & Weight Decay
Introduce a tolerance range [‑Δ_ref, Δ_ref] around the target load, allowing convergence to a region rather than a single point, and periodically decay weights to reduce impact on consistent‑hash distribution.
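This can be sketched by extending the per-node update with a tolerance band and a decay term (Δ_ref, the step size, and the decay factor below are illustrative assumptions):

```python
def adjust_with_tolerance(load, load_ref, ratio, delta_ref=0.05,
                          step=0.05, decay=0.9):
    """Converge to the region [load_ref - delta_ref, load_ref + delta_ref]
    rather than a single point, and decay the adjustment ratio toward
    neutral to limit drift from the base consistent-hash distribution."""
    diff = load - load_ref
    if abs(diff) <= delta_ref:
        return ratio * decay      # inside the band: decay toward 0 (neutral)
    return ratio - step * diff    # outside the band: keep converging

print(adjust_with_tolerance(0.62, 0.60, 0.10))  # in band -> ratio decays
print(adjust_with_tolerance(0.80, 0.60, 0.10))  # overloaded -> ratio reduced
```

Without the band, the ratio would oscillate around the target; without the decay, accumulated adjustments would permanently distort the consistent-hash mapping.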
4.7 Summary of Benefits
Balances service availability and performance metrics.
Gradual convergence with stable dynamic weighting.
Supports heterogeneous hardware (CPU, GPU) utilization.
Unified framework bridges internal and external services, achieving 10‑20%+ resource efficiency improvements during major promotions.
5. Experience & Lessons
5.1 Exception Handling in Weight Adjustment
Normalize weights to prevent divergence.
Exclude abnormal node data to ensure the system never degrades below its initial state.
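The normalization safeguard can be sketched as a clamp-and-renormalize step, assuming a per-node weight floor (the floor value is hypothetical):

```python
def normalize_weights(weights, floor=0.01):
    """Clamp each node weight to a safety floor, then renormalize so the
    weights sum to 1. This keeps iterative adjustments from diverging or
    starving a node entirely."""
    clamped = {n: max(w, floor) for n, w in weights.items()}
    total = sum(clamped.values())
    return {n: w / total for n, w in clamped.items()}

# n3's weight went negative through repeated adjustment; the floor rescues it
print(normalize_weights({"n1": 2.0, "n2": 1.0, "n3": -0.5}))
```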
5.2 Proactive Throttling under Performance Limits
Effective balancing assumes correlation between traffic changes and objectives; when nodes hit limits, correlation may break, requiring proactive throttling.
References
Wang G, Zhang L, Xu W. What Can We Learn from Four Years of Data Center Hardware Failures. IEEE Dependable Systems and Networks, 2017.
Yang Jixiang, Tan Guozhen, Wang Rongsheng. A Survey of Dynamic Load-Balancing Strategies for Parallel and Distributed Computing. Acta Electronica Sinica, 2010.
Mirrokni V, Thorup M, Zadimoghaddam M. Consistent Hashing with Bounded Loads, 2016.
https://developer.aliyun.com/article/1325514
JD Cloud Developers
JD Cloud Developers (the developer platform of JD Technology Group) offers technical sharing and communication for developers in AI, cloud computing, IoT, and related fields. It publishes JD product technical information, industry content, and tech event news, embracing technology and partnering with developers to envision the future.