How to Keep Recommendation Systems Stable During Sudden Traffic Surges

This article examines the challenges of handling high‑frequency, instantaneous traffic spikes in JD Alliance's recommendation system during major sales events and presents an adaptive, automated degradation and recovery framework that minimizes recommendation loss while maintaining system stability.

JD Cloud Developers

Background

JD's 618 shopping festival generates massive traffic spikes that stress all system layers. JD Alliance, the affiliate marketing platform, drives traffic via external CPS ads, leading to unpredictable, uneven, and rapid traffic fluctuations across hundreds of off‑site scenarios.

Problems & Challenges

Uncertain traffic forecasts: Resources are hard to provision in advance, which risks crashes.

Diverse scenario strategies: Recommendation models vary by scenario, which makes unified control difficult.

Instantaneous large-scale spikes: These require detection and adjustment within seconds.

Existing Techniques

Typical solutions (rate limiting, pre-written degradation plans, and auto-scaling) work for generic services but fall short for recommendation systems:

Rate limiting: Throttled users lose personalized recommendations entirely.

Pre-written degradation: Strategies are coarse, and manual execution is error-prone.

Auto-scaling: It depends on upstream services and takes minutes to respond, which is too slow for traffic peaks that arrive within a second.

Redefining the Problem

An adaptive capability is needed, featuring:

Scenario‑aware differentiated control.

Fully automated degradation and recovery without human intervention.

Real‑time monitoring and dynamic adjustment.

Intelligent post‑spike recovery to full recommendation.

Minimized recommendation loss through precise degradation.

Practical Solution

The design follows these steps:

Real-time performance sensing: Configure timeout thresholds per scenario and run guardian coroutines on each recommendation instance to collect response times and timeout rates.

Apply a Wilson confidence interval to correct timeout rates during low-traffic periods, when a handful of requests would otherwise produce misleading rates.

Scenario-specific control: Collect latency per scenario and enforce differentiated limits.

Fine-grained traffic slicing: Degrade only a portion of traffic, chosen according to timeout ratios and user activity levels.

Dynamic linear-programming routing: Optimize the mix of recall, coarse-ranking, fine-ranking, and re-ranking modules under latency constraints to maximize business value.

Real-time pipeline orchestration: Generate and schedule the actual call pipeline from the optimal module combination.

Small-traffic probing and staged recovery: Periodically test a subset of degraded traffic; if recovery succeeds, gradually restore full traffic.

Business-agnostic API: Provide a generic interface for profit and latency inputs, timeout settings, and degradation queries, enabling low-cost migration to other services.
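The traffic-slicing and staged-recovery steps can be sketched roughly as follows. This is an illustration, not the article's implementation: the class name DegradeController, the 5% threshold, the 10-point recovery stage, and the hash-based user bucketing are all assumptions, and the user-activity weighting mentioned above is left out for brevity.

```python
import hashlib


def bucket(user_id: str) -> float:
    """Map a user ID to a stable value in [0, 1) so slicing is consistent per user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


class DegradeController:
    """Tracks what percentage of one scenario's traffic is currently degraded."""

    def __init__(self, threshold: float = 0.05, step_pct: int = 10):
        self.threshold = threshold  # hypothetical timeout-rate limit
        self.step_pct = step_pct    # hypothetical size of one recovery stage
        self.degraded_pct = 0

    def update(self, timeout_rate: float) -> int:
        """Call once per probe period with the latest timeout rate."""
        if timeout_rate > self.threshold:
            # Overloaded: widen the degraded slice quickly.
            self.degraded_pct = min(100, self.degraded_pct + 2 * self.step_pct)
        else:
            # Probe looked healthy: hand back one stage of traffic.
            self.degraded_pct = max(0, self.degraded_pct - self.step_pct)
        return self.degraded_pct

    def is_degraded(self, user_id: str) -> bool:
        """Decide whether this user's request takes the degraded path."""
        return bucket(user_id) * 100 < self.degraded_pct


controller = DegradeController()
for rate in (0.20, 0.20, 0.01):  # two overloaded periods, then a healthy probe
    controller.update(rate)
print(controller.degraded_pct)   # 20, then 40, then back down to 30
```

Hashing the user ID keeps each user on the same side of the slice from one request to the next, so the degraded share changes only when the controller moves it.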

Key Components

1. Configurable timeout thresholds per recommendation path.

2. Guardian coroutine for real‑time timeout statistics.

3. Wilson confidence interval correction:

WilsonP = (P + z*z/(2n) - z * sqrt((P*(1-P)+z*z/(4n))/n)) / (1 + z*z/n)

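where P is the observed per-second timeout rate, n is the number of requests observed in that second, and z = 1.96 for 95% confidence. This is the lower bound of the Wilson score interval, so a few timeouts among very few requests do not trigger degradation on their own.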
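A minimal Python sketch of this correction is shown here. The function name and the choice to return 0 for an empty window are assumptions made for illustration.

```python
import math


def wilson_lower_bound(timeouts: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the timeout rate."""
    if total == 0:
        return 0.0  # assumption: an empty window gives no evidence of timeouts
    p = timeouts / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    # Clamp to guard against tiny negative values from floating-point rounding.
    return max(0.0, (centre - margin) / (1 + z2 / total))


# Same raw timeout rate (50%), very different amounts of evidence.
low_traffic = wilson_lower_bound(1, 2)         # about 0.095
high_traffic = wilson_lower_bound(500, 1000)   # about 0.469
```

With only two requests the corrected rate stays below 10%, while the same 50% rate over a thousand requests remains close to its raw value.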

Dynamic Linear Programming

Maximize total business profit under latency constraints:

max Σ Wi * Ei   subject to Σ Wi * Ti ≤ latency_limit, Wi ∈ {0,1}

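Wi: 1 if module i is included in the pipeline, 0 otherwise; Ei: profit of module i; Ti: current latency of module i; latency_limit: the latency budget for the scenario. Because each Wi is binary, this is strictly a 0-1 integer program, equivalent to a knapsack problem.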
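With only a handful of modules, the objective can be solved by enumerating every 0/1 assignment. The sketch below does exactly that; it is not the article's solver, the profit and latency figures are made up, and like the formula it ignores any dependencies between modules.

```python
from itertools import product


def best_module_mix(profit, latency_ms, latency_limit_ms):
    """Maximize sum(W_i * E_i) subject to sum(W_i * T_i) <= latency_limit_ms."""
    names = list(profit)
    best_profit = 0.0
    best_weights = {name: 0 for name in names}
    for bits in product((0, 1), repeat=len(names)):
        total_latency = sum(w * latency_ms[n] for w, n in zip(bits, names))
        if total_latency > latency_limit_ms:
            continue  # violates the latency constraint
        total_profit = sum(w * profit[n] for w, n in zip(bits, names))
        if total_profit > best_profit:
            best_profit = total_profit
            best_weights = dict(zip(names, bits))
    return best_profit, best_weights


# Hypothetical per-module profit (E_i) and current latency in ms (T_i).
profit = {"recall": 5.0, "coarse_rank": 2.0, "fine_rank": 4.0, "re_rank": 1.5}
latency_ms = {"recall": 20, "coarse_rank": 15, "fine_rank": 60, "re_rank": 25}

value, mix = best_module_mix(profit, latency_ms, latency_limit_ms=100)
print(value, mix)  # 11.0 with re_rank dropped to fit the 100 ms budget
```

Enumeration costs 2^n evaluations, which is negligible for four modules; a larger module set would call for a dynamic-programming knapsack or a proper integer-programming solver.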

Results

During the promotion period, the framework reduced traffic loss by more than 90%. Adaptive degradation took effect within seconds, and auto-scaling restored full service within minutes, with zero manual intervention and zero incidents.

Tags: recommendation system, linear programming, real-time monitoring, adaptive degradation, traffic spikes