Designing High‑Quality Service Architecture Under Traffic Peaks: Load Balancing, Rate Limiting, Retries, Timeouts, and Failure Mitigation
Drawing on Google SRE principles, Bilibili’s technical director outlines a systematic, cloud‑native framework for high‑quality service architecture during traffic peaks, covering frontend and internal load balancing, distributed rate limiting, controlled retries, fail‑fast timeouts, and comprehensive failure‑mitigation strategies.
In this article, Bilibili's technical director Mao Jian shares insights from a Cloud+ Community online salon, discussing systematic approaches to high‑quality service architecture under traffic peaks, drawing from Google SRE principles.
Load Balancing: The talk distinguishes frontend load balancing (DNS‑based, minimizing user latency via CDN and BFE routing) from internal data‑center load balancing (aiming for balanced CPU usage across nodes). Key considerations include selecting the nearest node, bandwidth‑aware API routing, and balancing based on service capacity. Problems of uneven load and CPU disparity are illustrated with diagrams.
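The internal goal described here, keeping CPU usage even across nodes, is often approximated with a "power of two choices" picker rather than always selecting the globally least-loaded node. A minimal sketch (the node addresses and load values are purely illustrative, not from the talk):

```python
import random

def p2c_pick(nodes, load):
    """Power-of-two-choices: sample two nodes at random and route to the
    less loaded one. Comparing a random pair, instead of always taking the
    global minimum, avoids herding every request onto one node when the
    reported load figures are slightly stale."""
    a, b = random.sample(nodes, 2)
    return a if load[a] <= load[b] else b

# Hypothetical node addresses and reported CPU loads (0..1).
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
load = {"10.0.0.1": 0.90, "10.0.0.2": 0.35, "10.0.0.3": 0.60}
picked = p2c_pick(nodes, load)
```

Because the most loaded node can only lose a pairwise comparison, it receives traffic far less often, and load evens out without any global coordination.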
Rate Limiting: To prevent overload, a distributed quota server is introduced, employing a max‑min fair allocation algorithm with client‑side enforcement. The strategy includes per‑client quotas, penalty values for new nodes, and statistical decay to let penalized nodes recover. Overload protection uses CPU sliding‑window thresholds and adaptive throttling.
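The max‑min fair idea can be sketched in a few lines: clients demanding less than the current fair share keep their full demand, and the leftover capacity is redistributed among the rest. This is a generic illustration of the algorithm, not the quota server's actual implementation:

```python
def max_min_fair(capacity, demands):
    """Allocate `capacity` across clients max-min fairly.

    demands: dict client -> requested rate. Small demands are fully
    satisfied; the freed-up capacity raises the fair share for the
    remaining, larger clients, which finally split it equally.
    """
    alloc = {}
    remaining = dict(demands)
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        # Clients whose demand fits under the current fair share.
        small = {c: d for c, d in remaining.items() if d <= share}
        if not small:
            # Everyone left wants more than the fair share: split equally.
            for c in remaining:
                alloc[c] = share
            return alloc
        for c, d in small.items():
            alloc[c] = d
            cap -= d
            del remaining[c]
    return alloc

# E.g. 10 units across demands of 2, 4 and 10: the big client is
# capped at the fair share left over after the small ones are served.
result = max_min_fair(10, {"a": 2, "b": 4, "c": 10})
```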
Retry Mechanisms: The speaker emphasizes limiting retry attempts, retrying only at the failing layer, using exponential backoff with jitter, and defining global error codes to avoid cascading retries. Tracking retry‑rate metrics is suggested for diagnostics.
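Capped attempts plus full‑jitter backoff can be sketched as follows (function names and parameter defaults are illustrative assumptions, not from the talk):

```python
import random

def backoff_with_jitter(attempt, base=0.1, cap=5.0):
    """Full-jitter exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)], so that clients that failed
    at the same moment do not retry in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=3):
    """Retry fn at most max_attempts times, re-raising the last error.
    Bounding attempts at the layer that actually failed prevents
    multiplicative retry amplification up the call chain."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # In real code, sleep here: time.sleep(backoff_with_jitter(attempt))
```

Note that the backoff is capped: without the `cap`, `2**attempt` grows so fast that a client could back off for minutes, which defeats fail‑fast behavior.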
Timeout Control: Timeouts are treated as a fail‑fast mechanism: proper settings prevent request queuing and thread blocking. Both in‑process and cross‑process timeout propagation are discussed, with a recommendation for defensive programming that clamps propagated timeout values to reasonable bounds.
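The defensive‑clamping idea can be sketched with a deadline budget: each hop computes the time remaining, caps it against a local ceiling (in case an upstream propagated an inflated value), and fails fast when the budget is already exhausted. The function name and thresholds here are illustrative assumptions:

```python
import time

def remaining_budget(deadline, floor=0.005, ceiling=1.0):
    """Return the timeout (seconds) to give the next downstream call.

    deadline: absolute time.monotonic() value the whole request must
    finish by. The budget is capped at `ceiling` (defense against a
    corrupt or inflated upstream value), and a budget below `floor`
    raises immediately rather than issuing a call doomed to time out.
    """
    budget = deadline - time.monotonic()
    if budget < floor:
        raise TimeoutError("deadline exhausted; failing fast")
    return min(budget, ceiling)

# A handler given 0.5 s passes at most ~0.5 s downstream; each hop
# recomputes, so time already spent is never granted twice.
deadline = time.monotonic() + 0.5
budget = remaining_budget(deadline)
```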
Handling Cascading Failures: A comprehensive set of measures is outlined: avoiding overload, applying rate limiting and graceful degradation, careful retry policies, coordinated client‑side flow control, strict timeout propagation, change‑management discipline, stress testing with fault injection, and capacity planning for multi‑cluster deployments.
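For client‑side flow control, one concrete technique from the Google SRE book is adaptive throttling: each client tracks recent requests and backend accepts, and rejects locally with probability proportional to how far acceptance lags behind. A minimal sketch (the multiplier `k` and window bookkeeping are assumptions of this illustration):

```python
import random

def should_reject(requests, accepts, k=2.0):
    """Client-side adaptive throttling (per the Google SRE book's
    'Handling Overload' chapter): reject locally with probability
    max(0, (requests - k*accepts) / (requests + 1)). While the backend
    accepts most traffic, the probability is 0; as accepts fall, the
    client sheds load itself instead of hammering an overloaded server."""
    p = max(0.0, (requests - k * accepts) / (requests + 1))
    return random.random() < p
```

With `k = 2`, clients still probe the backend with up to twice the accepted rate, so recovery is detected quickly once the backend comes back.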
The Q&A section addresses practical metrics for load balancing (CPU, health, latency), network paths (public vs. private), client‑side load, multi‑cluster costs, and timeout propagation nuances.
Overall, the presentation provides a systematic, cloud‑native reliability framework for large‑scale services.
Tencent Cloud Developer