Mastering High‑Quality Service Architecture: Load Balancing, Rate Limiting, Retries & Timeouts

This article distills insights from Bilibili's technical director on building architectures with high service quality. It covers load‑balancing strategies, rate‑limiting mechanisms, retry policies, timeout controls, and ways to prevent cascading failures in large‑scale systems.

dbaplus Community

Mar 25, 2021

1. Load Balancing

Load balancing is split into two layers: front‑end (user‑facing) and internal to the data center. Front‑end balancing aims to minimize user latency: DNS, dynamic CDN, and BFE (Baidu Front‑End) route traffic to the nearest data center, where API gateways forward it to micro‑services.

Prefer the nearest node.

Schedule based on bandwidth policies to select the appropriate API entry data center.

Balance traffic according to available service capacity.
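
The three front‑end rules above can be read as "pick the closest entry point that still has spare capacity". The following is a minimal sketch of that reading, not the author's scheduler: `DataCenter`, `pick_entry`, and the `rtt_ms`, `capacity`, and `load` fields are hypothetical names introduced for illustration, and bandwidth policy is left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DataCenter:
    # Hypothetical record; field names are assumptions, not from the article.
    name: str
    rtt_ms: float   # measured latency from the user to this entry point
    capacity: int   # requests per second the data center can serve
    load: int       # requests per second it is serving right now

    @property
    def headroom(self) -> int:
        return self.capacity - self.load


def pick_entry(candidates: Sequence[DataCenter],
               min_headroom: int = 1) -> Optional[DataCenter]:
    """Return the lowest-latency data center that still has spare capacity."""
    usable = [dc for dc in candidates if dc.headroom >= min_headroom]
    if not usable:
        return None  # every candidate is saturated; caller must degrade
    return min(usable, key=lambda dc: dc.rtt_ms)


dcs = [
    DataCenter("sh", rtt_ms=10, capacity=100, load=100),  # nearest but full
    DataCenter("bj", rtt_ms=30, capacity=100, load=40),
    DataCenter("gz", rtt_ms=50, capacity=100, load=10),
]
chosen = pick_entry(dcs)
print(chosen.name if chosen else "no capacity")
```

The nearest node is skipped only when it has no headroom, which keeps latency low in the common case while still respecting available capacity.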

Internal data‑center balancing should keep CPU usage across nodes uniform. Poor balancing leads to large CPU disparities, making resource scheduling and container orchestration difficult.

Even traffic distribution.

Reliable detection of abnormal nodes.

Scale‑out by adding homogeneous nodes.

Reduce errors and improve availability.

Two issues were observed. Health checks become expensive when RPC clients probe backends point to point, and a single cluster is a single failure domain, which calls for multi‑cluster deployment.

2. Rate Limiting

Overload protection is essential; graceful degradation and loss‑tolerant services are preferred over outright failure. The team runs a distributed quota server: each backend asks it for per‑client quotas in batches rather than on every request, which keeps the call rate to the quota server low.
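
One way a quota server could divide a fixed capacity among clients is max‑min fairness, the algorithm this section names. The sketch below uses the standard progressive‑filling formulation; the function name `max_min_fair` and the client names are hypothetical, and the article does not say how the real quota server is implemented.

```python
from typing import Dict


def max_min_fair(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Progressive filling: fully serve small demands, split the rest evenly."""
    allocation = {client: 0.0 for client in demands}
    remaining = float(capacity)
    pending = sorted(demands, key=lambda c: demands[c])  # ascending demand
    while pending and remaining > 0:
        share = remaining / len(pending)  # equal split of what is left
        client = pending[0]
        if demands[client] <= share:
            # The smallest demand fits inside an equal share: grant it in full.
            allocation[client] = demands[client]
            remaining -= demands[client]
            pending.pop(0)
        else:
            # Nobody left fits: every pending client gets the same share.
            for c in pending:
                allocation[c] = share
            remaining = 0.0
            pending = []
    return allocation


quotas = max_min_fair(10, {"small": 2, "medium": 4, "heavy": 10})
print(quotas)
```

With a capacity of 10, the small and medium clients receive what they asked for and the heavy client receives the remaining 4, so one heavy consumer cannot starve the others.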

Algorithmic details:

Max‑min fairness, which prevents a single heavy consumer from starving the others.

Client‑side fast rejection when quota is exhausted, avoiding unnecessary network traffic.

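Client‑side rejection probability (from Google SRE): max(0, (requests - K*accepts) / (requests + 1)), where requests counts the attempts made by the client, accepts counts the requests the backend accepted, and K is a tunable multiplier.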

A sliding average of CPU usage above a threshold (CPU > 800, presumably 80% on a per‑mille scale) triggers overload protection; once triggered, a request is dropped when (MaxPass * AvgRT) < InFlight.

After activation, CPU hovers around the critical value; a cooldown period prevents rapid oscillation that could otherwise flood the system with requests.
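
The CPU trigger, the in‑flight condition, and the cooldown can be combined as sketched below. This is an illustration under stated assumptions, not the team's limiter: the class `OverloadGuard` and its parameters are hypothetical, CPU is assumed to be sampled on a 0 to 1000 scale, the sliding average is approximated with an exponential moving average, `MaxPass` is assumed to be requests per second, and `AvgRT` is assumed to be in seconds so that their product estimates how many requests the service can hold concurrently.

```python
import time
from typing import Callable, Optional


class OverloadGuard:
    """Hypothetical guard: drop when CPU is hot and in-flight exceeds capacity."""

    def __init__(self, cpu_threshold: float = 800, cooldown_s: float = 1.0,
                 decay: float = 0.95,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cpu_threshold = cpu_threshold
        self.cooldown_s = cooldown_s
        self.decay = decay
        self.clock = clock
        self.cpu_avg = 0.0
        self._last_drop: Optional[float] = None

    def observe_cpu(self, sample: float) -> None:
        # Exponential moving average stands in for the sliding average.
        self.cpu_avg = self.decay * self.cpu_avg + (1 - self.decay) * sample

    def should_drop(self, max_pass: float, avg_rt_s: float, in_flight: int) -> bool:
        now = self.clock()
        hot = self.cpu_avg > self.cpu_threshold
        # Stay armed for a while after the last drop to avoid oscillation.
        cooling = (self._last_drop is not None
                   and now - self._last_drop < self.cooldown_s)
        if not (hot or cooling):
            return False
        if max_pass * avg_rt_s < in_flight:  # the article's protection condition
            self._last_drop = now
            return True
        return False


guard = OverloadGuard()
guard.observe_cpu(400)
print(guard.should_drop(max_pass=100, avg_rt_s=0.05, in_flight=50))  # CPU is cool
```

Keeping the guard armed during the cooldown means a brief dip in CPU right after shedding load does not immediately readmit the full request stream.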

3. Retries

Retry handling follows four principles:

Limit the number of retries and apply distribution‑aware strategies.

Retry only on the failing layer and define global error codes to avoid cascading retries.

Randomize retry timing: use exponential back‑off with jitter so that clients do not retry in lockstep.

Define retry‑rate metrics for fault diagnosis.
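
A bounded retry loop with exponential back‑off and jitter might look like the sketch below. It uses the "full jitter" variant (a uniform delay between zero and the capped exponential), which is one common choice and not necessarily the one the team uses. `RetryableError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the `base` and `cap` values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # retry limit reached: surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))
    raise AssertionError("unreachable")


attempts = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RetryableError("temporary failure")
    return "ok"


print(call_with_retries(flaky, sleep=lambda s: None))
```

Only errors explicitly marked retryable are retried, which matches the principle of retrying at the failing layer instead of letting every layer in the chain retry.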

On the client side, rate‑limiting is also applied to prevent excessive attempts against unavailable services.
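
Client‑side limiting of this kind can reuse the Google SRE drop probability quoted in the rate‑limiting section. The sketch below is a simplification under stated assumptions: the class `AdaptiveThrottle` is hypothetical, it keeps lifetime counters instead of counters over a recent time window, and K = 2 is only an example value.

```python
import random
from typing import Callable


class AdaptiveThrottle:
    """Hypothetical client-side throttle built on the SRE rejection formula."""

    def __init__(self, k: float = 2.0,
                 rng: Callable[[], float] = random.random) -> None:
        self.k = k
        self.rng = rng
        self.requests = 0  # attempts made by the client, local rejections included
        self.accepts = 0   # attempts the backend accepted

    def reject_probability(self) -> float:
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def allow(self) -> bool:
        """Decide locally whether to send the next request."""
        p = self.reject_probability()
        self.requests += 1
        return self.rng() >= p

    def record(self, accepted: bool) -> None:
        """Report the backend's answer for a request that was actually sent."""
        if accepted:
            self.accepts += 1


throttle = AdaptiveThrottle()
for _ in range(10):
    if throttle.allow():
        throttle.record(accepted=True)
print(throttle.reject_probability())
```

While the backend accepts most traffic the probability stays at zero; once accepts fall behind requests by more than a factor of K, the client starts rejecting locally and spares the unavailable service.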

4. Timeouts

Timeouts are treated as a fail‑fast mechanism. A misconfigured timeout lets slow calls hold resources: requests pile up, threads block, and the service eventually fails.

Process‑internal timeout: check remaining time before each network request; internal computation is usually short and may not need strict limits.

Cross‑process timeout propagation: pass timeout information via RPC context to keep the entire call chain within a reasonable bound (typically under one second).
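
Both rules can be sketched with a single deadline object that is checked before each network call and serialized into request metadata for the next hop. This is an illustration, not a real RPC framework API: the `Deadline` class, the `timeout-ms` metadata key, and the one‑second default are assumptions.

```python
import time
from typing import Callable, Dict


class DeadlineExceeded(Exception):
    """Raised when no time budget is left for another call."""


class Deadline:
    def __init__(self, budget_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, configured_s: float) -> float:
        """Timeout for the next call: the smaller of local config and what is left."""
        left = self.remaining()
        if left <= 0:
            raise DeadlineExceeded("no budget left, fail fast")
        return min(configured_s, left)

    def to_metadata(self) -> Dict[str, str]:
        # Hypothetical key; carries the remaining budget to the downstream service.
        return {"timeout-ms": str(int(self.remaining() * 1000))}

    @classmethod
    def from_metadata(cls, metadata: Dict[str, str], default_s: float = 1.0,
                      clock: Callable[[], float] = time.monotonic) -> "Deadline":
        raw = metadata.get("timeout-ms")
        budget = int(raw) / 1000 if raw is not None else default_s
        return cls(budget, clock)


deadline = Deadline(budget_s=1.0)
print(deadline.timeout_for(0.3))        # local setting wins while budget is ample
print(sorted(deadline.to_metadata()))   # keys sent along with the outgoing request
```

A downstream service that rebuilds its deadline with `from_metadata` never waits longer than its caller is still willing to wait, which keeps the whole call chain inside the original bound.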

5. Handling Cascading Failures

Key measures to prevent chain reactions:

Avoid overload through self‑protection mechanisms.

Apply rate limiting to isolate abusive clients and enable graceful degradation.

Implement back‑off‑aware retry strategies to limit traffic amplification.

Control timeouts both intra‑process and across processes.

Strengthen change‑management procedures and penalize destructive actions.

Conduct stress testing beyond error thresholds and regular chaos engineering drills.

Plan capacity expansion, restarts, and removal of harmful traffic.

These practices together form a layered defense that improves overall system reliability under traffic spikes.

Q&A Highlights

Metrics for load balancing: CPU usage (server side), health‑check success rate and latency (client side), and request distribution per backend.

Traffic from BFE to CLB may traverse the public internet or dedicated lines.

Even a few thousand clients generating periodic pings can cause noticeable CPU overhead due to aggregated health‑checks.

Multi‑cluster deployment doubles resources and cost but improves availability for critical services.

Timeout propagation can be overridden in code when necessary.

Node quality vs. capacity is balanced by routing to the nearest non‑overloaded node, prioritizing user‑experience‑critical services.
