Mastering High‑Quality Service Architecture: Load Balancing, Rate Limiting, Retries & Timeouts
This article distills insights from Bilibili's technical director on building architectures with high service quality. It covers load‑balancing strategies, rate‑limiting mechanisms, retry policies, timeout controls, and ways to prevent cascading failures in large‑scale systems.
1. Load Balancing
Load balancing is split into a front‑end (user‑facing) layer and a layer inside the data center. Front‑end balancing aims to minimize user latency: traffic is routed via DNS, dynamic CDN, and BFE (Baidu Front‑End) to the nearest data center, then through API gateways to micro‑services. It follows three rules:
Prefer the nearest node.
Schedule based on bandwidth policies to select the appropriate API entry data center.
Balance traffic according to available service capacity.
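The first and third rules can be combined into a small selection routine. The following is a minimal sketch, not the routing logic described in the talk: the `DataCenter` fields, the `pick_data_center` helper, and the fallback to the least‑utilized site are all assumptions made for illustration, and the bandwidth policy is left out.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    latency_ms: float    # latency seen from the user's region (hypothetical metric)
    capacity_rps: float  # request rate the site can serve
    load_rps: float      # request rate it is serving right now

    def has_headroom(self, extra_rps: float = 1.0) -> bool:
        return self.load_rps + extra_rps <= self.capacity_rps


def pick_data_center(candidates, extra_rps=1.0):
    """Return the lowest-latency site that still has spare capacity.

    If every site is saturated, fall back to the least-utilized one
    (an assumed policy; the source does not specify the fallback).
    """
    with_room = [dc for dc in candidates if dc.has_headroom(extra_rps)]
    if with_room:
        return min(with_room, key=lambda dc: dc.latency_ms)
    return min(candidates, key=lambda dc: dc.load_rps / dc.capacity_rps)
```

With three sites where the nearest one is already at capacity, the routine skips it and returns the next‑nearest site that still has headroom.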
Balancing inside the data center should keep CPU usage uniform across nodes. Poor balancing leads to large CPU disparities between nodes, which makes resource scheduling and container orchestration difficult. Its goals are:
Even traffic distribution.
Reliable detection of abnormal nodes.
Scale‑out by adding homogeneous nodes.
Reduce errors and improve availability.
Two issues were observed in practice. Health checks become expensive when RPC clients probe backends point‑to‑point, and a single cluster is a single failure domain, which calls for multi‑cluster deployment.
2. Rate Limiting
Overload protection is essential: services should degrade gracefully and tolerate losing some requests rather than fail outright. The team runs a distributed quota server that each backend queries for per‑client quotas, in a way that reduces how often the quota server has to be called.
Algorithmic details:
Max‑min fairness algorithm, so that a single heavy consumer cannot starve the others.
Client‑side fast rejection when quota is exhausted, avoiding unnecessary network traffic.
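Formula (from Google SRE) for the probability that a client drops a request locally: max(0, (requests - K*accepts) / (requests + 1)), where requests is the number of attempts, accepts is the number the backend accepted, and K is a multiplier (2 in the SRE book's example).
A sliding average of CPU usage above a threshold (CPU > 800, presumably on a 0 to 1000 scale, i.e. 80%) triggers overload protection; requests are then shed while (MaxPass * AvgRT) < InFlight.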
After activation, CPU hovers around the critical value; a cooldown period prevents rapid oscillation that could otherwise flood the system with requests.
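The three calculations in this list can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the team's implementation: the function names are hypothetical, `max_pass` is assumed to be in requests per second and `avg_rt_s` in seconds, the CPU value is assumed to be on a 0 to 1000 scale, and the cooldown period is not modeled.

```python
def max_min_fair(capacity, demands):
    """Split `capacity` among clients using max-min fairness.

    `demands` maps client name -> requested amount. Clients that ask for
    no more than an equal share get their full demand; what they leave
    unused is re-split among the remaining clients.
    """
    allocation = {}
    remaining = float(capacity)
    pending = dict(demands)
    while pending:
        share = remaining / len(pending)
        satisfied = {c: d for c, d in pending.items() if d <= share}
        if not satisfied:
            # Everyone left wants more than the equal share: cap them all.
            for client in pending:
                allocation[client] = share
            break
        for client, demand in satisfied.items():
            allocation[client] = float(demand)
            remaining -= demand
            del pending[client]
    return allocation


def reject_probability(requests, accepts, k=2.0):
    """Client-side drop probability: max(0, (requests - K*accepts) / (requests + 1))."""
    return max(0.0, (requests - k * accepts) / (requests + 1))


def should_shed(cpu, max_pass, avg_rt_s, in_flight, cpu_threshold=800):
    """True when CPU is above the threshold and MaxPass * AvgRT < InFlight."""
    return cpu > cpu_threshold and max_pass * avg_rt_s < in_flight
```

For example, with a capacity of 10 and demands of 2, 4, and 10, the two small clients are served in full and the heavy client is capped at the remaining 4. A client whose backend accepted only 20 of its last 100 requests would drop new requests locally with probability 60/101.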
3. Retries
Retry handling follows four principles:
Limit the number of retries, and apply policies based on the distribution of retries (for example, capping the share of requests that are retries).
Retry only at the layer where the failure occurred, and define global error codes that tell upstream layers not to retry again, which avoids cascading retries.
Randomize retry timing: use exponential back‑off with jitter.
Define retry‑rate metrics for fault diagnosis.
On the client side, rate‑limiting is also applied to prevent excessive attempts against unavailable services.
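The first and third principles, a retry limit and exponential back‑off with jitter, can be sketched as follows. This is a minimal sketch with assumed names and defaults: `RetryableError`, `backoff_delay`, `call_with_retries`, the base delay, and the cap are all hypothetical, and the jitter variant shown is "full jitter" (a uniform delay between zero and the exponential bound), which is one common choice rather than something the source specifies.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delay(attempt, base=0.1, cap=2.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(func, max_attempts=3, base=0.1, cap=2.0,
                      sleep=time.sleep, rng=random):
    """Call func() at most max_attempts times with jittered back-off in between."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # retry limit reached: surface the error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. Errors that are not `RetryableError` propagate immediately, which keeps non‑retryable failures from being amplified.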
4. Timeouts
Timeouts are treated as a fail‑fast mechanism. Misconfigured timeouts cause high‑latency services, request pile‑up, thread blockage, and eventual failures.
Process‑internal timeout: check remaining time before each network request; internal computation is usually short and may not need strict limits.
Cross‑process timeout propagation: pass timeout information via RPC context to keep the entire call chain within a reasonable bound (typically under one second).
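Both rules can be illustrated with a small deadline object that is checked before each network call and whose remaining budget is forwarded downstream. This is a sketch under assumptions, not a real RPC framework's API: `Deadline`, `call_downstream`, the `send` callback, and the `timeout_ms` metadata key are all hypothetical names.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request's time budget is already spent."""


class Deadline:
    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        if self.remaining() <= 0.0:
            raise DeadlineExceeded("time budget exhausted")


def call_downstream(deadline, send, default_timeout_s=0.5):
    """Fail fast if the budget is spent, else forward the remaining budget.

    `send(metadata, timeout_s)` stands in for an RPC client call; the
    downstream service would read `timeout_ms` from the metadata to build
    its own Deadline.
    """
    deadline.check()
    timeout_s = min(default_timeout_s, deadline.remaining())
    metadata = {"timeout_ms": int(timeout_s * 1000)}
    return send(metadata, timeout_s)
```

Taking the minimum of the local default and the remaining budget means a hop late in the call chain never waits longer than the original caller is still willing to wait.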
5. Handling Cascading Failures
Key measures to prevent chain reactions:
Avoid overload through self‑protection mechanisms.
Apply rate limiting to isolate abusive clients and enable graceful degradation.
Implement back‑off‑aware retry strategies to limit traffic amplification.
Control timeouts both intra‑process and across processes.
Strengthen change‑management procedures and penalize destructive actions.
Conduct stress testing beyond error thresholds and regular chaos engineering drills.
Plan capacity expansion, restarts, and removal of harmful traffic.
These practices together form a layered defense that improves overall system reliability under traffic spikes.
Q&A Highlights
Metrics for load balancing: CPU usage (server side), health‑check success rate and latency (client side), and request distribution per backend.
Traffic from BFE to the CLB (cloud load balancer) may traverse the public internet or dedicated lines.
Even a few thousand clients generating periodic pings can cause noticeable CPU overhead due to aggregated health‑checks.
Multi‑cluster deployment doubles resources and cost but improves availability for critical services.
Timeout propagation can be overridden in code when necessary.
Node quality vs. capacity is balanced by routing to the nearest non‑overloaded node, prioritizing user‑experience‑critical services.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
