
High‑Availability Architecture Practices from Bilibili: Load Balancing, Rate Limiting, Retries, and Timeout Strategies

This article presents Bilibili’s high‑availability design, covering load‑balancing decisions, subset selection, multi‑cluster deployment, adaptive rate limiting, retry policies, timeout propagation, and chain‑failure mitigation, all illustrated with diagrams and practical SRE insights.

Top Architect

Jan 8, 2022

In a recent Cloud+ community talk, Bilibili shared its high‑availability architecture, focusing on how to maintain service quality under traffic spikes using Google SRE principles and systematic availability designs.

Load Balancing

BFE, the edge node, selects a downstream IDC based on proximity, available bandwidth, and whether the IDC is already overloaded.

When traffic reaches an IDC, the load‑balancing algorithm decides how to distribute requests.

Problem: RPC health checks (ping‑pong) consume significant resources, especially when many services maintain long‑lived connections.

Solution: Instead of having every client connect to all backends, use a subset selection algorithm in which each client connects to only a small group of backends, as described in the book *Site Reliability Engineering*.
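The subsetting idea can be sketched in Go along the lines of the deterministic subsetting algorithm in *Site Reliability Engineering*. This is a minimal sketch rather than Bilibili's implementation: the function name `Subset`, the integer client IDs, and the string backend addresses are assumptions made for illustration.

```go
package main

import "math/rand"

// Subset returns the backends that the client identified by clientID should
// connect to. Clients in the same "round" share one shuffle of the backend
// list and take disjoint slices of it, so connections spread evenly.
// Assumes clientID >= 0 and 0 < subsetSize <= len(backends).
func Subset(backends []string, clientID, subsetSize int) []string {
	subsetCount := len(backends) / subsetSize

	// Every client in one round uses the same seed, hence the same shuffle.
	round := clientID / subsetCount
	r := rand.New(rand.NewSource(int64(round)))

	// Shuffle a copy so the caller's slice is left untouched.
	shuffled := make([]string, len(backends))
	copy(shuffled, backends)
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	// The client's position within the round selects its slice.
	subsetID := clientID % subsetCount
	start := subsetID * subsetSize
	return shuffled[start : start+subsetSize]
}
```

With 12 backends and a subset size of 3, clients 0 to 3 form one round and together cover each backend exactly once; client 4 starts a new round with a different shuffle.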

To avoid single‑cluster jitter, deploy multiple clusters.

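JSQ (Join‑Shortest‑Queue) lacks a global view: each client sees only its own outstanding requests, so it may keep picking the same server even when that server is busy with other clients' traffic. A more balanced approach is needed.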

Java services can suffer high latency during GC or FullGC, causing overload; new nodes experience JIT warm‑up jitter, requiring pre‑warming strategies.

After applying the “choice‑of‑2” algorithm, CPU loads across machines converge; clients obtain backend load metrics via middleware or RPC response metadata.
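A minimal Go sketch of the choice‑of‑2 idea follows: sample two distinct backends at random and send the request to the less loaded one. The `Backend` type and its `Load` field are hypothetical; in practice the load figure would come from whatever metric the backend reports back to the client.

```go
package main

import "math/rand"

// Backend is a hypothetical client-side view of one server.
type Backend struct {
	Addr string
	Load float64 // for example, CPU utilisation taken from response metadata
}

// PickTwo samples two distinct backends uniformly at random and returns the
// one with the lower load. Assumes len(backends) >= 1.
func PickTwo(backends []Backend) Backend {
	if len(backends) == 1 {
		return backends[0]
	}
	i := rand.Intn(len(backends))
	j := rand.Intn(len(backends) - 1)
	if j >= i {
		j++ // shift so that j never equals i
	}
	if backends[j].Load < backends[i].Load {
		return backends[j]
	}
	return backends[i]
}
```

Because two candidates are always compared, the most loaded backend is never chosen while any less loaded one exists, yet clients do not all rush to the single least loaded server.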

JIT warm‑up can be triggered manually, or handled gradually by assigning a new node a penalty value so that it receives traffic in small, increasing amounts.

Rate Limiting

QPS‑based limits have pitfalls: different request parameters affect throughput, and static thresholds are hard to maintain in evolving services.

APIs have varying importance; critical APIs may receive higher limits.

Adaptive limiting can replace manual per‑service limits, using historical QPS windows to allocate quota.

Clients calculate required quota based on recent QPS; nodes with uneven load can use a max‑min fairness algorithm to distribute resources more evenly.
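Max‑min fairness can be illustrated with a short Go sketch. It assumes each client reports a single numeric demand and that the quota to divide is a single number; `MaxMinFair` is a hypothetical name, and the talk does not say how Bilibili implements the allocation.

```go
package main

import "sort"

// MaxMinFair splits capacity among demands so that nobody receives more than
// they asked for, and any share a small client leaves unused is handed to
// the clients that still want more.
func MaxMinFair(capacity float64, demands []float64) []float64 {
	alloc := make([]float64, len(demands))

	// Visit clients from the smallest demand to the largest.
	order := make([]int, len(demands))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(a, b int) bool {
		return demands[order[a]] < demands[order[b]]
	})

	remaining := capacity
	for k, idx := range order {
		// Equal split of what is left among the clients not yet served.
		fairShare := remaining / float64(len(order)-k)
		grant := demands[idx]
		if grant > fairShare {
			grant = fairShare
		}
		alloc[idx] = grant
		remaining -= grant
	}
	return alloc
}
```

For a capacity of 100 and demands of 80, 10, and 50, the client asking for 10 is satisfied in full and the other two receive 45 each.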

When backends constantly reject requests (e.g., with 503), client‑side throttling is needed. A probabilistic drop algorithm counts two values on the client, requests (everything the client tried to send) and accepts (what the backend accepted), and derives a discard probability from their ratio without any external coordinator. When the error rate is high, the discard probability approaches 1, which protects the service.
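The client‑side throttling formula given in *Site Reliability Engineering* is max(0, (requests − K × accepts) / (requests + 1)). The Go sketch below implements it as written; the function names are assumptions, both counters are expected to cover a recent time window, and K = 2 is the multiplier commonly cited with this formula rather than a value confirmed by the talk.

```go
package main

import "math/rand"

// RejectProbability returns the chance that the client should drop a request
// locally instead of sending it. requests and accepts are counts over a
// recent window; k controls how aggressively the client throttles.
func RejectProbability(requests, accepts, k float64) float64 {
	p := (requests - k*accepts) / (requests + 1)
	if p < 0 {
		return 0
	}
	return p
}

// ShouldDropLocally rolls the dice for a single outgoing request.
func ShouldDropLocally(requests, accepts, k float64) bool {
	return rand.Float64() < RejectProbability(requests, accepts, k)
}
```

While the backend accepts everything, the probability stays at 0; when it accepts nothing, the probability is requests / (requests + 1), which approaches 1 as the window fills up.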

Chain failures often start with a single overloaded node and then snowball; overload protection lets each node protect itself.

Bilibili adopts a TCP‑BBR‑style detection algorithm: when CPU exceeds 80 % and throughput spikes, excess requests are dropped.

CPU sliding averages smooth spikes; different API priorities can have distinct thresholds (e.g., 80 % for low priority, 90 % for high priority).
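One simple way to smooth CPU readings is an exponential moving average, shown below in Go together with the per‑priority thresholds from the example above. The exponential form and the `decay` parameter are assumptions chosen for brevity; the talk does not specify which kind of sliding average is used.

```go
package main

// SmoothCPU folds a new CPU sample (in percent) into the running average.
// decay close to 1 keeps more history and reacts more slowly to spikes.
func SmoothCPU(prev, sample, decay float64) float64 {
	return prev*decay + sample*(1-decay)
}

// Threshold returns the CPU percentage above which requests of the given
// priority start being shed: 80 for low priority, 90 for high priority.
func Threshold(highPriority bool) float64 {
	if highPriority {
		return 90
	}
	return 80
}

// Overloaded reports whether a request of the given priority should be
// treated as arriving at an overloaded node.
func Overloaded(smoothedCPU float64, highPriority bool) bool {
	return smoothedCPU > Threshold(highPriority)
}
```

At a smoothed CPU of 85 %, low‑priority requests are already candidates for shedding while high‑priority requests still pass.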

The number of requests a node can hold in flight is estimated with Little's Law: QPS × latency = in‑flight requests. Dropping traffic quickly reduces CPU, so the algorithm adds a cooldown period to avoid oscillation.
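Little's Law relates the number of requests in flight to arrival rate and latency (L = λ × W). The Go sketch below shows one way a BBR‑style shedder might use it: reject a request only when the node is hot, or has shed load within the cooldown period, and already holds more requests than the estimate allows. The `Sample` type, its field names, and the exact decision rule are assumptions, not Bilibili's code.

```go
package main

import "time"

// MaxInFlight applies Little's Law: peak throughput times best-case latency
// gives the number of requests the node can process concurrently.
func MaxInFlight(maxQPS float64, minLatency time.Duration) float64 {
	return maxQPS * minLatency.Seconds()
}

// Sample is a hypothetical snapshot of one node's state.
type Sample struct {
	CPU           float64       // smoothed CPU utilisation in percent
	InFlight      int           // requests currently being processed
	MaxQPS        float64       // highest throughput seen in the recent window
	MinLatency    time.Duration // lowest latency seen in the recent window
	SinceLastDrop time.Duration // time elapsed since the last rejected request
}

// ShouldDrop reports whether a new request should be rejected. The cooldown
// keeps the limit in force for a while after a drop, so that falling CPU
// does not immediately reopen the gates and cause oscillation.
func ShouldDrop(s Sample, cpuThreshold float64, cooldown time.Duration) bool {
	hot := s.CPU > cpuThreshold
	cooling := s.SinceLastDrop < cooldown
	if !hot && !cooling {
		return false
	}
	return float64(s.InFlight) > MaxInFlight(s.MaxQPS, s.MinLatency)
}
```

With a peak of 1000 QPS and a minimum latency of 50 ms, the estimate is 50 concurrent requests; a hot node holding 60 would reject the next one, while a node holding 40 would accept it.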

Retry Strategies

BFE: dynamic CDN

SLB: LVS + Nginx (layer 4/7 load balancing)

BFF: business‑logic composition

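Problem: Retries at every layer multiply each other, so failures are amplified exponentially. If each of three layers makes up to three attempts, one failing call can turn into 27 attempts at the bottom layer.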

Solution: Retry only at the failing layer and return a standardized error code (e.g., “overload”) so callers stop further retries.

API‑level retries should consider backend overload; a global retry ratio (e.g., 10 %) can cap amplification.
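A retry ratio can be enforced with a small budget counter, sketched here in Go. The `RetryBudget` type is hypothetical; a production version would count over a sliding window and guard the counters against concurrent access, both of which are left out to keep the idea visible.

```go
package main

// RetryBudget permits retries only while they stay under a fixed fraction
// of the requests seen so far.
type RetryBudget struct {
	Ratio    float64 // for example 0.1 to allow 10 % retries
	requests int
	retries  int
}

// RecordRequest counts one original (non-retry) request.
func (b *RetryBudget) RecordRequest() {
	b.requests++
}

// AllowRetry reports whether one more retry fits in the budget and, if it
// does, counts it against the budget.
func (b *RetryBudget) AllowRetry() bool {
	if float64(b.retries+1) > b.Ratio*float64(b.requests) {
		return false
	}
	b.retries++
	return true
}
```

With a ratio of 0.1, a client that has sent 100 requests may issue 10 retries; the eleventh is refused, so retries can add at most 10 % to backend traffic.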

Introduce random jitter and exponential back‑off to avoid synchronized retry spikes.
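Exponential back‑off with jitter can be sketched in Go as follows. The sketch uses the "full jitter" variant, where the wait is drawn uniformly between zero and the current ceiling; the function name and parameters are assumptions, and the talk does not say which jitter variant Bilibili uses.

```go
package main

import (
	"math/rand"
	"time"
)

// Backoff returns how long to wait before retry number attempt (0-based).
// The ceiling doubles with every attempt until it reaches max, and the
// actual wait is a random duration in [0, ceiling].
// Assumes base > 0 and max >= base.
func Backoff(attempt int, base, max time.Duration) time.Duration {
	ceiling := base
	for i := 0; i < attempt && ceiling < max; i++ {
		ceiling *= 2
	}
	if ceiling > max {
		ceiling = max
	}
	// Randomising the wait keeps clients that failed together from
	// retrying together.
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}
```

With a base of 100 ms and a maximum of 2 s, the ceilings for successive attempts are 100 ms, 200 ms, 400 ms, 800 ms, 1.6 s, and then 2 s from there on.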

Separate retry statistics from normal QPS charts for clearer diagnosis.

Timeout Handling

Most failures stem from unreasonable timeout settings.

High‑latency downstream services can block client threads, causing request queues and OOM.

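Timeouts exist to fail fast: a request that is too slow should be abandoned rather than left to hold resources.

If a downstream service responds after the upstream caller has already timed out, the work is wasted.

When a service must respond within 1 s, each layer should deduct the time it has consumed from the remaining budget. In Go this is typically done with Context, with each call taking the smaller of the remaining budget and its own configured timeout.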

Timeout budgets can be propagated via RPC metadata (e.g., “700 ms” passed to downstream services) or defined in IDL files.
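The budget arithmetic can be sketched with Go's standard `context` package. The helper names `NewRequestContext` and `ChildTimeout` are hypothetical; how the resulting value is written into RPC metadata depends on the RPC framework and is not shown.

```go
package main

import (
	"context"
	"time"
)

// NewRequestContext starts the overall budget at the entry point, for
// example 1 s for the whole request.
func NewRequestContext(total time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.Background(), total)
}

// ChildTimeout returns the timeout to use for a downstream call: the
// configured value, shrunk to whatever is left of the caller's budget.
// The result can be zero or negative if the budget is already spent, in
// which case the caller should give up instead of calling downstream.
func ChildTimeout(ctx context.Context, configured time.Duration) time.Duration {
	deadline, ok := ctx.Deadline()
	if !ok {
		return configured // no budget attached; use the local setting
	}
	if left := time.Until(deadline); left < configured {
		return left
	}
	return configured
}
```

If 300 ms of a 1 s budget have been used and the downstream call is configured with a 1 s timeout, `ChildTimeout` returns roughly 700 ms, which is the kind of value that would be passed along. Within a single process, `context.WithTimeout` already behaves this way: a child context never outlives its parent's deadline.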

Mitigating Chain Failures

Graceful degradation: serve personalized responses under normal load, and fall back to popular content only when the system is under pressure.

Q&A

Q: What metrics drive load balancing? A: Server‑side CPU and client‑side health score (connection success rate, latency) are normalized into a linear scoring equation.

Q: Does BFE to SLB use public network or dedicated lines? A: Both are used.

Q: Will thousands of ping‑pong checks per client cause high CPU? A: Yes, especially with many backends.

Q: Are there blocking points when switching clusters? A: Clients maintain connections to all clusters; subset algorithm and per‑cluster caches avoid blocking.

Q: How are load‑balancer probes implemented? A: Using penalty values (e.g., 5 s) to gradually increase traffic.

Q: Is there an open‑source Quota‑Server implementation? A: Current solutions focus on single‑node limits.

Q: Is client‑side statistics overhead too high? A: Sidecar or Service‑Mesh can collect metrics efficiently.

Q: Should timeout be strict? A: In some cases, RPC Context can allow work to continue even after timeout.

Q: Does measuring CPU per RPC cost a lot? A: A background thread periodically computes smoothed CPU averages.

Q: Why do online and test‑environment CPU profiles differ? A: RPC routing with shadow databases can cause variance.

Q: How to handle CC attacks? A: Edge nodes and core data centers detect traffic patterns and apply controls.



