Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Efficient Ops

Feb 6, 2017

1. Introduction

More than three years ago, the author was responsible for an activity operation system at Tencent that began throwing numerous anomalies after traffic grew several‑fold. He was effectively on call 24/7, handling alerts on nights and weekends.

His leader advised him to stop acting as a “firefighter” and start thinking about the root causes from a system‑wide perspective.

The author realized that constantly “extinguishing fires” is unsustainable; the system itself must be able to “auto‑extinguish” by having built‑in fault tolerance.

Over three years the system grew from a web service handling millions of requests per day into a platform handling up to 800 million requests on a peak day.

Fault tolerance is a key indicator of system robustness, and this article focuses on practical fault‑tolerance techniques.

Note: the system discussed here is the QQ membership activity platform, abbreviated as AMS.

2. Retry Mechanism

2.1 Simple Retry

The simplest fault‑tolerance method is “retry on failure”. While easy to implement, it can cause a “snowball” effect: when most requests are failing, retrying each one once sends nearly double the request load to a backend that is already struggling.

Example: a service that normally has a 99.9 % success rate drops to 95 % due to a transient issue. If failures are independent, a single retry raises the effective success rate back to roughly 99.75 %.
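
The arithmetic behind that figure can be written out as follows. It assumes each attempt succeeds independently with probability p = 0.95, which is an assumption; failures during a real outage are often correlated, in which case a retry helps less.

```latex
P(\text{success within one retry}) = 1 - (1 - p)^2 = 1 - 0.05^2 = 0.9975
```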

However, if the service is truly problematic, the extra traffic may overwhelm it, leading to a system crash. In real business scenarios, users may repeatedly click a failing feature, amplifying the traffic surge.

Simple retries should therefore be limited to appropriate scenarios. One possible safeguard is to disable retries automatically when the service's success rate falls below a threshold.
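
A minimal sketch of a retry that switches itself off when the backend looks unhealthy. The class name `GatedRetry`, the 0.9 threshold, and the 100‑request window are illustrative assumptions, not values from the AMS system.

```python
from collections import deque


class GatedRetry:
    """Retry once on failure, but only while the recent success rate is healthy."""

    def __init__(self, min_success_rate=0.9, window=100):
        self.min_success_rate = min_success_rate  # hypothetical threshold
        # Outcomes (True/False) of the most recent first attempts.
        self.results = deque(maxlen=window)

    def success_rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def call(self, func, *args, **kwargs):
        try:
            result = func(*args, **kwargs)
            self.results.append(True)
            return result
        except Exception:
            self.results.append(False)
            if self.success_rate() < self.min_success_rate:
                # Backend looks unhealthy: fail fast instead of adding retry traffic.
                raise
            # Backend looks healthy overall: one retry (its outcome is not recorded).
            return func(*args, **kwargs)
```

With this shape, a transient blip is still masked by the retry, while a sustained outage sees at most one attempt per request.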

2.2 Primary‑Backup Automatic Switch

Instead of retrying the same service, the author describes using two independent services (A and B) to obtain an OpenID. If A fails, the request is automatically switched to B, avoiding double load on A.

Issues with this approach include resource waste (the backup may sit idle) and increased latency because the request must wait for the primary to fail before trying the backup.

In the AMS system this mechanism is used sparingly because a primary‑backup pair is not considered sufficiently reliable.
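
The switch described in this section can be sketched as below. The function `get_openid` and the two provider callables are hypothetical stand‑ins for services A and B; the real interfaces are not described in the article.

```python
def get_openid(user_key, primary, backup):
    """Ask the primary provider first; on any error, ask the backup once.

    `primary` and `backup` are callables standing in for services A and B.
    The failed primary call is not repeated, so A sees no extra load.
    """
    try:
        return primary(user_key)
    except Exception:
        # The caller has already waited for the primary to fail at this point,
        # which is the latency cost mentioned above.
        return backup(user_key)
```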

3. Dynamic Removal or Recovery of Faulty Machines

The AMS backend consists of hundreds of stateless services registered with an internal intelligent routing service (L5). L5 automatically removes a machine when its success rate falls below 50 % and later reintegrates it after a successful probe.

Example: a service group with machines A, B, C, D; if A becomes unavailable, L5 removes it, leaving B‑C‑D to serve traffic. When A recovers, it is added back.
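
The removal and recovery behaviour can be illustrated with a small in‑process sketch. This is not L5's implementation or API; the `NodePool` class, the `min_samples` guard, and the probe callback are assumptions made for illustration, and only the 50 % threshold comes from the text.

```python
import random


class NodePool:
    """Tracks per-node success rates and removes or restores nodes."""

    def __init__(self, nodes, min_success_rate=0.5, min_samples=10):
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples  # hypothetical: avoid judging on too few calls
        self.stats = {n: [0, 0] for n in nodes}  # node -> [successes, total]
        self.removed = set()

    def report(self, node, ok):
        """Record the outcome of one call and remove the node if it is failing."""
        s = self.stats[node]
        s[0] += int(ok)
        s[1] += 1
        if s[1] >= self.min_samples and s[0] / s[1] < self.min_success_rate:
            self.removed.add(node)

    def healthy(self):
        return [n for n in self.stats if n not in self.removed]

    def pick(self):
        """Choose a node to serve the next request (raises if none are healthy)."""
        return random.choice(self.healthy())

    def probe(self, node, check):
        """Re-admit a removed node if the probe callable reports success."""
        if node in self.removed and check(node):
            self.removed.discard(node)
            self.stats[node] = [0, 0]  # start from a clean record
```

In the A, B, C, D example, ten failed reports for A leave `healthy()` returning B, C, and D until a probe of A succeeds.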

4. Timeout Settings

4.1 Reasonable Timeouts for Services and Storages

Setting appropriate timeouts is crucial. If a worker thread has a 5 s timeout but the average processing time is 100 ms, a single timed‑out request blocks the worker for the full 5 s, time in which it could have served about 50 normal requests. Throughput drops drastically as a result.

Reducing the timeout (e.g., to 500 ms) can improve throughput but may also lower success rates for longer‑running requests.
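
To make the trade‑off concrete, here is a rough model of a single blocking worker. The 100 ms, 5 s, and 500 ms figures come from the text above; the assumption that 1 % of requests hit the timeout is invented for illustration.

```python
def worker_throughput(normal_s, timeout_s, timeout_fraction):
    """Requests per second one blocking worker can sustain.

    normal_s:         time a normal request takes
    timeout_s:        time a timed-out request occupies the worker
    timeout_fraction: share of requests that run into the timeout (assumed)
    """
    avg_s = (1 - timeout_fraction) * normal_s + timeout_fraction * timeout_s
    return 1.0 / avg_s


for timeout_s in (5.0, 0.5):
    rate = worker_throughput(normal_s=0.1, timeout_s=timeout_s, timeout_fraction=0.01)
    print(f"timeout={timeout_s}s -> {rate:.2f} requests/s per worker")
```

Under these assumptions the worker handles about 6.7 requests per second with the 5 s timeout and about 9.6 with the 500 ms timeout, against an ideal of 10.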

4.2 Short Timeouts Reduce Success Rate

Uniform short timeouts cause many normally successful requests to be treated as failures. The author suggests “fast‑slow separation”: configure different timeout values for fast services (e.g., 1 s) and slower services (e.g., 5 s).
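
A minimal sketch of fast‑slow separation as configuration. The 1 s and 5 s values are the examples from the text; the service names and the rule of defaulting unknown services to the fast class are assumptions.

```python
# Timeout per service class, in seconds (values taken from the example above).
TIMEOUTS = {"fast": 1.0, "slow": 5.0}

# Hypothetical service names; each service is assigned to one class.
SERVICE_CLASS = {
    "query_profile": "fast",
    "deliver_gift": "slow",
}


def timeout_for(service):
    """Return the timeout for a service, treating unknown services as fast."""
    return TIMEOUTS[SERVICE_CLASS.get(service, "fast")]
```

Defaulting to the short timeout means a newly added service has to be marked as slow explicitly before it is allowed to hold a worker for longer.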

4.3 Solving Synchronous Blocking

Even with fast‑slow separation, long‑running services still block threads. The solution is to use I/O multiplexing with asynchronous callbacks or coroutines, so that a thread can handle other requests while waiting for I/O.
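
The effect can be demonstrated with Python's `asyncio`, used here only as an illustration; the article does not say which framework AMS uses. `asyncio.sleep` stands in for a non‑blocking network call, and the backend names and latencies are made up.

```python
import asyncio

completion_order = []


async def call_backend(name, latency_s):
    # Stands in for a non-blocking network call; the thread is free while waiting.
    await asyncio.sleep(latency_s)
    completion_order.append(name)
    return name


async def handle_all():
    # One thread starts the slow call first, then serves the fast ones meanwhile.
    return await asyncio.gather(
        call_backend("slow", 0.2),
        call_backend("fast-1", 0.05),
        call_backend("fast-2", 0.05),
    )


results = asyncio.run(handle_all())
print(results)           # results are returned in the order the calls were issued
print(completion_order)  # the fast calls finish before the slow one
```

Although the slow call was issued first, it does not hold up the two fast calls, which is the behaviour a blocking worker thread cannot provide.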

5. Business‑Level Fault Tolerance

Beyond architectural safeguards, business logic must also be protected. Human errors such as misconfigured daily limits can cause large‑scale incidents.

AMS monitors activity metrics every ten minutes to detect anomalies, but prevention is better: the system validates configuration rules before an activity goes live.

A configurable “configuration check” engine now enforces dozens of business rules, reducing manual mistakes.
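
A rule‑based check of this kind might look like the sketch below. The two rules and the configuration field names (`daily_limit`, `total_stock`, `start`, `end`) are hypothetical; the article does not list the actual AMS rules.

```python
from datetime import date


def check_daily_limit(cfg):
    if cfg["daily_limit"] <= 0:
        return "daily_limit must be positive"
    if cfg["daily_limit"] > cfg["total_stock"]:
        return "daily_limit exceeds total_stock"
    return None


def check_dates(cfg):
    if cfg["start"] >= cfg["end"]:
        return "start must be before end"
    return None


# New rules are added by appending a function to this list.
RULES = [check_daily_limit, check_dates]


def validate(cfg):
    """Run every rule and return the list of error messages (empty if valid)."""
    errors = []
    for rule in RULES:
        message = rule(cfg)
        if message:
            errors.append(message)
    return errors
```

An activity would only be allowed to go live when `validate` returns an empty list.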

Additionally, the platform enforces a verification step where the activity owner must successfully claim all gifts using a designated QQ account, ensuring the workflow has been exercised.

6. Service Degradation – Automatic Shielding of Non‑Core Failures

Non‑core services (e.g., reporting) are given low timeout thresholds; if they exceed the timeout they are bypassed, allowing the core flow to continue.
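
One way to sketch this bypass is with a thread pool and a short wait on the result. The helper names, the 50 ms timeout, and the pool size are assumptions. Note that the bypassed call keeps running in its background thread; it simply stops delaying the core flow.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)  # hypothetical pool for non-core calls


def call_non_core(func, timeout_s, default=None):
    """Run a non-core call, giving up on it if it is slow or raises."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # covers both the timeout and errors raised by func
        return default


def handle_request(deliver_gift, report):
    result = deliver_gift()                # core step: errors propagate
    call_non_core(report, timeout_s=0.05)  # non-core step: bypassed on trouble
    return result


def slow_report():
    time.sleep(0.3)  # simulates a reporting service that has become slow


print(handle_request(lambda: "gift delivered", slow_report))
```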

7. Service Decoupling and Physical Isolation

Splitting a monolithic service into many small, independently deployed services reduces coupling and limits the blast radius of failures.

Core services and their storage were gradually separated, growing from a few storage nodes to more than twenty independent deployments.

Physical isolation ensures that hardware failures in one node do not affect others.

“Light‑heavy separation” further isolates critical services (e.g., gift‑delivery) from less critical ones (e.g., queries), allowing independent scaling and failure isolation.

8. Summary

Both machines and people make mistakes; in large‑scale systems the probability of error becomes significant. Systems should be designed to tolerate machine failures automatically and to prevent human errors through programmatic checks, thereby reducing reliance on on‑call engineers.


