How Tencent’s AMS Achieved Fault Tolerance at Billion‑Request Scale

This article shares Tencent’s experience building fault‑tolerant mechanisms for the AMS activity platform, covering retry strategies, automatic machine exclusion, timeout tuning, service isolation, asynchronous processing, anti‑replay safeguards, and operational best practices that transformed a million‑request service into an 800‑million‑request system.

21CTO

Apr 5, 2016

Introduction

More than three years ago the author was on call 24/7 for a Tencent activity-operation system whose traffic suddenly grew several-fold, causing numerous anomalies. The team lead advised moving away from a "firefighter" mindset and instead designing the system to "extinguish" failures automatically. The goal was to give the system inherent fault tolerance rather than to rely on manual recovery.

The article focuses on practical fault‑tolerance techniques for engineers.

1. Retry Mechanism

The simplest fault-tolerance method is to retry failed requests, but careless use can cause a "snowball" effect: every retry is extra traffic, so when a downstream service is failing most of its requests, one retry per failure can nearly double the load on it.

1.1 Simple Retry

If a call fails, retry it once. Suppose a service that normally succeeds 99.9% of the time drops to 95% during an anomaly: if failures are independent, a single retry raises the effective success rate to 1 − 0.05² ≈ 99.75%. However, if the service is truly down, the extra traffic may overwhelm it, especially when users repeatedly click a failing feature.

Simple retry should be applied only in appropriate scenarios; if the service success rate falls below a threshold, it may be better to skip retry to avoid traffic spikes.
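
The arithmetic and the "skip retry below a threshold" rule above can be sketched as follows. This is a minimal illustration, not AMS code: the function names, the `recent_success_rate` input, and the 0.5 cutoff are all assumptions made for the example.

```python
def effective_success_rate(p: float, retries: int = 1) -> float:
    """Chance that at least one of (1 + retries) attempts succeeds,
    assuming attempts fail independently with probability (1 - p)."""
    return 1 - (1 - p) ** (retries + 1)


def call_with_retry(call, recent_success_rate: float,
                    min_rate_for_retry: float = 0.5):
    """Call once; retry a single time only if the service looks mostly healthy.

    `recent_success_rate` is assumed to come from monitoring.
    `min_rate_for_retry` is a hypothetical threshold, not a value from AMS.
    """
    try:
        return call()
    except Exception:
        if recent_success_rate < min_rate_for_retry:
            # The service looks down: retrying would only add load.
            raise
        return call()  # one retry, no further attempts


# One retry at a 95% per-attempt success rate: 1 - 0.05**2
print(round(effective_success_rate(0.95), 4))
```

Gating the retry on an observed success rate keeps the benefit in the "slightly degraded" case while avoiding extra traffic when the service is already failing most requests.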

1.2 Primary‑Backup Automatic Switch

Instead of retrying the same service, deploy two independent services (A and B). If A fails, the request is routed to B, so no single service receives double load. Drawbacks include resource waste (the backup may sit idle), increased latency on the failover path (the request must wait for A to fail before B is called, so it can take roughly twice as long), and the risk that both services fail under extreme load.

In the AMS platform this approach is used sparingly because primary‑backup alone is not reliable enough.
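
A primary-backup switch can be sketched in a few lines. The names `primary` and `backup` are placeholders for two independent service clients; catching every `Exception` is a simplification for the example.

```python
def call_with_failover(primary, backup, request):
    """Try service A first; on any failure, send the same request to service B."""
    try:
        return primary(request)
    except Exception:
        # A failed, so the caller has already waited for that failure.
        # The request now pays for a second call, which is the latency
        # cost described above.
        return backup(request)
```

If B also raises, the exception reaches the caller, which corresponds to the case where both services fail.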

2. Dynamic Removal or Recovery of Abnormal Machines

AMS is backed by hundreds of services, with stateless routing handled by Tencent's internal L5 router. Key practices:

Stateless routing eliminates single‑point failures.

Horizontal scaling adds machines during traffic spikes.

Automatic exclusion removes machines whose success rate falls below 50%; they are probed later and re‑added once healthy.

Example: a four‑machine pool (A, B, C, D) loses machine A; L5 removes A, serving traffic with B‑D. When A recovers, it is automatically reintegrated.
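
The exclude-then-probe cycle in this example can be sketched as below. This is not how L5 is implemented, which the article does not describe; the class name, the sliding window of recent results, and the `is_healthy` probe callback are assumptions made for illustration. Only the 50% threshold comes from the text.

```python
from collections import deque


class MachinePool:
    """Tracks recent call results per machine and excludes unhealthy ones."""

    def __init__(self, machines, threshold=0.5, window=20):
        self.threshold = threshold      # minimum acceptable success rate
        self.window = window            # number of recent calls considered
        self.results = {m: deque(maxlen=window) for m in machines}
        self.excluded = set()

    def record(self, machine, ok):
        """Record one call result; exclude the machine if its rate drops too low."""
        history = self.results[machine]
        history.append(bool(ok))
        if len(history) == self.window:
            if sum(history) / self.window < self.threshold:
                self.excluded.add(machine)

    def active(self):
        """Machines currently eligible to receive traffic."""
        return [m for m in self.results if m not in self.excluded]

    def probe(self, machine, is_healthy):
        """Re-admit an excluded machine once a health probe passes."""
        if machine in self.excluded and is_healthy(machine):
            self.excluded.discard(machine)
            self.results[machine].clear()  # start with a fresh history


pool = MachinePool(["A", "B", "C", "D"], window=4)
for _ in range(4):
    pool.record("A", ok=False)       # machine A keeps failing
print(pool.active())                 # ['B', 'C', 'D']
pool.probe("A", lambda m: True)      # a later probe finds A healthy again
print(pool.active())                 # ['A', 'B', 'C', 'D']
```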

3. Timeout Management

3.1 Setting Reasonable Timeouts

Choosing appropriate request timeouts is crucial. A service with 100 ms average processing time and a 5 s timeout wastes a worker for the full timeout when a request hangs, dramatically reducing throughput. Reducing the timeout to 500 ms improves throughput but may increase failure rates for longer‑running requests.
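
A rough model shows how strongly the timeout affects throughput. It assumes a worker that handles requests one at a time, and a hypothetical 10% of requests hanging until the timeout fires; that fraction is chosen for illustration and does not come from AMS.

```python
def worker_throughput(avg_s: float, timeout_s: float, hang_fraction: float) -> float:
    """Requests per second for one serial worker.

    A fraction `hang_fraction` of requests occupy the worker for the full
    timeout; the rest take the normal average processing time.
    """
    mean_time = (1 - hang_fraction) * avg_s + hang_fraction * timeout_s
    return 1 / mean_time


healthy = worker_throughput(0.1, 5.0, 0.0)       # no hangs: 10 requests/s
long_timeout = worker_throughput(0.1, 5.0, 0.1)  # 5 s timeout: about 1.7 requests/s
short_timeout = worker_throughput(0.1, 0.5, 0.1) # 500 ms timeout: about 7.1 requests/s
print(healthy, long_timeout, short_timeout)
```

Under these assumptions the 5 s timeout costs more than 80% of the worker's capacity, while the 500 ms timeout costs under 30%.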

3.2 Fast‑Slow Separation

Configure different timeouts per business need (e.g., 100 ms queries get 1 s timeout, 700 ms queries get 5 s). This avoids a one‑size‑fits‑all timeout that harms success rates.
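
One simple way to express per-business timeouts is a lookup table keyed by request type. The keys and the default below are hypothetical; only the 1 s and 5 s values come from the example above.

```python
# Timeout in seconds per request type (keys are illustrative names).
TIMEOUTS_S = {
    "fast_query": 1.0,   # roughly 100 ms typical processing time
    "slow_query": 5.0,   # roughly 700 ms typical processing time
}
DEFAULT_TIMEOUT_S = 1.0  # assumed fallback for unlisted request types


def timeout_for(request_type: str) -> float:
    """Return the timeout configured for this request type."""
    return TIMEOUTS_S.get(request_type, DEFAULT_TIMEOUT_S)
```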

3.3 Solving Synchronous Blocking

Even with fast‑slow separation, long‑running services still block threads. Using I/O multiplexing and coroutine‑based asynchronous callbacks frees threads while waiting for I/O, keeping overall throughput high.

Coroutines let developers write asynchronous logic in a synchronous style, reducing complexity while avoiding thread blockage.
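
The effect can be illustrated with Python's `asyncio`, used here only as a stand-in since the article does not say which coroutine framework AMS uses. `asyncio.sleep` simulates waiting on a backend, and the backend names and delays are invented for the example.

```python
import asyncio
import time


async def call_backend(name: str, delay_s: float) -> str:
    # Stands in for a non-blocking network wait; the thread is free meanwhile.
    await asyncio.sleep(delay_s)
    return f"{name} done"


async def handle_all():
    # Three waits overlap on a single thread instead of running back to back.
    return await asyncio.gather(
        call_backend("query", 0.1),
        call_backend("delivery", 0.1),
        call_backend("profile", 0.1),
    )


def run_demo():
    start = time.monotonic()
    results = asyncio.run(handle_all())
    return results, time.monotonic() - start


results, elapsed = run_demo()
print(results, f"{elapsed:.2f}s")  # elapsed is close to 0.1 s, not 0.3 s
```

Each coroutine reads like sequential code, yet the waits overlap, so one thread serves all three calls in about the time of the slowest one.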

3.4 Preventing Re‑entrancy and Duplicate Delivery

When a delivery request times out but later succeeds, users may click again, causing duplicate gifts. Solutions include:

Business‑level limits (one gift per user).

Order‑number mechanism to ensure one delivery per order.

Asynchronous delivery queues that acknowledge success to the user immediately and retry in the background.

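Special anti-abuse ("anti-brush") logic that caps how many read-timeout retries are allowed; once the cap is reached, the earlier delivery is assumed to have succeeded and no further attempt is made.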
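
The order-number mechanism from the list above can be sketched as an idempotent delivery call. The class and method names are hypothetical, and an in-memory dict stands in for whatever durable store a real system would use (for instance a database table with a unique key on the order number).

```python
class GiftDelivery:
    """Delivers each order at most once, keyed by order number."""

    def __init__(self):
        self._delivered = {}  # order_id -> (user, gift)

    def deliver(self, order_id: str, user: str, gift: str) -> str:
        if order_id in self._delivered:
            # A retry after a timeout reuses the same order number,
            # so it is recognized here and no second gift is sent.
            return "duplicate"
        self._delivered[order_id] = (user, gift)
        return "delivered"


service = GiftDelivery()
print(service.deliver("order-1", "alice", "skin"))  # delivered
print(service.deliver("order-1", "alice", "skin"))  # duplicate
```

The key point is that the client generates the order number once per user action and reuses it on every retry; in a multi-process deployment the check and the write would also need to be atomic.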

4. Service Degradation (Non‑Core Bypass)

For a gift‑claim request that passes through many services, non‑critical services (e.g., analytics) are given short timeouts (e.g., 20 ms). If they exceed the timeout, they are bypassed, allowing the core flow to continue.
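
A bypass of this kind can be sketched with `asyncio.wait_for`. The analytics call is simulated with a sleep, and the function names and returned fields are invented for the example; only the 20 ms budget comes from the text.

```python
import asyncio


async def report_analytics(delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # simulated non-core service call
    return "reported"


async def claim_gift(analytics_delay_s: float,
                     analytics_timeout_s: float = 0.02) -> dict:
    """Core flow always completes; analytics is skipped if it is too slow."""
    try:
        analytics = await asyncio.wait_for(
            report_analytics(analytics_delay_s), analytics_timeout_s
        )
    except asyncio.TimeoutError:
        analytics = "skipped"  # bypass the non-core service
    return {"gift": "delivered", "analytics": analytics}


print(asyncio.run(claim_gift(0.0)))  # analytics answers in time
print(asyncio.run(claim_gift(0.2)))  # analytics too slow, bypassed
```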

5. Service Decoupling and Physical Isolation

Designing services to be small and independently deployable reduces coupling. Over three years AMS grew from a million‑request system to a platform handling 800 million requests, requiring extensive service splitting and isolation.

5.1 Service Splitting

Core services and storage were divided from a few monolithic components into dozens of independent services, improving stability and allowing independent scaling.

5.2 Light‑Heavy Separation and Multi‑Zone Deployment

Critical services (e.g., gift delivery) are deployed separately from lighter services (e.g., queries). Each cluster spans multiple data centers; if one center fails, the router removes its machines and continues serving from the remaining zones.

6. Business‑Level Fault Tolerance

Human errors such as mis‑configuring gift limits can cause large‑scale incidents. Automated configuration checks, validation scripts, and enforced test flows reduce reliance on manual verification.

A built‑in configuration‑checking system now runs dozens of rule sets, catching simple errors (e.g., daily gift limits) and complex logical inconsistencies before deployment.

7. Conclusion

Both machines and people make mistakes; with hundreds of servers or thousands of collaborators, errors become frequent. By building systems that tolerate machine failures and by automating processes that prevent human slip‑ups, teams can reduce on‑call fatigue and enjoy more stable, maintainable services.
