How Tencent’s AMS Achieved Fault Tolerance at Billion‑Request Scale
This article shares Tencent’s experience building fault‑tolerant mechanisms for the AMS activity platform, covering retry strategies, automatic machine exclusion, timeout tuning, service isolation, asynchronous processing, anti‑replay safeguards, and operational best practices that transformed a million‑request service into an 800‑million‑request system.
Introduction
More than three years ago the author was on‑call 24/7 for a Tencent activity‑operation system that suddenly faced a several‑fold traffic increase, causing numerous anomalies. The leader advised moving from a "firefighter" mindset to designing the system to automatically "extinguish" failures. The goal is to give the system inherent fault‑tolerance rather than relying on manual recovery.
The article focuses on practical fault‑tolerance techniques for engineers.
1. Retry Mechanism
The simplest fault‑tolerance method is to retry failed requests, but careless use can cause a "snowball" effect: every retry is an extra call, so retrying each failed request can as much as double the load on a downstream service that is already struggling.
1.1 Simple Retry
If a service fails, retry once. For a service whose success rate is normally 99.9% but drops to 95% during an anomaly, a single retry can raise the effective success rate to about 99.75% (1 − 0.05 × 0.05 = 0.9975, assuming the two attempts fail independently). However, if the service is truly down, the extra traffic may overwhelm it, especially when users repeatedly click a failing feature.
Simple retry should be applied only in appropriate scenarios; if the service success rate falls below a threshold, it may be better to skip retry to avoid traffic spikes.
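The arithmetic and the threshold rule above can be sketched as follows. This is a minimal illustration, not the AMS implementation: the class name GatedRetry, the 0.8 threshold, the window size, and min_samples are all assumptions chosen for the example.

```python
from collections import deque


def success_after_retries(p_success: float, retries: int = 1) -> float:
    """Chance that at least one of (1 + retries) independent attempts succeeds."""
    return 1.0 - (1.0 - p_success) ** (retries + 1)


class GatedRetry:
    """Retry once, but only while the recent first-attempt success rate looks healthy."""

    def __init__(self, min_success_rate=0.8, window=100, min_samples=10):
        self.min_success_rate = min_success_rate  # assumed threshold
        self.min_samples = min_samples            # avoid judging on too little data
        self.outcomes = deque(maxlen=window)      # recent first-attempt results

    def success_rate(self) -> float:
        if len(self.outcomes) < self.min_samples:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def call(self, func):
        try:
            result = func()
            self.outcomes.append(True)
            return result
        except Exception:
            self.outcomes.append(False)
            if self.success_rate() < self.min_success_rate:
                raise  # service looks unhealthy: fail fast, add no extra load
            return func()  # single retry; a second failure propagates
```

With these definitions, success_after_retries(0.95) evaluates to roughly 0.9975, matching the figure in the text. Once the recent success rate falls under the threshold, call() stops retrying, so a struggling service is not hit twice per request.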
1.2 Primary‑Backup Automatic Switch
Instead of retrying the same service, deploy two independent services (A and B). If A fails, the request is routed to B, so the retry traffic does not land on the service that just failed. Drawbacks include resource waste (the backup may sit idle), increased latency (a failed‑over request pays for the failed attempt on A plus the call to B, so it takes at least twice as long), and the risk that both services fail under extreme load.
In the AMS platform this approach is used sparingly because primary‑backup alone is not reliable enough.
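A minimal sketch of the primary‑backup switch described above. The function name and the callable‑based interface are assumptions for illustration; a real deployment would route over the network rather than call local functions.

```python
def call_with_failover(primary, backup, request):
    """Try the primary once; on any failure, send the same request to the backup."""
    try:
        return primary(request)
    except Exception:
        # The failover path costs the failed primary attempt plus the backup call.
        # If the backup also fails, its exception propagates to the caller.
        return backup(request)
```

The backup receives traffic only when the primary fails, which is exactly why it sits idle most of the time and why both can still go down together under extreme load.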
2. Dynamic Removal or Recovery of Abnormal Machines
AMS relies on hundreds of backend services, which are reached through stateless routing via the internal L5 router. Key practices:
Stateless routing eliminates single‑point failures.
Horizontal scaling adds machines during traffic spikes.
Automatic exclusion removes machines whose success rate falls below 50%; they are probed later and re‑added once healthy.
Example: a four‑machine pool (A, B, C, D) loses machine A; L5 removes A, serving traffic with B‑D. When A recovers, it is automatically reintegrated.
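The exclusion and recovery cycle in this example can be sketched as below. This is not L5's actual algorithm, which the article does not describe; the class MachinePool, the min_samples guard, and the probe callback are assumptions. Only the 50% threshold comes from the text.

```python
class MachinePool:
    """Tracks per-machine success rates and excludes unhealthy machines."""

    def __init__(self, machines, min_success_rate=0.5, min_samples=20):
        self.stats = {m: [0, 0] for m in machines}  # machine -> [successes, total]
        self.excluded = set()
        self.min_success_rate = min_success_rate    # 50% threshold from the text
        self.min_samples = min_samples              # assumed guard against noise

    def report(self, machine, ok):
        """Record one request outcome and exclude the machine if it is unhealthy."""
        s = self.stats[machine]
        s[0] += 1 if ok else 0
        s[1] += 1
        if s[1] >= self.min_samples and s[0] / s[1] < self.min_success_rate:
            self.excluded.add(machine)

    def active(self):
        """Machines currently eligible to receive traffic."""
        return [m for m in self.stats if m not in self.excluded]

    def probe(self, machine, is_healthy):
        """Re-admit an excluded machine once a health probe passes."""
        if machine in self.excluded and is_healthy(machine):
            self.excluded.discard(machine)
            self.stats[machine] = [0, 0]  # start with a clean record
```

After twenty failed reports for machine A, active() returns B, C, and D; a later successful probe puts A back in rotation with fresh statistics.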
3. Timeout Management
3.1 Setting Reasonable Timeouts
Choosing appropriate request timeouts is crucial. A service with 100 ms average processing time and a 5 s timeout wastes a worker for the full timeout when a request hangs, dramatically reducing throughput. Reducing the timeout to 500 ms improves throughput but may increase failure rates for longer‑running requests.
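A rough way to quantify this trade‑off, assuming each worker handles one request at a time and that some fraction of requests hang until the timeout fires. The 1% hang fraction used below is an illustrative assumption, not a figure from the article.

```python
def worker_throughput(avg_s: float, timeout_s: float, hang_fraction: float) -> float:
    """Requests per second for one blocking worker.

    Normal requests hold the worker for avg_s; hung requests hold it
    for the full timeout_s.
    """
    mean_hold_s = (1.0 - hang_fraction) * avg_s + hang_fraction * timeout_s
    return 1.0 / mean_hold_s


# 100 ms average processing time, 1% of requests hanging (assumed):
slow = worker_throughput(0.1, 5.0, 0.01)   # 5 s timeout
fast = worker_throughput(0.1, 0.5, 0.01)   # 500 ms timeout
```

Under these assumptions a worker completes about 6.7 requests per second with the 5 s timeout and about 9.6 with the 500 ms timeout, against 10 when nothing hangs. The shorter timeout recovers most of the lost capacity, at the cost of failing any legitimate request that needs more than 500 ms.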
3.2 Fast‑Slow Separation
Configure different timeouts per business need (e.g., 100 ms queries get 1 s timeout, 700 ms queries get 5 s). This avoids a one‑size‑fits‑all timeout that harms success rates.
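In configuration terms, fast‑slow separation amounts to a per‑class timeout table instead of a single global value. The class names and the fallback default below are hypothetical; the 1 s and 5 s values are the ones quoted above.

```python
# Timeout per request class, in seconds (class names are illustrative).
TIMEOUTS_S = {
    "fast_query": 1.0,  # queries that typically finish in ~100 ms
    "slow_query": 5.0,  # queries that typically finish in ~700 ms
}
DEFAULT_TIMEOUT_S = 1.0  # assumed fallback for unclassified requests


def timeout_for(request_class: str) -> float:
    """Look up the timeout for a request class, falling back to the default."""
    return TIMEOUTS_S.get(request_class, DEFAULT_TIMEOUT_S)
```

Defaulting unknown classes to the short timeout is a deliberate choice in this sketch: a new, unclassified call cannot tie up workers for long until someone explicitly grants it a larger budget.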
3.3 Solving Synchronous Blocking
Even with fast‑slow separation, long‑running services still block threads. Using I/O multiplexing and coroutine‑based asynchronous callbacks frees threads while waiting for I/O, keeping overall throughput high.
Coroutines let developers write asynchronous logic in a synchronous style, reducing complexity while avoiding thread blockage.
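As a small illustration using Python's standard asyncio library (the article does not say which coroutine framework AMS uses), three simulated I/O waits run concurrently on a single thread. The function names and the 0.2 s delay are assumptions for the demo.

```python
import asyncio
import time


async def fetch(name: str, delay_s: float) -> str:
    # asyncio.sleep stands in for network I/O; while it waits,
    # the event loop is free to run other coroutines.
    await asyncio.sleep(delay_s)
    return name


async def handle_all():
    # Reads like sequential code, yet the three waits overlap.
    return await asyncio.gather(
        fetch("a", 0.2), fetch("b", 0.2), fetch("c", 0.2)
    )


def run_demo():
    start = time.perf_counter()
    results = asyncio.run(handle_all())
    return results, time.perf_counter() - start
```

Calling run_demo() returns the three results in order, and the elapsed time is close to one delay rather than the sum of three, because no thread sits blocked while a request is in flight.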
3.4 Preventing Re‑entrancy and Duplicate Delivery
When a delivery request times out but later succeeds, users may click again, causing duplicate gifts. Solutions include:
Business‑level limits (one gift per user).
Order‑number mechanism to ensure one delivery per order.
Asynchronous delivery queues that acknowledge success to the user immediately and retry in the background.
Special anti‑abuse ("anti‑brush") logic that caps the number of retries after read timeouts and then treats the delivery as successful.
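The order‑number mechanism from the list above can be sketched as an idempotency check keyed on the order ID. The class name, the in‑memory dictionary, and the lock are stand‑ins for illustration; a production system would need a durable store with an atomic insert‑if‑absent operation.

```python
import threading


class GiftDelivery:
    """Delivers at most one gift per order number."""

    def __init__(self):
        self._delivered = {}           # order_id -> (user_id, gift); stand-in for a durable store
        self._lock = threading.Lock()  # makes check-and-record atomic within this process

    def deliver(self, order_id: str, user_id: str, gift: str) -> bool:
        """Return True if the gift was delivered now, False if the order was already served."""
        with self._lock:
            if order_id in self._delivered:
                return False  # duplicate click or retry: do nothing
            self._delivered[order_id] = (user_id, gift)
            return True

    def delivered_count(self) -> int:
        return len(self._delivered)
```

If the first request times out on the client side but succeeds on the server, a second click that reuses the same order number returns False and no extra gift is sent.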
4. Service Degradation (Non‑Core Bypass)
For a gift‑claim request that passes through many services, non‑critical services (e.g., analytics) are given short timeouts (e.g., 20 ms). If they exceed the timeout, they are bypassed, allowing the core flow to continue.
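A minimal sketch of bypassing a non‑core call, using asyncio.wait_for from the Python standard library. The 20 ms budget comes from the text; the function names and the simulated analytics call are assumptions.

```python
import asyncio


async def call_non_core(coro, timeout_s: float = 0.02, default=None):
    """Run a non-core call with a short budget; on timeout or error, return default."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except Exception:
        # Covers asyncio.TimeoutError and any failure inside the non-core call.
        return default


async def claim_gift(analytics_delay_s: float) -> dict:
    async def record_analytics():
        await asyncio.sleep(analytics_delay_s)  # simulated analytics service
        return "recorded"

    analytics = await call_non_core(record_analytics(), default="skipped")
    # The core flow continues whether or not analytics answered in time.
    return {"gift": "delivered", "analytics": analytics}
```

When the simulated analytics call takes longer than 20 ms, asyncio.run(claim_gift(0.5)) still reports the gift as delivered and marks analytics as skipped.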
5. Service Decoupling and Physical Isolation
Designing services to be small and independently deployable reduces coupling. Over three years AMS grew from a million‑request system to a platform handling 800 million requests, requiring extensive service splitting and isolation.
5.1 Service Splitting
Core services and storage were divided from a few monolithic components into dozens of independent services, improving stability and allowing independent scaling.
5.2 Light‑Heavy Separation and Multi‑Zone Deployment
Critical services (e.g., gift delivery) are deployed separately from lighter services (e.g., queries). Each cluster spans multiple data centers; if one center fails, the router removes its machines and continues serving from the remaining zones.
6. Business‑Level Fault Tolerance
Human errors such as mis‑configuring gift limits can cause large‑scale incidents. Automated configuration checks, validation scripts, and enforced test flows reduce reliance on manual verification.
A built‑in configuration‑checking system now runs dozens of rule sets, catching both simple errors (e.g., a mis‑set daily gift limit) and complex logical inconsistencies before deployment.
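A rule‑based check of this kind might look like the sketch below. The field names (daily_limit, total_limit, start_day, end_day) and the three rules are hypothetical examples; the article does not list the actual AMS rule sets.

```python
from datetime import date


def check_gift_config(cfg: dict) -> list:
    """Return a list of human-readable errors; an empty list means the config passed."""
    errors = []
    daily = cfg.get("daily_limit")
    total = cfg.get("total_limit")

    # Simple rule: limits must be positive integers.
    for name, value in (("daily_limit", daily), ("total_limit", total)):
        if not isinstance(value, int) or value <= 0:
            errors.append(f"{name} must be a positive integer")

    # Cross-field rule: the daily limit cannot exceed the overall limit.
    if not errors and daily > total:
        errors.append("daily_limit exceeds total_limit")

    # Logical rule: the activity must start on or before the day it ends.
    start, end = cfg.get("start_day"), cfg.get("end_day")
    if start is None or end is None or start > end:
        errors.append("start_day must be on or before end_day")
    return errors
```

Running such a check as a mandatory step before an activity goes live turns a manual review item into a gate that a mis‑typed limit cannot pass.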
7. Conclusion
Both machines and people make mistakes; with hundreds of servers or thousands of collaborators, errors become frequent. By building systems that tolerate machine failures and by automating processes that prevent human slip‑ups, teams can reduce on‑call fatigue and enjoy more stable, maintainable services.
21CTO
